# Proposal: Scheduler Evolution

capOS should evolve its scheduler in layers. The goal is not one clever
algorithm; it is a capability-shaped CPU subsystem that scales ordinary work,
admits realtime islands, allows service/runtime-specific policy, and preserves
a small auditable kernel dispatch path.

This proposal complements, rather than replaces,
[Tickless and Realtime Scheduling](tickless-realtime-scheduling-proposal.md).
That proposal owns timer/tickless/SQPOLL-nohz details. This proposal owns the
broader scheduler architecture and roadmap.

## Design Grounding

Local grounding:

- [Scheduling](../architecture/scheduling.md)
- [SMP Phase C](../backlog/smp-phase-c.md)
- [SMP](smp-proposal.md)
- [Ring v2 For Full SMP](ring-v2-smp-proposal.md)
- [Tickless and Realtime Scheduling](tickless-realtime-scheduling-proposal.md)
- [Stateful Task and Job Graphs](stateful-task-job-graphs-proposal.md)
- [Future Scheduler Architecture](../research/future-scheduler-architecture.md)
- [NO_HZ, SQPOLL, and Realtime Scheduling](../research/nohz-sqpoll-realtime.md)
- [Out-of-kernel scheduling](../research/out-of-kernel-scheduling.md)
- [Completion rings and threaded runtimes](../research/completion-ring-threading.md)
- [Multimedia pipeline latency](../research/multimedia-pipeline-latency.md)
- [Robotics realtime control](../research/robotics-realtime-control.md)

## Goals

- Keep protected dispatch, budget enforcement, interrupt handling, and idle in
  the kernel.
- Replace the single global runnable queue with per-CPU runnable ownership and
  bounded cross-CPU wake/migration.
- Add CPU accounting before adopting policy that depends on runtime charge.
- Make ordinary best-effort scheduling fair by virtual time, with EEVDF-like
  virtual-deadline scheduling as the target after accounting exists.
- Represent admitted CPU time as `SchedulingContext` capability authority.
- Represent isolated CPU ownership as `CpuIsolationLease` authority.
- Support user-space scheduler policy services for admission and tuning
  without putting user-space calls on every dispatch path.
- Provide enough telemetry to distinguish scheduler cost, serial/MMIO logging,
  TLB/CR3 effects, QEMU/KVM artifacts, and workload contention.

## Non-Goals

- Do not import Linux CFS/EEVDF, FreeBSD ULE, or sched_ext as code.
- Do not expose arbitrary user-supplied scheduler programs in the kernel in
  the near term.
- Do not make a user-space process the mandatory next-thread dispatcher.
- Do not claim hard realtime until admission, budget enforcement, IRQ/device
  behavior, kernel-path latency, and WCET evidence exist.
- Do not make nohz/full-nohz a thread flag. It is a CPU lease plus scheduler
  proof.

## Architecture

The target scheduler has four layers:

- Kernel mechanism: per-CPU run queues, current-thread state, idle, context
  switch, cross-CPU wake/migration, timer/IPI handling, CPU accounting, budget
  enforcement, and timeout/depletion faults.
- Kernel policy primitives: best-effort weights, virtual deadlines,
  scheduling contexts, CPU masks, isolation leases, direct IPC donation, and
  realtime-island hooks.
- Privileged scheduler policy service: admission, budget/profile selection,
  CPU partitioning, isolation grants, service/runtime hints, policy reload,
  and operator diagnostics.
- Application/runtime schedulers: work stealing, actors, async reactors,
  language M:N schedulers, request queues, and service-local priority and
  batching.

The hot path remains local and bounded: timer interrupt or wakeup, charge
runtime, update runnable state, pick from a per-CPU queue or a bounded steal
path, switch context. User-space policy participates at slower boundaries:
profile changes, thread/process creation, budget depletion, realtime admission,
lease grant/revoke, or explicit operator policy updates.
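
The bounded hot path described above can be sketched in a few lines. This is an illustrative shape, not the capOS implementation: `CpuRunQueue`, `pick_next`, and the `steal_budget` parameter are hypothetical names, and the real kernel would operate on per-CPU structures rather than a slice of peers.

```rust
// Sketch of the bounded per-CPU pick path (illustrative names, not capOS ABI).
// The hot path consults only local state plus a bounded steal attempt; it
// never calls into a user-space policy service.

use std::collections::VecDeque;

type ThreadId = u64;

pub struct CpuRunQueue {
    runnable: VecDeque<ThreadId>,
}

impl CpuRunQueue {
    pub fn new() -> Self {
        Self { runnable: VecDeque::new() }
    }

    pub fn enqueue(&mut self, t: ThreadId) {
        self.runnable.push_back(t);
    }

    /// Pick locally first; otherwise probe at most `steal_budget` peer
    /// queues. Returning `None` means "go idle" -- user-space policy is
    /// never consulted on this path.
    pub fn pick_next(
        &mut self,
        peers: &mut [CpuRunQueue],
        steal_budget: usize,
    ) -> Option<ThreadId> {
        if let Some(t) = self.runnable.pop_front() {
            return Some(t);
        }
        for peer in peers.iter_mut().take(steal_budget) {
            if let Some(t) = peer.runnable.pop_back() {
                return Some(t); // ownership transfers to this CPU
            }
        }
        None // idle: no local work and the bounded steal found nothing
    }
}
```

The key property is that the cost of a pick is bounded by `steal_budget`, independent of total thread count.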

Stateful task/job graph coordinators sit above these layers. They may own
graph node queues, leases, retry state, cancellation, and assignment metadata,
but they do not own CPU dispatch. A graph node's `priority`, `deadline`,
`budget`, or `queue` field is workload policy until a capability-authorized
scheduler policy service maps it to a weight, scheduling context, CPU lease,
or request deadline.

## Stage 0: Evidence Before Policy

Before changing the default policy, finish the active thread-scale attribution
work:

- first-pass scheduler candidate/outcome, reschedule-IPI, serial-byte, and
  scheduler-lock counters, now present behind
  `CAPOS_THREAD_SCALE_GUEST_MEASURE=1` and confirmed on `capos-bench`;
- timer interrupt counters per case, now present in the same guest-measure path;
- CR3/TLB event counts per case, now present in the same guest-measure path;
- guest-symbol or guest-PC samples;
- cacheline/workload A/B baseline;
- logging-suppression A/B evidence for per-context-switch diagnostic output.

This protects the design from treating QEMU/KVM, serial MMIO, or benchmark
cache contention as a scheduler algorithm problem.

## Stage 1: Per-CPU Runnable Ownership

Split the scheduler's runnable state first. The accepted initial shape has
per-CPU run queues with a runnable `ThreadRef` deque or priority buckets,
current-thread state, a local reschedule flag, and local counters. Shared
scheduler state keeps process/thread metadata, sleeping/deadline waiters,
blocked waiters, migration records, and the global policy epoch.

Rules:

- A runnable `ThreadRef` is owned by exactly one CPU queue at a time.
- Cross-CPU wake enqueues to the target CPU or a policy-selected CPU and sends
  a bounded reschedule IPI when needed.
- Migration removes from one owner before publishing to another.
- Idle CPUs steal only through bounded policy, not by scanning every process.
- Process exit and thread exit keep cleanup bounded and must not allocate in
  interrupt, cancellation, or emergency paths.

This stage may still use round-robin within each CPU queue. The objective is
SMP structure and evidence, not perfect fairness.
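
The single-owner and migration rules above can be made concrete with a minimal sketch. The types here are illustrative, not the capOS `ThreadRef` machinery; the point is only the invariant that migration removes from the source queue before publishing to the target.

```rust
// Sketch of the single-owner invariant for runnable threads: a ThreadId
// lives in at most one CPU queue, and migration removes it from the
// source before publishing to the target (illustrative, not capOS types).

use std::collections::VecDeque;

type ThreadId = u64;

pub struct Scheduler {
    queues: Vec<VecDeque<ThreadId>>, // one runnable queue per CPU
}

impl Scheduler {
    pub fn new(cpus: usize) -> Self {
        Self { queues: vec![VecDeque::new(); cpus] }
    }

    pub fn wake_on(&mut self, cpu: usize, t: ThreadId) {
        debug_assert!(self.owner_of(t).is_none(), "runnable thread already owned");
        self.queues[cpu].push_back(t);
        // A real kernel would send a bounded reschedule IPI here if needed.
    }

    /// Remove from the source queue first, then publish to the target,
    /// so the thread is never visible in two queues at once.
    pub fn migrate(&mut self, t: ThreadId, from: usize, to: usize) -> bool {
        if let Some(pos) = self.queues[from].iter().position(|&x| x == t) {
            self.queues[from].remove(pos);
            self.queues[to].push_back(t);
            return true;
        }
        false
    }

    pub fn owner_of(&self, t: ThreadId) -> Option<usize> {
        self.queues.iter().position(|q| q.contains(&t))
    }
}
```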

## Stage 2: CPU Accounting

Add a monotonic runtime charge model. `ThreadCpuAccount` records runtime,
last-start time, virtual runtime, context switches, preemptions, and voluntary
blocks. `SchedEntity` records weight, latency class, eligible time, and virtual
deadline.

Accounting must be stable enough to support fair scheduling, quotas, and
future scheduling contexts. It must account for context switches, blocking
syscalls, endpoint direct handoff, timer preemption, thread exit, and idle.

Where exact cycle attribution is not yet credible, the implementation should
label the metric as diagnostic rather than enforcing policy from it.
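
A minimal sketch of the charge model, assuming the `runtime`, `last-start time`, and context-switch fields described above (the exact `ThreadCpuAccount` layout is undecided): charge elapsed time exactly once when the thread leaves the CPU, and accrue nothing while blocked.

```rust
// Sketch of monotonic runtime charging. Field names follow the
// ThreadCpuAccount description in the proposal; the layout is illustrative.

#[derive(Default)]
pub struct ThreadCpuAccount {
    pub runtime_ns: u64,     // total charged runtime, monotonic
    pub last_start_ns: u64,  // timestamp when the thread last went on-CPU
    pub running: bool,
    pub context_switches: u64,
}

impl ThreadCpuAccount {
    /// Called when the thread is dispatched onto a CPU.
    pub fn start(&mut self, now_ns: u64) {
        self.last_start_ns = now_ns;
        self.running = true;
    }

    /// Called when the thread leaves the CPU (preemption, blocking syscall,
    /// direct handoff, or exit). Charges the elapsed slice exactly once.
    pub fn stop(&mut self, now_ns: u64) {
        if self.running {
            self.runtime_ns += now_ns.saturating_sub(self.last_start_ns);
            self.context_switches += 1;
            self.running = false;
        }
    }
}
```

This shape directly supports the Validation Gates sanity test: runtime increases monotonically while running and stops accruing while blocked.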

## Stage 3: Best-Effort Fair Policy

Once per-CPU queues and accounting exist, capOS should move ordinary
best-effort scheduling from FIFO round-robin toward virtual-time fairness.

The target policy is EEVDF-like:

- runnable entities accrue lag against their fair share;
- eligible entities are ordered by virtual deadline;
- weights affect virtual runtime/deadline progression;
- latency-sensitive best-effort entities can request smaller slices within
  policy limits;
- migration preserves accounting so moving CPUs does not reset fairness.

The initial implementation may use a simpler weighted fair queue if that gets
runtime accounting and migration correct sooner. The proposal should not
freeze an exact tree/heap representation until the per-CPU queue and accounting
measurements exist.
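
The eligibility-plus-virtual-deadline rule can be sketched without committing to a representation. Everything here is illustrative: the `WEIGHT_BASE` constant, the linear scan in `pick`, and the field names are assumptions for the sketch, not the frozen ABI.

```rust
// Sketch of EEVDF-style pick: among eligible entities, choose the earliest
// virtual deadline. Weights scale how fast virtual time advances.
// Constants and shapes are illustrative, not a frozen representation.

#[derive(Clone)]
pub struct SchedEntity {
    pub weight: u64,    // relative share
    pub vruntime: u64,  // virtual runtime consumed so far
    pub vdeadline: u64, // vruntime + weight-scaled slice
}

const WEIGHT_BASE: u64 = 1024;

impl SchedEntity {
    pub fn new(weight: u64, slice_ns: u64) -> Self {
        let mut e = Self { weight, vruntime: 0, vdeadline: 0 };
        e.vdeadline = e.vruntime + slice_ns * WEIGHT_BASE / weight;
        e
    }

    /// Charge a slice of real runtime: heavier weights advance virtual
    /// time more slowly, so they earn deadlines sooner on average.
    pub fn charge(&mut self, ran_ns: u64, slice_ns: u64) {
        self.vruntime += ran_ns * WEIGHT_BASE / self.weight;
        self.vdeadline = self.vruntime + slice_ns * WEIGHT_BASE / self.weight;
    }
}

/// Eligible = virtual runtime not ahead of the queue's virtual time.
pub fn pick(queue_vtime: u64, entities: &[SchedEntity]) -> Option<usize> {
    entities
        .iter()
        .enumerate()
        .filter(|(_, e)| e.vruntime <= queue_vtime)
        .min_by_key(|(_, e)| e.vdeadline)
        .map(|(i, _)| i)
}
```

Note that `charge` preserves accounting in the entity itself, which is what lets migration move an entity between CPUs without resetting fairness.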

## Stage 4: Scheduling Contexts

CPU-time authority becomes a capability. `SchedulingContext` records budget,
period, relative deadline, priority or criticality, CPU mask, remaining
budget, replenishment state, timeout endpoint, and overrun policy.

Kernel responsibilities:

- enforce budget and replenishment;
- prevent a thread without eligible CPU authority from running;
- charge runtime to exactly one authority target;
- donate and return scheduling contexts across synchronous endpoint paths;
- post timeout/depletion notifications without allocating in emergency paths.

Policy-service responsibilities:

- admit or reject scheduling contexts;
- choose budget/period/priority;
- bind contexts to threads/services;
- revoke or adjust contexts safely;
- record operator-visible decisions.

`SQE.deadline_ns` remains request metadata. It may influence drop, freshness,
propagation, and telemetry, but it does not grant CPU budget.
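
The kernel-side budget contract can be sketched as charge, deplete, replenish. The struct and method names are assumptions for illustration; the real record also carries the CPU mask, timeout endpoint, and overrun policy listed above.

```rust
// Sketch of kernel-side budget enforcement for a scheduling context:
// charge runtime against remaining budget, report depletion, refill at
// period boundaries. Illustrative names; the ABI is undecided.

pub struct SchedulingContext {
    pub budget_ns: u64, // full budget per period
    pub period_ns: u64,
    pub remaining_ns: u64,
    pub next_replenish_ns: u64,
}

pub enum ChargeOutcome {
    Ok,
    Depleted, // kernel posts a preallocated timeout/depletion notification
}

impl SchedulingContext {
    pub fn new(budget_ns: u64, period_ns: u64, now_ns: u64) -> Self {
        Self {
            budget_ns,
            period_ns,
            remaining_ns: budget_ns,
            next_replenish_ns: now_ns + period_ns,
        }
    }

    pub fn charge(&mut self, ran_ns: u64) -> ChargeOutcome {
        self.remaining_ns = self.remaining_ns.saturating_sub(ran_ns);
        if self.remaining_ns == 0 {
            ChargeOutcome::Depleted
        } else {
            ChargeOutcome::Ok
        }
    }

    /// Refill at most once per elapsed period; a depleted context is not
    /// eligible to run until this fires.
    pub fn maybe_replenish(&mut self, now_ns: u64) {
        if now_ns >= self.next_replenish_ns {
            self.remaining_ns = self.budget_ns;
            self.next_replenish_ns = now_ns + self.period_ns;
        }
    }

    pub fn eligible(&self) -> bool {
        self.remaining_ns > 0
    }
}
```

A thread bound to a depleted context fails the `eligible` check and cannot run, which is the "no eligible CPU authority" rule in kernel terms.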

## Stage 5: CPU Isolation Leases and SQPOLL

`CpuIsolationLease` grants placement and exclusivity, not CPU time. It records
the owner process/session/service, CPU set, mode, housekeeping exclusions,
accounting target, maximum revocation latency, and revoke endpoint.

Activation requires scheduler proof:

- at least one housekeeping CPU remains online;
- unrelated timers, deferred cleanup, network polling, and debug/watchdog work
  are not pinned to the isolated CPU;
- the active ring has exactly one SQ consumer;
- the accounting target is valid and chargeable;
- revocation latency fits the lease policy.
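
The proof obligations above are checkable predicates, which suggests a fail-closed activation function. This is a sketch with hypothetical field names; each check mirrors one bullet, and a real implementation would gather the inputs from kernel state rather than a request struct.

```rust
// Sketch of the fail-closed activation check for a CpuIsolationLease.
// Each predicate mirrors one proof bullet; names are illustrative.

pub struct LeaseRequest {
    pub housekeeping_cpus_online: usize,
    pub stray_work_on_isolated_cpu: bool, // timers, cleanup, polling, watchdog
    pub sq_consumers_on_ring: usize,
    pub accounting_target_valid: bool,
    pub revoke_latency_ns: u64,
    pub max_revoke_latency_ns: u64,
}

pub fn activate(req: &LeaseRequest) -> Result<(), &'static str> {
    if req.housekeeping_cpus_online == 0 {
        return Err("no housekeeping CPU would remain online");
    }
    if req.stray_work_on_isolated_cpu {
        return Err("unrelated work pinned to the isolated CPU");
    }
    if req.sq_consumers_on_ring != 1 {
        return Err("ring must have exactly one SQ consumer");
    }
    if !req.accounting_target_valid {
        return Err("accounting target invalid or not chargeable");
    }
    if req.revoke_latency_ns > req.max_revoke_latency_ns {
        return Err("revocation latency exceeds lease policy");
    }
    Ok(())
}
```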

SQPOLL uses the ring-mode contract in
[Tickless and Realtime Scheduling](tickless-realtime-scheduling-proposal.md).
The scheduler proposal adds the CPU-ownership and policy-service side of that
contract.

## Stage 6: Realtime Islands

A `RealtimeIsland` is an admitted graph, not a single priority. It records
scheduling contexts, memory reservations, device and IRQ reservations,
rings/endpoints/notifications, any CPU isolation leases, admission evidence,
and overrun/shutdown policy.

Use cases include local audio, realtime voice, robotics control, and selected
provider/runtime loops. Admission must fail closed if the graph cannot fit
the declared period/quantum and reservations.
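
As a minimal sketch of fail-closed admission, a utilization bound over the island's scheduling contexts: reject unless the summed budget/period demand fits the declared capacity. This is one necessary check, not the whole admission story; real admission would also verify memory, IRQ/device reservations, and kernel-path latency evidence, and the permille representation is an assumption for the sketch.

```rust
// Sketch of fail-closed admission for a realtime island: the summed CPU
// utilization of its scheduling contexts must fit the declared capacity.
// One necessary check among several; shapes are illustrative.

#[derive(Clone, Copy)]
pub struct ContextSpec {
    pub budget_ns: u64,
    pub period_ns: u64,
}

/// Admit only if sum(budget/period) <= capacity, in integer math so
/// rounding never overstates headroom. `capacity_permille` is capacity
/// in parts per thousand (e.g. 800 reserves 20% for housekeeping).
pub fn admit(contexts: &[ContextSpec], capacity_permille: u64) -> bool {
    let mut total_permille: u64 = 0;
    for c in contexts {
        if c.period_ns == 0 || c.budget_ns > c.period_ns {
            return false; // malformed spec: fail closed
        }
        // Round each demand up so admission never understates it.
        total_permille += (c.budget_ns * 1000 + c.period_ns - 1) / c.period_ns;
    }
    total_permille <= capacity_permille
}
```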

## Stage 7: User-Space Scheduler Policy

After kernel primitives are in place, a privileged scheduler policy service can
own:

- default resource profiles;
- session/account/service CPU policy;
- scheduling-context admission;
- CPU lease grant/revoke;
- runtime hints such as latency-sensitive, batch, driver, poller, or agent;
- operator-facing diagnostics and policy reload.

The kernel still owns emergency fallback. If the policy service is dead,
blocked, stale, or malicious, the kernel must continue to enforce safety,
revoke leases as policy permits, and schedule a minimal recovery path.

## Validation Gates

- Per-CPU queue work must preserve `run-smoke`, `run-spawn`,
  `run-thread-scale`, park/ring/process-exit smokes, and SMP smokes.
- A thread-scale milestone closeout must include repeated controlled
  `capos-bench` evidence and raw logs.
- CPU accounting must include sanity tests that measured runtime increases
  monotonically while a thread runs and stops while it is blocked.
- Fair policy changes must include adversarial tests: CPU hogs, short
  sleepers, direct IPC handoff, multi-process load, and same-process sibling
  load.
- Scheduling-context work must include admission rejection, budget depletion,
  replenishment, endpoint donation/return, timeout notification, and stale cap
  revocation tests.
- CPU leases must include revocation, process exit, session close, and
  housekeeping fallback tests.
- Realtime island proofs must show preallocation, no allocation/blocking on
  admitted paths, deadline miss telemetry, and fail-closed overrun behavior.

## Open Decisions

- Whether the first best-effort fair policy should be weighted fair queueing or
  direct EEVDF.
- Whether scheduling-context priority is a scalar, a criticality band, or both.
- Whether `SchedulingContext` should be bindable to a process default,
  individual thread, endpoint call path, or all three in the first ABI.
- Which scheduler telemetry is permanent ABI and which is benchmark-only.
- How much policy-service state belongs in the boot manifest versus mutable
  operator configuration.
