Proposal: Scheduler Evolution
capOS should evolve its scheduler in layers. The goal is not one clever algorithm; it is a capability-shaped CPU subsystem that scales ordinary work, admits realtime islands, allows service/runtime-specific policy, and preserves a small auditable kernel dispatch path.
This proposal complements, rather than replaces, Tickless and Realtime Scheduling. That proposal owns timer/tickless/SQPOLL-nohz details. This proposal owns the broader scheduler architecture and roadmap.
Design Grounding
Local grounding:
- Scheduling
- In-Process Threading Contract
- Design Risks Register, Q9 – CPU accounting and scheduling contexts
- SMP Phase C
- SMP
- Ring v2 For Full SMP
- Tickless and Realtime Scheduling
- Stateful Task and Job Graphs
- Future Scheduler Architecture
- NO_HZ, SQPOLL, and Realtime Scheduling
- Out-of-kernel scheduling
- Completion rings and threaded runtimes
- Multimedia pipeline latency
- Robotics realtime control
Goals
- Keep protected dispatch, budget enforcement, interrupt handling, and idle in the kernel.
- Replace the single global runnable queue with per-CPU runnable ownership and bounded cross-CPU wake/migration.
- Add CPU accounting before adopting policy that depends on runtime charge.
- Make ordinary best-effort scheduling fair by virtual time, with EEVDF-like virtual-deadline scheduling as the target after accounting exists.
- Represent admitted CPU time as SchedulingContext capability authority.
- Represent isolated CPU ownership as CpuIsolationLease authority.
- Support user-space scheduler policy services for admission and tuning without putting user-space calls on every dispatch path.
- Provide enough telemetry to distinguish scheduler cost, serial/MMIO logging, TLB/CR3 effects, QEMU/KVM artifacts, and workload contention.
Full-SMP Scalability Focus
The scheduler work after the current Phase F chain should be judged by whether capOS can keep useful throughput and bounded scheduling overhead on 16/32-core machines, not by another small QEMU-only speedup row. The SMP proposal owns CPU bring-up and APIC/TLB substrate; this proposal owns the scheduler changes needed to make that substrate useful at higher core counts.
The scheduler side of the milestone should include:
- dynamic scheduler CPU sets derived from discovered topology instead of the temporary four-owner mask;
- per-CPU run queues and current-thread state that do not require one shared lock for ordinary local pick/requeue paths;
- narrower shared metadata locks for process/thread lookup, blocking waiters, exit cleanup, direct IPC handoff, and timer/deadline waiters;
- bounded cross-CPU wakeup and migration that records target, source, steal, reschedule-IPI, and failed-placement counters;
- topology-aware placement that separates physical cores, SMT siblings, and later NUMA/cache groups;
- total-time accounting for spawn/join/exit and service-bound workloads, not only syscall-free work windows;
- hardware-run artifacts that include native Linux baselines on the same machine and QEMU rows only as regression or virtualization context.
The benchmark shape should include static map/reduce, uneven dynamic tasks, barrier-heavy phase loops, independent processes, same-process threads, and a capability-call/service-bound workload. That matrix is intentionally broader than the old thread-scale checksum row because high core counts often expose lock convoying, wakeup storms, timer/IPI cost, TLB-shootdown scaling, and runtime lifecycle overhead before pure compute saturates.
Non-Goals
- Do not import Linux CFS/EEVDF, FreeBSD ULE, or sched_ext as code.
- Do not expose arbitrary user-supplied scheduler programs in the kernel in the near term.
- Do not make a user-space process the mandatory next-thread dispatcher.
- Do not claim hard realtime until admission, budget enforcement, IRQ/device behavior, kernel-path latency, and WCET evidence exist.
- Do not make nohz/full-nohz a thread flag. It is a CPU lease plus scheduler proof.
Architecture
The target scheduler has four layers:
- Kernel mechanism: per-CPU run queues, current-thread state, idle, context switch, cross-CPU wake/migration, timer/IPI handling, CPU accounting, budget enforcement, and timeout/depletion faults.
- Kernel policy primitives: best-effort weights, virtual deadlines, scheduling contexts, CPU masks, isolation leases, direct IPC donation, and realtime-island hooks.
- Privileged scheduler policy service: admission, budget/profile selection, CPU partitioning, isolation grants, service/runtime hints, policy reload, and operator diagnostics.
- Application/runtime schedulers: work stealing, actors, async reactors, language M:N schedulers, request queues, and service-local priority and batching.
The hot path remains local and bounded: timer interrupt or wakeup, charge runtime, update runnable state, pick from a per-CPU queue or a bounded steal path, switch context. User-space policy participates at slower boundaries: profile changes, thread/process creation, budget depletion, realtime admission, lease grant/revoke, or explicit operator policy updates.
Stateful task/job graph coordinators sit above these layers. They may own
graph node queues, leases, retry state, cancellation, and assignment metadata,
but they do not own CPU dispatch. A graph node’s priority, deadline,
budget, or queue field is workload policy until a capability-authorized
scheduler policy service maps it to a weight, scheduling context, CPU lease,
or request deadline.
Stage 0: Evidence Before Policy
Before changing the default policy, the active thread-scale attribution work must keep policy conclusions separated from benchmark artifacts. Current mainline evidence now includes:
- scheduler candidate/outcome, reschedule-IPI, serial-byte, scheduler-lock, timer interrupt, and CR3/TLB counters behind CAPOS_THREAD_SCALE_GUEST_MEASURE=1;
- raw guest-PC samples for user-mode timer preemption points;
- logging-suppression A/B evidence through CAPOS_THREAD_SCALE_SUPPRESS_SWITCH_LOGS=1;
- exact native Linux pthread baseline evidence, including compact-versus-padded result-slot diagnostics;
- larger-workload/Amdahl evidence through CAPOS_THREAD_SCALE_TOTAL_BLOCKS and LINUX_THREAD_SCALE_TOTAL_BLOCKS.
This evidence does not prove the primary remaining cause of non-scaling. Per-CPU runnable ownership, accepted work/total speedup thresholds, and optional symbolic guest attribution remain follow-on work before a scheduler policy claim.
This protects the design from treating QEMU/KVM, serial MMIO, or benchmark cache contention as a scheduler algorithm problem.
Stage 1: Per-CPU Runnable Ownership
Split the scheduler’s runnable state first. The accepted initial shape has
per-CPU run queues with a runnable ThreadRef deque or priority buckets,
current-thread state, a local reschedule flag, and local counters. Shared
scheduler state keeps process/thread metadata, sleeping/deadline waiters,
blocked waiters, migration records, and the global policy epoch.
Rules:
- A runnable ThreadRef is owned by exactly one CPU queue at a time.
- Cross-CPU wake enqueues to the target CPU or a policy-selected CPU and sends a bounded reschedule IPI when needed.
- Migration removes from one owner before publishing to another.
- Idle CPUs steal only through bounded policy, not by scanning every process.
- Process exit and thread exit keep cleanup bounded and must not allocate in interrupt, cancellation, or emergency paths.
This stage may still use round-robin within each CPU queue. The objective is SMP structure and evidence, not perfect fairness.
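The ownership rules above can be sketched in a few lines. This is a minimal illustration, not the capOS implementation: it uses std collections for brevity (the kernel is no_std), and the `RunQueues` type, its method names, and the `max_probe` steal bound are hypothetical names chosen for this example.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the kernel's ThreadRef.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef(pub u32);

/// Per-CPU runnable queues with a fixed, pre-reserved per-queue capacity.
pub struct RunQueues {
    queues: Vec<VecDeque<ThreadRef>>,
    capacity: usize,
}

impl RunQueues {
    pub fn new(cpus: usize, capacity: usize) -> Self {
        let queues = (0..cpus).map(|_| VecDeque::with_capacity(capacity)).collect();
        Self { queues, capacity }
    }

    /// Finds (cpu, index) of a runnable thread, if any queue owns it.
    fn locate(&self, t: ThreadRef) -> Option<(usize, usize)> {
        self.queues.iter().enumerate().find_map(|(cpu, q)| {
            q.iter().position(|x| *x == t).map(|idx| (cpu, idx))
        })
    }

    /// The single CPU that currently owns `t`, if it is runnable.
    pub fn owner(&self, t: ThreadRef) -> Option<usize> {
        self.locate(t).map(|(cpu, _)| cpu)
    }

    /// Publishes `t` on `cpu`. Refuses a second owner or a full queue, so
    /// the reserved capacity is never exceeded.
    pub fn enqueue(&mut self, cpu: usize, t: ThreadRef) -> bool {
        if self.owner(t).is_some() || self.queues[cpu].len() >= self.capacity {
            return false;
        }
        self.queues[cpu].push_back(t);
        true
    }

    /// Migration: remove from the source owner before publishing on `to`.
    /// Destination capacity is checked first, so a failed call changes nothing.
    pub fn migrate(&mut self, t: ThreadRef, to: usize) -> bool {
        let Some((from, idx)) = self.locate(t) else { return false };
        if from == to {
            return true;
        }
        if self.queues[to].len() >= self.capacity {
            return false;
        }
        self.queues[from].remove(idx);
        self.queues[to].push_back(t);
        true
    }

    /// Bounded steal: an idle CPU probes at most `max_probe` sibling queues
    /// instead of scanning every process.
    pub fn steal(&mut self, idle_cpu: usize, max_probe: usize) -> Option<ThreadRef> {
        let n = self.queues.len();
        if self.queues[idle_cpu].len() >= self.capacity {
            return None;
        }
        for step in 1..=max_probe.min(n.saturating_sub(1)) {
            let src = (idle_cpu + step) % n;
            if let Some(t) = self.queues[src].pop_front() {
                self.queues[idle_cpu].push_back(t);
                return Some(t);
            }
        }
        None
    }
}
```

In this sketch, a second `enqueue` of the same thread on another CPU returns false, so the single-owner rule is enforced at the publication point rather than by convention.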
First implementation evidence exists as commit 1a8bf909: capOS introduced
four bounded per-scheduler-CPU FIFO runnable queues under the existing
global scheduler lock. That slice proved the basic ownership structure and
bounded steal path. Follow-up review fixes reserved per-CPU queue capacity
before a thread became runnable, using a live reservation count released on
process/thread exit or pre-publication rollback, so timer and unblock
requeues did not allocate after work moved between CPUs. Update 2026-05-02:
the per-CPU queues were collapsed back into a single global runnable queue
under the same scheduler lock with the per-CPU run-queue-collapse cleanup
slice (see docs/backlog/scheduler-evolution.md and
docs/architecture/scheduling.md). Update 2026-05-07 23:45 UTC: Phase D
Task 3 reintroduced the per-CPU runnable queues, this time ordered
ascending by virtual_finish_ns (Weighted Fair Queueing) and balanced by
a bounded steal path that picks the most-overdue sibling Runnable
candidate (each sibling queue’s first entry the destination CPU
considers Runnable; ties broken by lower CPU id). The queue ownership
and migration contract is documented in the scheduling architecture
page. This does not close the stage: the scheduler still
needs stronger cross-CPU wake counters, further separation from shared
process/thread metadata, replacement of temporary pinning policy, and
accepted benchmark evidence before policy conclusions should change.
Stage 2: CPU Accounting
Add a monotonic runtime charge model. ThreadCpuAccount records runtime,
last-start time, virtual runtime, context switches, preemptions, and voluntary
blocks. SchedEntity records weight, latency class, eligible time, and virtual
deadline.
Accounting must be stable enough to support fair scheduling, quotas, and future scheduling contexts. It must account context switches, blocking syscalls, endpoint direct handoff, timer preemption, thread exit, and idle.
Where exact cycle attribution is not yet credible, the implementation should label the metric as diagnostic rather than enforcing policy from it.
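A minimal sketch of the monotonic charge model follows. The field set is taken from the description above; the `on_dispatch`/`on_stop` method names and the `StopReason` enum are hypothetical, and virtual runtime is left out here because its weighting rule is a policy decision made in a later stage.

```rust
/// Hypothetical per-thread ledger using the fields named in the text.
#[derive(Debug, Default)]
pub struct ThreadCpuAccount {
    pub runtime_ns: u64,
    pub last_start_ns: Option<u64>,
    pub context_switches: u64,
    pub preemptions: u64,
    pub voluntary_blocks: u64,
}

/// Why the thread stopped running (assumed classification).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum StopReason {
    Preempted,
    Blocked,
    Exited,
}

impl ThreadCpuAccount {
    /// Called when the thread is switched in.
    pub fn on_dispatch(&mut self, now_ns: u64) {
        self.last_start_ns = Some(now_ns);
        self.context_switches += 1;
    }

    /// Called when the thread is switched out. Returns the charged time.
    /// saturating_sub keeps the ledger monotonic even if a clock reading
    /// appears to go backwards; a stop without a matching start charges 0.
    pub fn on_stop(&mut self, now_ns: u64, reason: StopReason) -> u64 {
        let Some(start) = self.last_start_ns.take() else { return 0 };
        let elapsed = now_ns.saturating_sub(start);
        self.runtime_ns = self.runtime_ns.saturating_add(elapsed);
        match reason {
            StopReason::Preempted => self.preemptions += 1,
            StopReason::Blocked => self.voluntary_blocks += 1,
            StopReason::Exited => {}
        }
        elapsed
    }
}
```

Taking `last_start_ns` on stop means a double stop cannot double-charge, which is one way to keep the ledger stable enough for later quota and fairness use.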
Stage 3: Best-Effort Fair Policy
Stage 3’s first implementation slice has landed. Phase D passed its Task 6
evidence gate at commit 77caafc0 (2026-05-10 19:39 UTC,
docs(scheduler): record phase d thread-scale gate) and closed in docs commit
1a08ec23 (2026-05-10 21:47 UTC, docs(scheduler): close phase d) with
weighted fair queueing (WFQ) as the accepted best-effort policy. The
controlled Task 6 benchmark pair recorded capOS 1-to-4 work/total
speedups 3.088x / 2.700x at 4 workers, materially closing the
prior single-global-queue 1.566x / 1.538x diagnostic gap while
the matching Linux pthread baseline on the same host and physical-core
logical CPUs 0,1,2,3 recorded 3.974x / 3.850x. The completed
execution plan is archived at
docs/backlog/scheduler-evolution.md.
After Phase D, capOS should move ordinary best-effort scheduling from WFQ toward virtual-time fairness with stronger eligibility semantics only when that follow-on is explicitly selected.
The long-term target policy is EEVDF-like:
- runnable entities accrue lag against their fair share;
- eligible entities are ordered by virtual deadline;
- weights affect virtual runtime/deadline progression;
- latency-sensitive best-effort entities can request smaller slices within policy limits;
- migration preserves accounting so moving CPUs does not reset fairness.
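The selection rule in this list can be sketched as follows. This is an illustrative reading of an EEVDF-like pick, not a capOS design: the `Entity` fields, the `REFERENCE_WEIGHT` value, and the use of a weighted average vruntime as the eligibility reference are assumptions made for the example.

```rust
/// Hypothetical runnable entity for an EEVDF-like pick.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Entity {
    pub id: u32,
    pub weight: u64,
    pub vruntime_ns: u64,
    /// Requested slice; smaller requests yield earlier virtual deadlines.
    pub request_ns: u64,
}

/// Assumed reference weight for scaling.
pub const REFERENCE_WEIGHT: u64 = 1024;

/// Weighted average vruntime V = sum(w_i * v_i) / sum(w_i).
pub fn avg_vruntime(entities: &[Entity]) -> Option<u64> {
    let total_w: u128 = entities.iter().map(|e| e.weight as u128).sum();
    if total_w == 0 {
        return None;
    }
    let sum: u128 = entities
        .iter()
        .map(|e| e.weight as u128 * e.vruntime_ns as u128)
        .sum();
    Some((sum / total_w) as u64)
}

/// Virtual deadline = vruntime + request scaled by weight.
pub fn virtual_deadline(e: &Entity) -> u64 {
    let scaled = e.request_ns as u128 * REFERENCE_WEIGHT as u128 / e.weight as u128;
    e.vruntime_ns.saturating_add(scaled as u64)
}

/// An entity is eligible when its lag is non-negative (vruntime <= V).
/// Among eligible entities, pick the earliest virtual deadline.
pub fn pick_eevdf(entities: &[Entity]) -> Option<Entity> {
    let v = avg_vruntime(entities)?;
    entities
        .iter()
        .copied()
        .filter(|e| e.weight > 0 && e.vruntime_ns <= v)
        .min_by_key(|e| (virtual_deadline(e), e.id))
}
```

The eligibility filter is what separates this from plain deadline ordering: an entity that has already run ahead of its fair share is skipped even if its virtual deadline is the earliest.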
The first implementation slice was intentionally narrower than EEVDF: weighted fair queueing on top of the existing per-thread runtime/vruntime accounting. That decision and its accepted evidence are recorded in the next subsection.
Phase D first-policy decision (2026-05-05 19:00 UTC)
Decision: weighted fair queueing (WFQ) for the first Phase D slice; EEVDF
remains the deferred follow-on. Recorded against main commit
60e421ab and the 2026-05-02 21:38 UTC thread-scale evidence pair
against main commit 374f8556 (capOS work 1.566x versus Linux
3.963x at 1-to-4 on the same physical-core pin set).
Rationale (concise):
- The 1-to-4 gap is dominated by single-global-queue scheduler-lock contention plus exit/join/block/schedule overhead, not by ordering. Any fair-share policy that successfully consumes a per-CPU split should close most of the gap. The simpler policy reaches that signal sooner with less risk.
- The existing ThreadCpuAccounting record separates the load-bearing ledger from benchmark diagnostics: runtime_ns, virtual_runtime_ns, and last_started_ns are unconditional, while context_switches, preemptions, voluntary_blocks, migrations, placement history, and blocked/exited stability probes stay behind cfg(feature = "measure"). WFQ needs only a per-thread weight and a virtual finish time derived from the unconditional vruntime; that mapping is direct. EEVDF additionally needs a per-thread request size, lag, eligibility deadline, and an ordered eligible-set structure (BTreeMap by virtual deadline). The runtime/vruntime accounting fields exist, but the eligibility/lag fields do not.
- The target environment is no_std plus spin::Mutex plus a single global scheduler lock. WFQ keeps the eligibility structure as a bucketed per-CPU FIFO ordered approximately by virtual finish time; that is a familiar VecDeque-shaped data structure that mirrors the current run_queue: VecDeque<ThreadRef> ownership. EEVDF requires an ordered set inside the scheduler-lock-protected dispatch state, which is a larger structural change than the slice the gap evidence motivates.
- Latency-class differentiation (interactive / batch / IPC server) is expressible in WFQ; Phase D pins the mapping below in the capability-surface section so the implementation slice and the short-sleeper smoke have one rule. The Phase H policy service can layer richer policy on top without requiring a tree representation underneath.
- Linux moved from CFS to EEVDF in mainline 6.6 (released 2023-10); WFQ has decades of stable OS lineage. Either choice is defensible. The weighted-fair slice does not lock capOS into WFQ permanently: the same accounting fields, capability surface, and migration contract carry directly into EEVDF when the eligibility structure is added.
Rejected alternative: EEVDF-first. It is the stronger long-term
policy and Linux’s current default. We are not picking it for the first
slice because (1) the eligibility-set data structure is a larger
diff that mixes structural change with the per-CPU enqueue
reintroduction the 1-to-4 gap evidence already motivates; (2) the lag
accounting and request-size ABI are not load-bearing for closing the
single-global-queue contention bottleneck the recorded benchmark
exposes; (3) moving from WFQ to EEVDF is a localized policy-module
change once the capability surface, migration contract, and per-CPU
queue split are accepted. The deferred EEVDF follow-on is tracked as
a later policy-evaluation slice; it is not a Phase D blocker and does
not displace Phase E SchedulingContext, which is the next scheduler
authority phase after the accepted WFQ gate.
First-slice scope (smallest implementable surface that closes the 1-to-4 gap):
- per-thread weight: u16 and latency_class: LatencyClass fields, default values matching the current single-class FIFO behavior; the cap-boundary path rejects weight = 0 and any nonzero value outside [MIN_WEIGHT, MAX_WEIGHT] (Phase D constants) with CapException::InvalidArgument rather than silently clamping, so no later divide-by-zero or overflow path can be reached through setWeight and so callers see policy denial instead of a hidden mutation. The invalidArgument variant landed in ExceptionType alongside SchedulingPolicyCap and LatencyClass with Phase D Task 1 (commit cb8c58b1, 2026-05-07); see docs/proposals/error-handling-proposal.md for the updated client-response taxonomy. The full validation rule lives in the cap-surface authority section below; this bullet records only that the validation runs at the cap boundary, not the dispatch path;
- per-thread weighted vruntime charging at runtime-charge points: the existing ThreadCpuAccounting.virtual_runtime_ns advances by elapsed_ns * REFERENCE_WEIGHT / weight (instead of the current 1:1 elapsed) on every charge_runtime call. runtime_ns continues to advance 1:1 with elapsed time so monotonic CPU accounting, measure-mode reporting, and snapshot APIs are unchanged. The weighted-vruntime change is the actual fairness mechanism; without it, weights affect only enqueue-order ties rather than cumulative share. This matches the CFS-lineage approach and keeps the WFQ derivation virtual_finish = vruntime + slice * REFERENCE_WEIGHT / weight purely as an ordering aid for the local bucket;
- per-thread virtual_finish_ns: u64 recomputed at each enqueue from virtual_runtime_ns + slice_ns * REFERENCE_WEIGHT / weight. It is not stored across blocking and is never carried as committed state; it is the per-enqueue ordering tag only;
- per-CPU bounded run_queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS] (reintroduced) each ordered ascending by virtual_finish_ns; local selection scans the queue by index for the first destination-Runnable entry (RetryLater entries left in place; the first Runnable hit is also the lowest virtual_finish_ns candidate the destination can accept because the queue is ordered), then falls back to a bounded steal scan of sibling per-CPU queues;
- scheduler-lock-contained migration that keeps virtual_runtime_ns with the thread (per-thread state, not per-CPU) and re-inserts on the destination CPU at the post-migration virtual finish time;
- a capability-authorized policy path (see §“Phase D capability surface” below) that gates weight/latency-class mutation and reads;
- one-bisect-cycle single-global-queue fallback under CAPOS_SCHED_DISABLE_WFQ=1, now retired by Phase E preflight before SchedulingContext schema work.
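The charging and ordering arithmetic in this scope list can be sketched as below. It is a minimal illustration under stated assumptions: the `REFERENCE_WEIGHT` and `SLICE_NS` values, the `Thread` struct, and the `(tag, thread id)` queue entries are hypothetical, std is used in place of the kernel's no_std collections, and weights are assumed to have been validated as nonzero before reaching this path.

```rust
use std::collections::VecDeque;

/// Assumed constants; the Phase D values are not stated in this document.
pub const REFERENCE_WEIGHT: u64 = 1024;
pub const SLICE_NS: u64 = 4_000_000;

/// Hypothetical per-thread scheduling state.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Thread {
    pub id: u32,
    pub weight: u16,
    pub runtime_ns: u64,
    pub virtual_runtime_ns: u64,
}

/// ns * REFERENCE_WEIGHT / weight, with a u128 intermediate to avoid overflow.
/// `weight` must be nonzero (validated at the cap boundary).
fn scale(ns: u64, weight: u16) -> u64 {
    (ns as u128 * REFERENCE_WEIGHT as u128 / weight as u128) as u64
}

/// runtime_ns advances 1:1; virtual_runtime_ns advances inversely to weight.
pub fn charge_runtime(t: &mut Thread, elapsed_ns: u64) {
    t.runtime_ns = t.runtime_ns.saturating_add(elapsed_ns);
    t.virtual_runtime_ns = t
        .virtual_runtime_ns
        .saturating_add(scale(elapsed_ns, t.weight));
}

/// Per-enqueue ordering tag; recomputed each time, never stored as state.
pub fn virtual_finish_ns(t: &Thread) -> u64 {
    t.virtual_runtime_ns.saturating_add(scale(SLICE_NS, t.weight))
}

/// Inserts so the queue stays ascending by tag; equal tags keep FIFO order.
pub fn enqueue(queue: &mut VecDeque<(u64, u32)>, t: &Thread) {
    let tag = virtual_finish_ns(t);
    let pos = queue.partition_point(|&(other, _)| other <= tag);
    queue.insert(pos, (tag, t.id));
}
```

With these assumed constants, a thread at twice the reference weight accrues vruntime at half the rate for the same elapsed time, so it sorts ahead of a reference-weight sibling on the next enqueue.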
The first slice is accepted: the 2026-05-10 19:46 UTC
make run-thread-scale evidence pair recorded in docs/changelog.md
and docs/benchmarks.md passed the harness-enforced 1-to-2 work/total gates,
and Phase D manually accepted the recorded 1-to-4 work/total diagnostics for
closeout. The historical success threshold lives in
docs/backlog/scheduler-evolution.md.
Phase D capability surface (kernel-side authority, no ambient process fields)
Per docs/capability-model.md “the interface IS the permission”, weight
and latency-class authority is granted by giving a process a
SchedulingPolicyCap with the appropriately scoped target. The kernel
rejects any state mutation that does not arrive through such a cap.
The schema landed with Phase D Task 1 (commit cb8c58b1, 2026-05-07). The
original sketch took a target :ThreadHandle per method, but the landed
methods carry no target argument because Phase D associates the
target through cap state, not a per-method handle parameter.
Phase D Task 2 (closeout 2026-05-07 22:51 UTC) selected the
context-derived caller-thread fallback binding from the three
sketched options. Every method routes to the calling thread,
looked up through CapCallContext::caller_thread. The kernel
cap object remains zero-sized (SchedulingPolicyCap); routing
moved from call to call_with_context so the dispatch path
sees the caller’s ThreadRef. There is no per-cap-object
ThreadHandle, no badge-encoded thread id, and no cross-thread
or cross-process mutation in this slice; per-cap-object target
references and badge-encoded thread ids are reserved for the
Phase H privileged scheduler policy service that will need
cross-thread authority. Today the manifest grant path therefore
authorizes the holder’s own threads in the strict sense: a
holder cannot reach another thread’s weight or latency_class
through this cap. The schema is:
enum LatencyClass {
  interactive @0;
  normal @1;
  batch @2;
  ipcServer @3;
}

interface SchedulingPolicyCap {
  setWeight @0 (weight :UInt16) -> ();
  setLatencyClass @1 (class :LatencyClass) -> ();
  snapshot @2 ()
    -> (weight :UInt16, class :LatencyClass,
        runtimeNs :UInt64, virtualRuntimeNs :UInt64);
}
The snapshot return is intentionally narrow: the four fields it
exposes (weight, class, runtimeNs, virtualRuntimeNs) are
the ones the WFQ slice promotes out of cfg(feature = "measure")
unconditionally. The benchmark-only counters
(context_switches, preemptions, voluntary_blocks,
migrations) stay behind the measure feature because they are
not load-bearing for ordering and remain useful only for
benchmark instrumentation; a future operator-observability slice
can add them to a separate snapshot cap once a non-emergency-path
storage and reporting surface exists.
Authority rules:
- setWeight and setLatencyClass are kernel-checked: an SQE invocation must carry a live SchedulingPolicyCap. The methods carry no per-call ThreadHandle; the target binding (selected in Phase D Task 2) is the context-derived caller-thread fallback: the kernel routes through CapCallContext::caller_thread, so a holder can only mutate its own running thread by construction. If a future cross-process grant lets a holder invoke the cap without authority over its bound target, the call fails closed through the standard cap-revocation transport-error path (the disconnected-class CapException produced by the ring dispatcher when the cap is revoked or stale); the ExceptionType taxonomy has no Denied variant by design.
- setWeight validates the input at the cap boundary, not at the dispatch path. The validation rule is: weight = 0 (which would make the WFQ derivation slice_ns * REFERENCE_WEIGHT / weight divide by zero) is rejected with CapException::InvalidArgument; any nonzero value outside [MIN_WEIGHT, MAX_WEIGHT] (Phase D constants) is also rejected with CapException::InvalidArgument. The kernel does not silently clamp out-of-range values, because a silent clamp masks caller bugs and hides cap-boundary policy from the audit surface. The invalidArgument variant landed in ExceptionType with Phase D Task 1 (commit cb8c58b1, 2026-05-07); the updated client-response taxonomy is in docs/proposals/error-handling-proposal.md.
- The bootstrap SchedulingPolicyCap is granted by manifest only. Its initial domain is Self (the holder’s own threads). Wider authority (cross-process weight/class mutation) belongs to the Phase H privileged scheduler policy service; Phase D does not promise that grant in the default boot manifest. Phase D manifests grant only the focused-proof scope needed for the test-matrix smokes.
- Default policy: a thread without any explicit cap-driven mutation carries weight = DEFAULT_WEIGHT and latency_class = LatencyClass::Normal. Behavior with all defaults must preserve the pre-Phase-D default workload behavior at the limit (no fairness regressions for unmodified workloads).
- Stale-cap revoke: SchedulingPolicyCap mutations carry the generation/epoch model used elsewhere. A weight change submitted after the cap is revoked fails closed; partially applied changes on a thread that exits between SQE arrival and dispatch fail with the standard Stale outcome and do not leak weight state.
- The cap surface is a single typed interface; restriction is by granting a narrower wrapper (e.g., SchedulingPolicyCap whose authority domain is exactly one ThreadHandle). The kernel does not carry a parallel rights bitmask.
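The reject-rather-than-clamp validation rule can be sketched as follows. The `MIN_WEIGHT` and `MAX_WEIGHT` values, the `ThreadPolicy` struct, and the one-variant `CapException` enum are placeholders for this example; the Phase D constants and the real exception type are not reproduced in this document.

```rust
/// Assumed bounds; the real Phase D constants are not given here.
pub const MIN_WEIGHT: u16 = 16;
pub const MAX_WEIGHT: u16 = 8192;

/// Placeholder for the kernel exception type; only the variant used here.
#[derive(Debug, PartialEq, Eq)]
pub enum CapException {
    InvalidArgument,
}

/// Hypothetical per-thread policy state mutated through the cap.
#[derive(Debug)]
pub struct ThreadPolicy {
    pub weight: u16,
}

/// Cap-boundary validation: zero and out-of-range weights are rejected,
/// never clamped, and a rejected call leaves the stored weight untouched.
pub fn set_weight(target: &mut ThreadPolicy, weight: u16) -> Result<(), CapException> {
    if weight == 0 || !(MIN_WEIGHT..=MAX_WEIGHT).contains(&weight) {
        return Err(CapException::InvalidArgument);
    }
    target.weight = weight;
    Ok(())
}
```

Because the check runs before any mutation, the dispatch path can divide by `weight` without a guard, which is the property the rule is meant to buy.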
Latency-class semantics for Phase D (pinned mapping):
- LatencyClass::Normal is the baseline; weight alone determines the WFQ share. The selected slice_ns is the Phase D default quantum.
- LatencyClass::Interactive reduces the per-enqueue slice contribution by a Phase D constant (INTERACTIVE_SLICE_DIVISOR; Phase D Task 2 ships 2): the WFQ derivation becomes vruntime + (slice_ns / INTERACTIVE_SLICE_DIVISOR) * REFERENCE_WEIGHT / weight. This places the entity earlier in the per-CPU queue on each enqueue, so a short-sleeper that wakes on a Timer completion runs ahead of a same-weight CPU hog within the same scheduling window. The cumulative share is unchanged because vruntime accounting still advances at elapsed_ns * REFERENCE_WEIGHT / weight; the class only affects the per-enqueue tag, not the runtime-charge step.
- LatencyClass::Batch increases the per-enqueue slice contribution by a Phase D constant (BATCH_SLICE_MULTIPLIER; Phase D Task 2 ships 4): the derivation becomes vruntime + (slice_ns * BATCH_SLICE_MULTIPLIER) * REFERENCE_WEIGHT / weight. This places the entity later in the per-CPU queue on each enqueue, so a CPU hog at LatencyClass::Batch yields wake-to-run latency to LatencyClass::Normal and LatencyClass::Interactive siblings without losing its weighted share over a long window.
- LatencyClass::IpcServer is treated identically to LatencyClass::Normal for the WFQ ordering tag in this slice. The class exists in the ABI so a Phase H policy service can later re-bind direct-IPC preference, server affinity, or scheduling-context donation rules without an ABI break; Phase D does not change the existing direct-IPC preference slot semantics for this class.
- The class is stored on Thread and read at every enqueue. A class change through setLatencyClass is observed on the next enqueue (next dequeue + re-enqueue, or next wake from blocked). No retroactive recomputation of an in-queue tag.
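The pinned mapping can be written out as a small tag function. The divisor 2 and multiplier 4 are the shipped values quoted above; `REFERENCE_WEIGHT`, the function names, and the slice value used in any call are assumptions for illustration, and `weight` is assumed nonzero.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LatencyClass {
    Interactive,
    Normal,
    Batch,
    IpcServer,
}

/// Assumed reference weight; the Phase D value is not stated here.
pub const REFERENCE_WEIGHT: u64 = 1024;
/// Values the text says Phase D Task 2 ships.
pub const INTERACTIVE_SLICE_DIVISOR: u64 = 2;
pub const BATCH_SLICE_MULTIPLIER: u64 = 4;

/// Per-class slice contribution used only for the enqueue tag.
pub fn class_slice_ns(slice_ns: u64, class: LatencyClass) -> u64 {
    match class {
        LatencyClass::Interactive => slice_ns / INTERACTIVE_SLICE_DIVISOR,
        LatencyClass::Batch => slice_ns.saturating_mul(BATCH_SLICE_MULTIPLIER),
        // IpcServer orders exactly like Normal in this slice.
        LatencyClass::Normal | LatencyClass::IpcServer => slice_ns,
    }
}

/// Enqueue tag = vruntime + class-adjusted slice * REFERENCE_WEIGHT / weight.
/// The runtime-charge step is untouched, so cumulative share is unchanged.
pub fn enqueue_tag_ns(vruntime_ns: u64, slice_ns: u64, weight: u16, class: LatencyClass) -> u64 {
    let contribution =
        class_slice_ns(slice_ns, class) as u128 * REFERENCE_WEIGHT as u128 / weight as u128;
    vruntime_ns.saturating_add(contribution as u64)
}
```

For equal vruntime and weight, the tags order Interactive before Normal and IpcServer, and Batch last, which is the wake-to-run ordering the bullets describe.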
Phase D does not build the userspace policy service (Phase H). It
adds the kernel-side primitive that Phase H will consume.
SchedulingContext (Phase E) is a separate authority for
budget/period/CPU mask; weight/latency-class is the WFQ ordering knob,
not CPU-time authority. The two cap surfaces stay disjoint.
Phase D migration fairness sketch
A thread migrating from CPU A to CPU B mid-quantum must preserve its share. Rules:
- virtual_runtime_ns is per-thread, not per-CPU. It travels with the thread on every migration. The accounting record already encodes that (ThreadCpuAccounting.virtual_runtime_ns lives on Thread, not on a CPU slot). Phase D promotes that field out of cfg(feature = "measure") and changes the charge_runtime step so the field advances by elapsed_ns * REFERENCE_WEIGHT / weight rather than 1:1 with elapsed time; the migration contract is otherwise unchanged.
- Per-CPU local clocks are not used as a vruntime reference. The scheduler reads the global monotonic clocksource through crate::arch::context::monotonic_ns(), the same source the unconditional runtime/vruntime ledger uses. There is no per-CPU clock offset because there is no per-CPU vruntime reference.
- virtual_finish_ns is recomputed at enqueue on the destination CPU from the destination weight, not carried as committed state. The migration step is remove-from-source, recompute, insert-at-destination; the scheduler lock is held for the whole window.
- Cross-CPU steal: a CPU whose local queue has no runnable entry walks sibling per-CPU queues. For each sibling queue the scan walks indices ascending and stops at that queue’s first entry the destination CPU considers Runnable; because each queue is ordered ascending by virtual_finish_ns, the first Runnable hit per queue is the lowest virtual_finish_ns candidate the destination can accept on that source. The steal target is then the source queue whose first-Runnable candidate has the lowest virtual_finish_ns globally (the same fair-share rule the local pick uses: most overdue first), with ties broken by lower CPU id. The chosen entry is removed from its actual position on the source queue (not necessarily the head: a RetryLater or single-CPU-owner thread may sit at the front and stay there); the destination recomputes virtual_finish_ns and inserts at the destination ordered position. The steal is allocation-free because both queues are pre-reserved against the live runnable count.
- The ThreadCpuAccounting.migrations counter is incremented on each cross-CPU enqueue, both for placement-time spread and for steal. The behavior mirrors the prior pre-collapse counter; the Phase D slice keeps it under cfg(feature = "measure") until a permanent operator snapshot path lands.
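The steal selection rule can be sketched as below. This is an illustration only: the `Entry` type and its `runnable_on_dest` flag are hypothetical simplifications of the destination-relative Runnable/RetryLater check, std collections stand in for the kernel's pre-reserved queues, and the destination-side recompute and ordered insert are omitted.

```rust
use std::collections::VecDeque;

/// Hypothetical queue entry. `runnable_on_dest` models whether the stealing
/// CPU may run the entry now (false stands for RetryLater or a pinned thread).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Entry {
    pub thread: u32,
    pub virtual_finish_ns: u64,
    pub runnable_on_dest: bool,
}

/// Returns (source CPU, index) of the steal target: for each sibling queue,
/// take its first destination-Runnable entry; among those candidates choose
/// the lowest virtual_finish_ns, breaking ties by lower CPU id.
pub fn pick_steal(queues: &[VecDeque<Entry>], dest_cpu: usize) -> Option<(usize, usize)> {
    let mut best: Option<(u64, usize, usize)> = None;
    for (cpu, q) in queues.iter().enumerate() {
        if cpu == dest_cpu {
            continue;
        }
        if let Some(idx) = q.iter().position(|e| e.runnable_on_dest) {
            let tag = q[idx].virtual_finish_ns;
            // Strict `<` keeps the lower CPU id on ties, because CPUs are
            // visited in ascending order.
            if best.map_or(true, |(b, _, _)| tag < b) {
                best = Some((tag, cpu, idx));
            }
        }
    }
    best.map(|(_, cpu, idx)| (cpu, idx))
}

/// Removes the chosen entry from its actual position on the source queue,
/// which need not be the head.
pub fn steal(queues: &mut [VecDeque<Entry>], dest_cpu: usize) -> Option<Entry> {
    let (src, idx) = pick_steal(queues, dest_cpu)?;
    queues[src].remove(idx)
}
```

Note that a non-runnable entry at the front of a source queue stays in place; only the first entry the destination can accept is considered from each sibling.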
The one-bisect-cycle single-global-queue fallback has been retired before Phase E. The accepted Phase D behavior is now always the per-CPU WFQ queue shape described above.
Phase D test matrix
Workload shapes the implementation slice verified before close:
- CPU hogs (existing make run-thread-scale). Equal-weight same-process threads must split CPU share within bench tolerance. Different-weight threads must split CPU share approximately in proportion to weights (e.g., weights 2:1 → roughly 2:1 runtime ratio). Phase D manually accepted the recorded 1-to-4 diagnostic at 3.088x work speedup versus the recorded 1.566x baseline.
- Short sleepers. Threads that block on Timer.sleep for short intervals must preempt CPU hogs within one quantum’s worth of bound after wake. Latency-class Interactive should have lower observed wake-to-run latency than latency-class Batch. Phase D closed this with focused make run-thread-fairness and make run-thread-fairness-interactive QEMU smokes.
- Direct IPC server/client pairs (existing make run-spawn). An IPC server thread woken by an endpoint CALL must keep paired-call timing comparable to the current direct-IPC handoff. The direct-IPC preference slot must keep its existing generation-checked semantics under WFQ; a server should not starve when the global vruntime advances on other CPUs.
- Multi-process load (existing make run-smp-process-scale). Independent worker processes with default weights must preserve the recorded 2026-04-30 1.6x 1-to-2 gate. WFQ across processes (no shared address space) must not regress that proof.
- Same-process sibling load. This is the same workload shape as make run-thread-scale; it doubles as the per-CPU-queue reintroduction proof.
The exact historical per-workload acceptance numbers live in
docs/backlog/scheduler-evolution.md.
Phase D overload behavior
Soft overload (runnable entities × weight exceeds the selected CPU set’s capacity):
- Each entity gets less than its weighted share. No entity is starved; vruntime ordering guarantees that the most-behind thread runs next.
- The scheduler does not refuse to enqueue. Phase D’s WFQ does not implement strict admission; that belongs to Phase E (SchedulingContext budget/period) and Phase G (RealtimeIsland admission).
Hard overload (e.g., a RealtimeIsland admission attempt that
collides with an active CpuIsolationLease):
- Use the existing isolation/admission path; Phase D defers to Phase F’s CpuIsolationLease and Phase G’s RealtimeIsland for that behavior. WFQ continues to schedule best-effort work on the housekeeping CPU set.
- If an isolation lease holds CPU N and N has runnable best-effort work that cannot migrate (e.g., bound by manifest pinning), the lease attempt fails closed; existing CPU-mask validation remains the gate. Phase D does not introduce new pinning policy.
Strict admission, deadline overrun, and budget depletion are explicitly out of scope for Phase D and stay in Phase E/G.
Stage 4: Scheduling Contexts
CPU-time authority becomes a capability. SchedulingContext records budget,
period, relative deadline, priority or criticality, CPU mask, remaining
budget, replenishment state, timeout endpoint, and overrun policy.
The landed Phase E slices remain narrower than the full target above. The ABI
now has SchedulingContextSpec authority inputs for budgetNs, periodNs,
relativeDeadlineNs, byte-oriented cpuMask, and overrunPolicy, plus a
read-only SchedulingContextInfo snapshot with context identity, lifecycle
state, binding state, remaining budget, and an explicit dispatch-effect label.
SchedulingContext.info() remains method id 0. SchedulingContext.create()
creates a same-interface result cap for a validated spec,
bindCallerThread() records one caller-thread binding for the current
generation, and revoke() advances the generation and clears the matching
thread metadata binding. Bootstrap-granted contexts and contexts returned by
create() draw from the same non-wrapping context-id allocator, so the
(contextId, generation) binding key does not alias distinct cap objects.
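A minimal sketch of why a shared non-wrapping allocator keeps the binding key unambiguous is given below. The `ContextIdAllocator` and `BindingKey` types, the starting id, and the behavior on exhaustion are assumptions for illustration; the document states only that the allocator does not wrap and is shared by bootstrap-granted and created contexts.

```rust
/// Hypothetical binding key: one context id plus its current generation.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct BindingKey {
    pub context_id: u64,
    pub generation: u64,
}

/// Single allocator shared by bootstrap grants and create() results.
pub struct ContextIdAllocator {
    next: u64,
}

impl ContextIdAllocator {
    /// Starts handing out ids at `next` (assumed to be 1 at boot).
    pub const fn starting_at(next: u64) -> Self {
        Self { next }
    }

    /// Non-wrapping: once the id space is exhausted, allocation fails
    /// instead of reusing an id that a live cap may still name.
    pub fn allocate(&mut self) -> Option<BindingKey> {
        let id = self.next;
        self.next = id.checked_add(1)?;
        Some(BindingKey { context_id: id, generation: 0 })
    }
}

/// revoke() advances the generation, so keys captured before the revoke
/// no longer match the context's current key.
pub fn revoke(key: BindingKey) -> Option<BindingKey> {
    Some(BindingKey {
        generation: key.generation.checked_add(1)?,
        ..key
    })
}
```

Because ids never repeat and generations only advance, a stale key can be detected by simple inequality with the current key.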
Bound active contexts now install a fixed per-thread dispatcher budget ledger:
runtime charge decrements remainingBudgetNs, runnable selection replenishes
elapsed periods, and exhausted contexts remain queued but ineligible until the
next replenishment period. The effect label is budgetEnforced for active
contexts and stays infoOnlyNoDispatchChange for stale/revoked fail-closed
paths. Deadline-driven accounting now arms a sub-tick budget-exhaustion
one-shot when the selected thread’s remaining budget would deplete before the
next periodic scheduler tick, and nohz re-arm folds the leased thread’s budget
deadline into its existing nearest-deadline timer. When a budget one-shot fires
in kernel mode, a live periodic timer is restored before returning to kernel
code, so the ordinary and tick-masked paths no longer rely on a full tick
quantum to observe budget depletion.
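As an illustration of the ledger behavior described above, here is a minimal sketch of charge, replenishment, eligibility, and the sub-tick one-shot decision. The BudgetLedger type, its fields, and the full-refill-per-period rule are assumptions made for this example; the kernel's actual ledger layout and replenishment rule may differ.

```rust
/// Hypothetical per-thread budget ledger (names are illustrative).
#[derive(Debug, Clone)]
pub struct BudgetLedger {
    pub budget_ns: u64,
    pub period_ns: u64,
    pub remaining_budget_ns: u64,
    /// Start of the current replenishment period.
    pub period_start_ns: u64,
}

impl BudgetLedger {
    pub fn new(budget_ns: u64, period_ns: u64, now_ns: u64) -> Self {
        assert!(period_ns > 0, "period must be nonzero");
        Self {
            budget_ns,
            period_ns,
            remaining_budget_ns: budget_ns,
            period_start_ns: now_ns,
        }
    }

    /// Charge measured runtime; the remaining budget saturates at zero.
    pub fn charge(&mut self, ran_ns: u64) {
        self.remaining_budget_ns = self.remaining_budget_ns.saturating_sub(ran_ns);
    }

    /// Called during runnable selection. Assumption: any number of elapsed
    /// periods refills to the full budget, with no carry-over.
    pub fn replenish(&mut self, now_ns: u64) {
        let elapsed = now_ns.saturating_sub(self.period_start_ns);
        let periods = elapsed / self.period_ns;
        if periods > 0 {
            self.period_start_ns += periods * self.period_ns;
            self.remaining_budget_ns = self.budget_ns;
        }
    }

    /// Exhausted contexts stay queued but are not eligible to run.
    pub fn eligible(&self) -> bool {
        self.remaining_budget_ns > 0
    }

    /// Returns the absolute time at which to arm a budget one-shot when the
    /// budget would deplete before the next periodic tick, otherwise None.
    pub fn one_shot_deadline(&self, now_ns: u64, next_tick_ns: u64) -> Option<u64> {
        let depletes_at = now_ns.saturating_add(self.remaining_budget_ns);
        if self.eligible() && depletes_at < next_tick_ns {
            Some(depletes_at)
        } else {
            None
        }
    }
}
```

With a 2 ms budget in a 10 ms period, a thread that has run 1.5 ms and faces a tick 2.5 ms away would get a one-shot armed 0.5 ms out rather than overrunning until the tick.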
Synchronous endpoint donation/return now covers passive receiver threads:
- endpoint in-flight state carries an internal donation token;
- receiver runtime charges to the caller-donated context;
- RETURN, application-exception RETURN, or invalid-result RETURN restores the reduced budget to the caller before caller wake;
- a donor with an in-flight token is blocked from returning to userspace until RETURN/cancel, using an atomic marker-to-block transition that treats already-returned fast paths as normal completion;
- nested donation of an already donated context is rejected until stacked return tokens have a dedicated design.
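The donation rules above can be modeled roughly as follows. This is a sketch under stated assumptions: CallerContext, DonationToken, and DonateError are hypothetical names, and moving the budget value into the token is a modeling convenience rather than a description of how the kernel stores it.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DonateError {
    /// Nested donation of an already donated context is rejected.
    AlreadyDonated,
}

/// Hypothetical caller-side scheduling context state.
pub struct CallerContext {
    pub remaining_budget_ns: u64,
    pub donated: bool,
}

/// Hypothetical token carried in endpoint in-flight state.
pub struct DonationToken {
    budget_ns: u64,
}

impl CallerContext {
    /// Donate the remaining budget to a receiver for one synchronous call.
    pub fn donate(&mut self) -> Result<DonationToken, DonateError> {
        if self.donated {
            return Err(DonateError::AlreadyDonated);
        }
        self.donated = true;
        let budget_ns = core::mem::take(&mut self.remaining_budget_ns);
        Ok(DonationToken { budget_ns })
    }

    /// RETURN path: restore the reduced budget before the caller is woken.
    pub fn accept_return(&mut self, token: DonationToken) {
        self.remaining_budget_ns = token.budget_ns;
        self.donated = false;
    }
}

impl DonationToken {
    /// Receiver runtime is charged against the donated budget.
    pub fn charge(&mut self, ran_ns: u64) {
        self.budget_ns = self.budget_ns.saturating_sub(ran_ns);
    }

    pub fn remaining_ns(&self) -> u64 {
        self.budget_ns
    }
}
```

Consuming the token by value in accept_return means a single donation cannot be returned twice, which mirrors the one-token-in-flight rule.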
Timeout/depletion notifications now use fixed per-context cells allocated at
context creation/bootstrap. The cells coalesce budget-depleted and
deadline-or-timeout events with typed sequence/count metadata, holder identity,
remaining budget, next timestamp, donated-holder marking, explicit-revoke
lifecycle state, and ok/revoked/staleGeneration observer results through
SchedulingContext.drainNotifications(). Notification publishing does not
allocate in scheduler hard paths, publish result caps, append unbounded queues,
donate budget, reorder runnable entities, bypass throttling, or imply nohz
behavior. A pre-armed observer waiter/wakeup path, realtime admission, SQPOLL,
nohz, and CPU placement enforcement remain future work. Stale caps report
staleGeneration and cannot mutate the new generation’s scheduler metadata or
budget ledger; revoked contexts report revoked. Ordinary non-donated
session logout now uses the same stale-generation rule: after
UserSession.logout() flips the liveness cell, the scheduler removes matching
non-donated bound thread contexts and marks the old cap generation stale. The
focused session-context proof covers stale info, bindCallerThread,
create, revoke, and notification-drain behavior without result-cap
publication or metadata mutation. Donated receiver logout keeps the
conservative skip policy: if logout observes a receiver thread holding an
endpoint-donated context, the hook counts the skipped donated binding and
leaves the donor blocked until endpoint RETURN/cancel commits cleanup. The
focused session-context proof covers the RETURN case by showing the receiver
logs out while holding the donation, the donor stays blocked, the hook reports
donation_inflight_skipped=1, and the caller observes a bound context with
reduced remaining budget after RETURN rather than fresh budget. Clean local
owner-shell exit now calls the held UserSession.logout() before process exit,
and the shell smoke observes the same scheduler hook with no bound local shell
SchedulingContext.
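To make the fixed-cell notification model from this section concrete, here is a small sketch of a cell that coalesces events into counters and is drained by an observer. NotificationCell, Event, and the field set are hypothetical and cover only a subset of the metadata listed above; the point is that publishing updates fixed storage and never allocates or queues.

```rust
/// Hypothetical event kinds coalesced by the cell.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    BudgetDepleted,
    DeadlineOrTimeout,
}

/// Hypothetical fixed-size cell, allocated once at context creation.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct NotificationCell {
    pub budget_depleted_count: u32,
    pub deadline_or_timeout_count: u32,
    /// Total events ever published; not reset by drain.
    pub sequence: u64,
    pub last_remaining_budget_ns: u64,
}

impl NotificationCell {
    /// Publish from a scheduler hard path: counters only, no allocation.
    pub fn publish(&mut self, event: Event, remaining_budget_ns: u64) {
        match event {
            Event::BudgetDepleted => {
                self.budget_depleted_count = self.budget_depleted_count.saturating_add(1);
            }
            Event::DeadlineOrTimeout => {
                self.deadline_or_timeout_count =
                    self.deadline_or_timeout_count.saturating_add(1);
            }
        }
        self.sequence = self.sequence.wrapping_add(1);
        self.last_remaining_budget_ns = remaining_budget_ns;
    }

    /// Observer side: return a snapshot and reset the coalesced counts.
    pub fn drain(&mut self) -> NotificationCell {
        let snapshot = *self;
        self.budget_depleted_count = 0;
        self.deadline_or_timeout_count = 0;
        snapshot
    }
}
```

An observer that drains late sees counts rather than individual events, which is the trade-off for bounded storage.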
cpuMask is a canonical little-endian bitset. CPU n maps to bit n % 8 of
byte n / 8, with bit 0 as the least-significant bit of each byte. Empty data
means no CPUs are selected, not “all CPUs”; future admission/bind validation
rejects empty masks for runnable contexts. Producers omit trailing zero bytes:
the all-zero set is encoded as empty, and any non-empty canonical mask has a
nonzero final byte. This slice only snapshots that shape and does not enforce
placement from it.
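The encoding rule can be checked with a few helper functions. This is a sketch of the stated bit layout only; the function names are hypothetical and are not part of the SchedulingContext ABI.

```rust
/// Encode CPU indices as a canonical little-endian byte mask:
/// CPU n maps to bit (n % 8) of byte (n / 8), bit 0 being the LSB.
pub fn encode_cpu_mask(cpus: &[usize]) -> Vec<u8> {
    let mut mask: Vec<u8> = Vec::new();
    for &cpu in cpus {
        let byte = cpu / 8;
        if mask.len() <= byte {
            mask.resize(byte + 1, 0u8);
        }
        mask[byte] |= 1u8 << (cpu % 8);
    }
    // Only bits are set and the vector grows on demand, so a non-empty
    // result always ends in a nonzero byte.
    mask
}

/// A mask is canonical when it is empty or its final byte is nonzero.
pub fn is_canonical(mask: &[u8]) -> bool {
    mask.last().map_or(true, |&b| b != 0)
}

/// Strip trailing zero bytes so that the all-zero set becomes empty.
pub fn canonicalize(mut mask: Vec<u8>) -> Vec<u8> {
    while mask.last() == Some(&0u8) {
        mask.pop();
    }
    mask
}

/// Test membership; CPUs beyond the encoded length are not selected.
pub fn contains_cpu(mask: &[u8], cpu: usize) -> bool {
    mask.get(cpu / 8)
        .map_or(false, |&b| b & (1u8 << (cpu % 8)) != 0)
}
```

For example, selecting CPUs 0 and 9 yields the two bytes 0x01 and 0x02, and selecting no CPUs yields an empty byte string rather than a run of zero bytes.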
Remaining kernel responsibilities:
- prevent a thread without eligible CPU authority from running;
- charge runtime to exactly one authority target;
- add any pre-armed timeout/depletion observer wake path without allocating in emergency paths.
Policy-service responsibilities:
- admit or reject scheduling contexts;
- choose budget/period/priority;
- bind contexts to threads/services;
- revoke or adjust contexts safely;
- record operator-visible decisions.
SQE.deadline_ns remains request metadata. It may influence drop, freshness,
propagation, and telemetry, but it does not grant CPU budget.
Stage 5: CPU Isolation Leases and SQPOLL
CpuIsolationLease grants placement and exclusivity, not CPU time. It records
the owner process/session/service, CPU set, mode, housekeeping exclusions,
accounting target, maximum revocation latency, and revoke endpoint.
The current Phase F implementation keeps ticks periodic but makes
housekeeping/deferred-work placement explicit: at least one online scheduler
housekeeping CPU must remain outside active lease candidates, and preflight
telemetry routes or rejects deferred cleanup, timer/deadline, network polling,
IRQ affinity, scheduler accounting, and cleanup latency before later SQPOLL or
nohz behavior can use the lease.
The Phase F substrate landed so far is:
- the one-SQ-consumer ring-ownership prerequisite that lets nohz/SQPOLL reason about a single submission consumer per ring;
- nohz activation telemetry that labels admit/reject decisions, rollback reasons, and current periodic-tick fallback state without changing dispatch behavior;
- housekeeping/deferred-work placement preflight, which fail-closes when unrelated timers, deferred cleanup, network polling, debug/watchdog work, or IRQ delivery would otherwise be pinned to a candidate isolated CPU;
- a bounded SQPOLL ring-mode worker (MAX_SQPOLL_WORKERS = 16) that records tick_suppression=disabled/full_nohz=disabled strings while the activation proof is still open, with generation-checked stale-owner rollback;
- a clockevent/deadline substrate independent of the periodic tick, so the scheduler can express “wake at deadline T” without depending on periodic ticks to enforce budget;
- a bounded non-periodic SQPOLL producer-wake progress path that lets a parked SQPOLL worker make forward progress on producer activity without reverting to a periodic tick.
Automatic nohz activation – actually suppressing the periodic scheduler
tick on an admitted CPU and restoring it on rollback/revoke/stale
generation – was closed for the first increment via
docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md:
the CpuIsolationLease preflight now performs
real per-CPU periodic-tick suppression for the narrow single-runnable-entity
window, satisfying proof obligations for single runnable entity on the
target CPU, ready housekeeping CPU outside the lease, non-local
deferred-cleanup/timer/network/IRQ dependencies, valid accounting target,
bounded revocation latency, and generation-checked ring ownership, with
fail-closed rollback. SQPOLL-driven auto-nohz activation is also closed via
docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md:
a ring-coupled kernelSqpoll lease whose bound ring is in SQPOLL
running/sleeping mode with a live owner is admitted for tick suppression,
with the SQPOLL ring-state re-check as the decisive rollback gate. The
tick_suppression, auto_nohz, and sqpoll telemetry counters reflect
real suppression. Generic full-nohz for ordinary budgeted compute threads is
now admitted by explicit SchedulingContext-targeted CpuIsolationLease
preflight; production realtime island admission remains deferred independently
of these closed tasks.
Activation requires scheduler proof:
- at least one housekeeping CPU remains online;
- unrelated timers, deferred cleanup, network polling, and debug/watchdog work are not pinned to the isolated CPU;
- the active ring has exactly one SQ consumer;
- the accounting target is valid and chargeable;
- revocation latency fits the lease policy.
The scheduler idle path is now a per-CPU CPL0 (kernel-mode) idle thread;
the user-mode idle process was removed in commit e3c0df01 (2026-05-14 UTC).
There are two CPL0 idle paths: the cooperative boot/AP path that hlts at
CPL0 on the per-CPU kernel stack, and the steady-state idle-thread path
reached from the four dispatch sites (schedule, capos_block_current_syscall,
exit_current, exit_current_thread). Both are described in detail in
Scheduling.
SQPOLL uses the ring-mode contract in Tickless and Realtime Scheduling. The scheduler proposal adds the CPU-ownership and policy-service side of that contract.
Stage 6: Realtime Islands
A RealtimeIsland is an admitted graph, not a single priority. It records
scheduling contexts, memory reservations, device and IRQ reservations,
rings/endpoints/notifications, any CPU isolation leases, admission evidence,
and overrun/shutdown policy.
Use cases include local audio, realtime voice, robotics control, and selected provider/runtime loops. Admission must fail closed if the graph cannot fit the declared period/quantum and reservations.
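As one possible reading of "fit", the sketch below rejects a set of budget/period reservations whose summed utilization exceeds the available CPUs. This utilization test is an assumption chosen for illustration, not the proposal's admission algorithm: it is a necessary check only, and it ignores memory, device, and IRQ reservations. Reservation, AdmitError, and admit are hypothetical names.

```rust
/// Hypothetical CPU reservation drawn from a scheduling context.
pub struct Reservation {
    pub budget_ns: u64,
    pub period_ns: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AdmitError {
    InvalidReservation,
    OverCapacity,
}

/// Utilization is tracked in parts per million to avoid floating point.
const PPM: u128 = 1_000_000;

/// Fail closed: any invalid reservation or capacity overrun rejects the graph.
pub fn admit(reservations: &[Reservation], cpus: u32) -> Result<(), AdmitError> {
    let mut total_ppm: u128 = 0;
    for r in reservations {
        if r.period_ns == 0 || r.budget_ns == 0 || r.budget_ns > r.period_ns {
            return Err(AdmitError::InvalidReservation);
        }
        let num = r.budget_ns as u128 * PPM;
        let den = r.period_ns as u128;
        // Round up so that rounding error never admits an over-full set.
        total_ppm += (num + den - 1) / den;
    }
    if total_ppm > cpus as u128 * PPM {
        Err(AdmitError::OverCapacity)
    } else {
        Ok(())
    }
}
```

Rounding each term up keeps the check conservative, which matches the fail-closed requirement.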
Stage 7: User-Space Scheduler Policy
After kernel primitives are in place, a privileged scheduler policy service can own:
- default resource profiles;
- session/account/service CPU policy;
- scheduling-context admission;
- CPU lease grant/revoke;
- runtime hints such as latency-sensitive, batch, driver, poller, or agent;
- AutoNoHz placement for ordinary threads that appear capable of utilizing a full CPU core (see Policy-Service Userstories in tickless-realtime-scheduling-proposal);
- operator-facing diagnostics and policy reload.
AutoNoHz placement is the policy-service surface that turns the “thread
appears capable of utilizing a full CPU core” observation into a bounded
CpuIsolationLease against a pre-authorized account or session CPU pool. The
lease adds isolation; it does not mint CPU-time authority. The thread still
consumes time through its existing SchedulingContext (or coarse
ResourceLedger); the lease just removes tick and scheduler noise while that
budget is being consumed. Bounds the policy service must enforce on every
auto-issued lease – lifetime, revocation latency, accounting target,
auto-claim pool capacity, and fairness preemption – are detailed in the
tickless proposal.
The kernel still owns emergency fallback. If the policy service is dead, blocked, stale, or malicious, the kernel must continue to enforce safety, revoke leases as policy permits, and schedule a minimal recovery path.
Validation Gates
- Per-CPU queue work must preserve run-smoke, run-spawn, run-thread-scale, park/ring/process-exit smokes, and SMP smokes.
- A thread-scale milestone closeout must include repeated controlled capos-bench evidence and raw logs.
- CPU accounting must include sanity tests showing that measured runtime increases monotonically while a thread runs and stops increasing while it is blocked.
- Fair policy changes must include adversarial tests: CPU hogs, short sleepers, direct IPC handoff, multi-process load, and same-process sibling load.
- Scheduling-context work must include admission rejection, budget depletion, replenishment, endpoint donation/return, timeout notification, stale cap revocation tests, and any future pre-armed notification waiter coverage.
- CPU leases must include revocation, process exit, session close, and housekeeping fallback tests.
- Realtime island proofs must show preallocation, no allocation/blocking on admitted paths, deadline miss telemetry, and fail-closed overrun behavior.
Open Decisions
- Whether the first best-effort fair policy should be weighted fair queueing or direct EEVDF. Resolved 2026-05-05 19:00 UTC: WFQ first; EEVDF deferred follow-on. See “Phase D first-policy decision” above.
- Whether scheduling-context priority is a scalar, a criticality band, or both.
- Whether SchedulingContext should be bindable to a process default, individual thread, endpoint call path, or all three in the first ABI.
- Which scheduler telemetry is permanent ABI and which is benchmark-only.
- How much policy-service state belongs in the boot manifest versus mutable operator configuration.
- Whether the WFQ slice’s bucketed VecDeque per-CPU queue is the long-term representation or a stepping stone to an EEVDF BTreeMap-based eligibility set. EEVDF is an evaluated follow-on policy, not a committed migration; re-evaluate only when the explicit Phase D follow-on EEVDF migration backlog item is selected.
Phase F’s one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work placement, bounded SQPOLL ring mode, clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake progress have landed on top of the closed Phase E SchedulingContext gate. The first automatic nohz activation increment is also closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md, and SQPOLL-driven auto-nohz activation is closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md. Timeout-based auto-revoke and ordinary-thread generic full-nohz admission have also landed. The policy-service AutoNoHz capstone and generic SQPOLL nohz for arbitrary rings remain open. Phase F.5 (full-SMP 16/32-core scalability) is still in planning.