# Plan: Scheduler Phase D — Weighted-Fair Best-Effort Scheduling

## Overview
Implementation track for the Phase D best-effort fair-share policy chosen in `docs/proposals/scheduler-evolution-proposal.md`, “Phase D first-policy decision (2026-05-05 19:00 UTC)”. The selected policy is weighted fair queueing (WFQ) on top of the existing per-thread `runtime_ns` / `virtual_runtime_ns` accounting, with reintroduced per-CPU runnable queues and a capability-authorized `SchedulingPolicyCap` for weight and latency-class mutation. EEVDF remains the deferred follow-on: it is revisited once the WFQ slice has accepted thread-scale evidence, provided the open Phase E SchedulingContext work has not displaced fair-share-only ordering by then.

The proposal section linked above is the design source of truth for the policy choice, the rejected alternative (EEVDF-first), the capability surface, the migration fairness sketch, the test matrix, and overload behavior. The tasks below decompose that design into implementation work; each task ends with its matching validation gate. The plan runs ad-hoc until selected, then proceeds as the selected scheduler milestone.
This plan replaces the bare `WORKPLAN.md` bullet “Scheduler Phase D – best-effort fair scheduling” once the design slice merges. Phase E (SchedulingContext) and Phase F (auto-nohz / SQPOLL) keep their own backlog/plan ownership; this plan does not extend into them.
## Conflict Surface

Owned by this plan:
- `kernel/src/sched.rs` (per-CPU `run_queue` reintroduction, WFQ ordering helpers, migration/steal path, capability-authorized weight/class mutation hooks).
- `kernel/src/process.rs` (`Thread.weight`, `Thread.latency_class`, `Thread.virtual_finish_ns` field additions; default values match the current single-class FIFO behavior).
- `kernel/src/cap/sched_policy.rs` (NEW kernel cap implementation for `SchedulingPolicyCap`).
- `kernel/src/cap/mod.rs` (cap registration only).
- `schema/capos.capnp` (new `LatencyClass` enum and `SchedulingPolicyCap` interface; queues on the shared serial surface per `docs/plans/README.md` Concurrency Notes).
- `tools/generated/` (regenerated capnp bindings via the existing `make generated-code-check` gate).
- `capos-rt/src/client.rs` (new `SchedulingPolicyClient` typed wrapper for the userspace runtime).
- `capos-config/src/manifest.rs` and the matching schema additions for manifest-granted `SchedulingPolicyCap` records.
- `tools/qemu-thread-scale-harness.sh` (only if the harness needs WFQ-specific assertions; if a separate fairness smoke is added, it gets its own `tools/qemu-*-smoke.sh`).
- `docs/proposals/scheduler-evolution-proposal.md` (status updates and Phase D closeout stamps only; the design content already landed with this plan).
- `docs/plans/scheduler-phase-d.md` (this file).
- `docs/plans/README.md` Track Map row for this plan.
- `docs/architecture/scheduling.md` (state-of-implementation updates as per-CPU queues and WFQ ordering land; mark “single global runnable queue” as historical when the per-CPU split returns).
- `docs/backlog/scheduler-evolution.md` Phase D bullets and the matching closeout stamps.
Coordinated overlap with sibling tracks:
- `schema/capos.capnp`: serialise on the shared serial surface per `docs/plans/README.md` Concurrency Notes. Phase D adds new interface entries and must not run concurrently with another schema-touching plan.
- `kernel/src/cap/`: this plan adds one new cap module (`sched_policy.rs`) and touches the cap registration list. Other active plans that touch `kernel/src/cap/` (Device Driver Foundation, POSIX P1.2/P1.3) are kernel-core-serial work; do not run them concurrently with this plan.
- `kernel/src/sched.rs`: this plan owns scheduler-core changes. Other plans must not modify the runnable queue, dispatch state, or weight/class fields while this plan is active.
- `kernel/src/process.rs`: this plan adds `Thread` fields. Other plans must not modify `Thread` state during the active slice.
Do not touch from this plan:
- `kernel/src/cap/sched_context.rs` (Phase E surface; not yet written, owned by the future Phase E plan).
- `kernel/src/cap/cpu_isolation_lease.rs` (Phase F surface; not yet written, owned by the future Phase F plan).
- `kernel/src/cap/realtime_island.rs` (Phase G surface).
- Userspace policy service (Phase H); the Phase D cap surface must be Phase H-consumable, but Phase D does not build the policy service itself.
- `tools/remote-session-client/` (owned by the remote-session plan).
- `docs/topics.md` (auto-regenerated; never edit manually).
- Any unrelated proposal/plan file.
## Validation Commands
- `make fmt-check`
- `make generated-code-check`
- `cargo build --features qemu`
- `cargo build --features qemu,measure`
- `cargo test-config`
- `cargo test-lib`
- `cargo test-ring-loom`
- `cargo build-demos-capos`
- `make capos-rt-check`
- `make run-smoke`
- `make run-spawn`
- `make run-smp2-smokes`
- `make run-thread-scale` (the milestone gate; must materially close the recorded 1-to-4 capOS-vs-Linux gap)
- `make run-smp-process-scale` (regression gate; must keep the recorded 1-to-2 `1.6x` speedup against the multi-process proof)
- `make run-measure` (regression gate; the new accounting fields must not break the existing measure-mode proof line)
## Success Criteria
Phase D is recorded done when:
- The `SchedulingPolicyCap` interface is in `schema/capos.capnp`, the kernel cap implementation is in `kernel/src/cap/sched_policy.rs`, the manifest grant path is wired through `capos-config`, and the userspace typed client is in `capos-rt/src/client.rs`. A focused QEMU smoke proves that a manifest-granted cap can mutate weight and latency class on a target `ThreadHandle` and that a stale or revoked cap fails closed.
- Per-CPU runnable queues are reintroduced under the WFQ ordering rule. The single-global-queue fallback remains selectable via `CAPOS_SCHED_DISABLE_WFQ=1` for one bisect cycle and is retired before Phase E.
- Migration preserves `virtual_runtime_ns` (already per-thread) and recomputes `virtual_finish_ns` at destination enqueue. The bounded steal path picks the source queue whose head has the lowest `virtual_finish_ns` (the most overdue work another CPU has not yet dispatched), matching the local pick rule (front of the ascending per-CPU queue).
- The 1-to-4 capOS-vs-Linux thread-scale gap is materially closed. Concretely: a 5-run controlled `make run-thread-scale` against the post-WFQ kernel, pinned to physical-core logical CPUs `0,1,2,3` on `capos-bench`, must record a capOS work speedup of at least `2.5x` at 1-to-4 (the recorded baseline is `1.566x`; Linux records `3.963x` against the same shape on the same pin set). The 1-to-2 row must keep the configured `1.6x` gate. Total speedup is reported as a diagnostic and must not regress below the recorded `1.538x` 1-to-4 baseline.
- `make run-spawn`, `make run-smp2-smokes`, `make run-smp-process-scale`, and `make run-measure` remain green. The recorded multi-process `1.6x` 1-to-2 gate from `2026-04-30` must hold.
- The `docs/proposals/scheduler-evolution-proposal.md` Phase D section, the `docs/backlog/scheduler-evolution.md` Phase D bullets, `docs/architecture/scheduling.md`, `docs/changelog.md`, `WORKPLAN.md`, and `docs/roadmap.md` carry the closeout stamp with commit hash and minute-precision timestamp.
The plan is not scoped to deliver Phase E (`SchedulingContext` budget/period authority), Phase F (`CpuIsolationLease` and SQPOLL nohz), Phase G (`RealtimeIsland`), or Phase H (userspace policy service). Those phases are sequenced after Phase D and own their own plan files.
## Task 1: Schema and capability surface
- Add the `invalidArgument` variant to the existing `ExceptionType` enum in `schema/capos.capnp`. The current enum has only `failed`/`overloaded`/`disconnected`/`unimplemented`; the `setWeight` policy denial below needs a distinct typed signal (caller-bug rejection vs. general failure vs. back-pressure). This addition is part of the Phase D schema-surface acquisition documented in the proposal's Phase D capability-surface section. Keep the variant ordering stable for ABI compatibility.
- Add the `LatencyClass` enum (`interactive`, `normal`, `batch`, `ipcServer`) to `schema/capos.capnp` and regenerate bindings via `make generated-code-check`.
- Add the `SchedulingPolicyCap` interface with `setWeight`, `setLatencyClass`, and `snapshot` methods. The snapshot return is narrow: `weight`, `class`, `runtimeNs`, `virtualRuntimeNs`. Those four fields are the ones Task 2 promotes out of `cfg(feature = "measure")` unconditionally. Do NOT add `contextSwitches`, `preemptions`, `voluntaryBlocks`, or `migrations` to the ABI in this slice; those counters stay benchmark-only and would either fail to compile in the normal `qemu` build or expose fields the kernel does not track. A future operator-observability slice may add them through a separate snapshot cap.
- Implement `setWeight` validation at the cap boundary (not the dispatch path) with the rule from the proposal: `weight = 0` and any nonzero value outside `[MIN_WEIGHT, MAX_WEIGHT]` (Phase D constants) are rejected with `CapException::InvalidArgument`. The kernel does NOT silently clamp out-of-range values; a future caller/test can rely on the rejection signal. This ensures no later divide-by-zero or overflow path is reachable through the cap. (A sketch of the rule follows this list.)
- Add a `KernelCapSource::SchedulingPolicy` variant under the manifest grant path so a manifest can grant the cap to a named process. Phase D grants the cap only to focused-proof manifests (`system-thread-fairness.cue` and similar Task 5 smokes); the default boot manifest does NOT grant the cap in this slice. Wider authority (cross-process weight/class mutation, default-grant to a userspace policy service) belongs to the future Phase H plan.
- Add a `capos-rt::client::SchedulingPolicyClient` typed wrapper that maps transport errors and the `CapException` decode shape consistently with the existing clients.
- Validate: `make fmt-check`, `make generated-code-check`, `cargo test-config`, `cargo test-lib`, `cargo build --features qemu` (warning-free).
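A minimal sketch of the cap-boundary rule, assuming illustrative constant values and a standalone `CapException` enum (the real constants and error plumbing live in the kernel cap module); the plan fixes only the behavior: reject zero and out-of-range weights, never clamp.

```rust
// Sketch only. The MIN_WEIGHT / MAX_WEIGHT values and this standalone
// CapException enum are illustrative assumptions, not the Phase D
// constants themselves.
#[derive(Debug, PartialEq)]
pub enum CapException {
    InvalidArgument,
}

const MIN_WEIGHT: u16 = 1;    // assumed value; real Phase D constant
const MAX_WEIGHT: u16 = 1024; // assumed value; real Phase D constant

/// Cap-boundary validation: rejecting weight = 0 (and anything outside
/// the window) here means no later REFERENCE_WEIGHT / weight division is
/// reachable with a zero divisor, so the dispatch path never re-checks.
pub fn validate_weight(weight: u16) -> Result<u16, CapException> {
    if !(MIN_WEIGHT..=MAX_WEIGHT).contains(&weight) {
        return Err(CapException::InvalidArgument);
    }
    Ok(weight)
}
```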
## Task 2: Per-thread weight and latency-class state
- Add `weight: u16` and `latency_class: LatencyClass` fields to `Thread` (in `kernel/src/process.rs`), with default values matching the current single-class behavior (`weight = DEFAULT_WEIGHT`, `latency_class = LatencyClass::Normal`). These fields must be unconditional (not behind `cfg(feature = "measure")`) because they participate in dispatch ordering.
- Promote `runtime_ns`, `virtual_runtime_ns`, and `last_started_ns` from `ThreadCpuAccounting` out of `cfg(feature = "measure")` so the WFQ ordering, the runtime-charge path, and the `snapshot` cap method work in the normal `qemu` build. The `context_switches`, `preemptions`, `voluntary_blocks`, and `migrations` counters stay behind the `measure` feature and are NOT exposed through `SchedulingPolicyCap.snapshot` in this slice. Document the choice in `docs/architecture/scheduling.md`.
- Change the `charge_runtime` step so `virtual_runtime_ns` advances by `elapsed_ns * REFERENCE_WEIGHT / weight` instead of 1:1 with `elapsed_ns`. `runtime_ns` continues to advance 1:1 with elapsed time so monotonic CPU accounting and `snapshot.runtimeNs` are unchanged. This is the actual fairness mechanism; without it, weights affect only enqueue-order ties rather than cumulative share.
- Add `virtual_finish_ns: u64`, derived per enqueue and not stored across blocking. The derivation rule depends on `latency_class` per the proposal's “Latency-class semantics for Phase D” subsection: `Normal` and `IpcServer` use `vruntime + slice_ns * REFERENCE_WEIGHT / weight`; `Interactive` uses `vruntime + (slice_ns / INTERACTIVE_SLICE_DIVISOR) * REFERENCE_WEIGHT / weight`; `Batch` uses `vruntime + (slice_ns * BATCH_SLICE_MULTIPLIER) * REFERENCE_WEIGHT / weight`. `slice_ns`, `REFERENCE_WEIGHT`, `MIN_WEIGHT`, `MAX_WEIGHT`, `DEFAULT_WEIGHT`, `INTERACTIVE_SLICE_DIVISOR`, and `BATCH_SLICE_MULTIPLIER` are Phase D constants. (A sketch of both formulas follows this list.)
- Add the kernel-side mutation entry points behind the `SchedulingPolicyCap` dispatch only. No ambient process field, no per-process default, no syscall path that bypasses the cap.
- Validate: `cargo build --features qemu`, `cargo build --features qemu,measure`, `cargo test-lib`, `make capos-rt-check`.
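A minimal sketch of the weighted charge and the per-enqueue finish-time derivation under stated assumptions: the constant values, the standalone enum, and the reduced struct shape are illustrative; the two formulas are the ones this task fixes.

```rust
// Sketch only: all constant values here are assumed, not the Phase D
// constants themselves.
#[derive(Clone, Copy)]
pub enum LatencyClass {
    Interactive,
    Normal,
    Batch,
    IpcServer,
}

const REFERENCE_WEIGHT: u64 = 100;        // assumed value
const SLICE_NS: u64 = 4_000_000;          // assumed value (4 ms)
const INTERACTIVE_SLICE_DIVISOR: u64 = 4; // assumed value
const BATCH_SLICE_MULTIPLIER: u64 = 4;    // assumed value

pub struct ThreadCpuAccounting {
    pub runtime_ns: u64,
    pub virtual_runtime_ns: u64,
}

impl ThreadCpuAccounting {
    /// Real CPU time advances 1:1 (snapshot.runtimeNs is unchanged);
    /// vruntime advances inversely to weight, so a double-weight thread
    /// accrues vruntime at half speed and earns twice the share.
    pub fn charge_runtime(&mut self, elapsed_ns: u64, weight: u16) {
        self.runtime_ns += elapsed_ns;
        self.virtual_runtime_ns += elapsed_ns * REFERENCE_WEIGHT / weight as u64;
    }

    /// Derived fresh at every enqueue, never cached across blocking.
    pub fn virtual_finish_ns(&self, class: LatencyClass, weight: u16) -> u64 {
        let slice_ns = match class {
            LatencyClass::Normal | LatencyClass::IpcServer => SLICE_NS,
            LatencyClass::Interactive => SLICE_NS / INTERACTIVE_SLICE_DIVISOR,
            LatencyClass::Batch => SLICE_NS * BATCH_SLICE_MULTIPLIER,
        };
        self.virtual_runtime_ns + slice_ns * REFERENCE_WEIGHT / weight as u64
    }
}
```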
## Task 3: Per-CPU run queues and WFQ ordering
- Reintroduce `SchedulerDispatch.run_queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS]` (the per-CPU bounded queues retired in the 2026-05-02 collapse). Reuse the documented runnable-ownership invariants in `docs/architecture/scheduling.md` (single dispatch owner per live `ThreadRef` across the per-CPU `current`/`handoff_current` slots, the per-CPU run queues, and the direct IPC target slot).
- Reintroduce the per-CPU live-reservation accounting that the pre-collapse design used: reserve all per-CPU queues to the live runnable-capable thread count before publication; release on process/thread exit or pre-publication rollback. Timer, unblock, direct-IPC fallback, and steal-requeue paths must remain allocation-free.
- Order each per-CPU `VecDeque` ascending by `virtual_finish_ns`: enqueue inserts at the ordered position, and selection picks the front (lowest `virtual_finish_ns`, i.e. most overdue against fair share). The exact ordering structure (sorted insert vs. small bucket array) is an implementation choice; document the decision in `docs/architecture/scheduling.md`.
- Restore the bounded steal path: a CPU whose local queue is empty walks sibling per-CPU queues, bounded by `SCHEDULER_CPUS`. The steal target is the queue whose head has the lowest `virtual_finish_ns` among candidate sibling queues — that is the most overdue thread another CPU has not yet dispatched, and the same selection rule the local pick uses. Ties break by lower CPU id. The steal pops one `ThreadRef` from the source queue's front, recomputes `virtual_finish_ns` at the destination, and inserts at the destination's ordered position. (A sketch of the ordered enqueue and steal selection follows this list.)
- Restore `WakePolicy::QueueCpu(usize)` (or the WFQ-equivalent placement variant) so endpoint, timer, park, process-wait, and thread-join completions can target a specific per-CPU queue. The single-global-queue `WakePolicy::QueueAny` remains as the fallback under `CAPOS_SCHED_DISABLE_WFQ=1`.
- Add `CAPOS_SCHED_DISABLE_WFQ=1` as a runtime opt-out for one bisect cycle; remove before Phase E.
- Validate: `cargo build --features qemu`, `cargo build --features qemu,measure`, `cargo test-lib`, `cargo test-ring-loom`, `make run-spawn`, `make run-smp2-smokes`.
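A minimal sketch of the ordered enqueue and the steal-target rule, with `ThreadRef` reduced to its `virtual_finish_ns` key and plain `std` collections standing in for the kernel's allocation-free queues; both simplifications are assumptions of the sketch.

```rust
use std::collections::VecDeque;

/// Insert keeping the queue ascending by virtual_finish_ns. Using `<=`
/// places equal keys after existing ones, preserving FIFO order on ties.
fn enqueue_ordered(queue: &mut VecDeque<u64>, virtual_finish_ns: u64) {
    let pos = queue.partition_point(|&vf| vf <= virtual_finish_ns);
    queue.insert(pos, virtual_finish_ns);
}

/// Steal-target selection: the sibling queue whose head is most overdue
/// (lowest virtual_finish_ns). Strict `<` on later candidates breaks
/// ties toward the lower CPU id, matching the plan's tie rule.
fn steal_target(run_queues: &[VecDeque<u64>], self_cpu: usize) -> Option<usize> {
    let mut best: Option<(usize, u64)> = None;
    for (cpu, queue) in run_queues.iter().enumerate() {
        if cpu == self_cpu {
            continue;
        }
        if let Some(&head) = queue.front() {
            if best.map_or(true, |(_, best_vf)| head < best_vf) {
                best = Some((cpu, head));
            }
        }
    }
    best.map(|(cpu, _)| cpu)
}
```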
## Task 4: Migration fairness and weight propagation
- Verify (and document) that `virtual_runtime_ns` travels with the thread on every migration. The accounting record already encodes this; the WFQ enqueue path must explicitly recompute `virtual_finish_ns` from the vruntime and weight at the destination, never carry it as committed state.
- Increment `ThreadCpuAccounting.migrations` on each cross-CPU enqueue, both for placement-time spread and for steal. Mirror the pre-collapse counter shape.
- Prove that a thread whose weight changes through `SchedulingPolicyCap.setWeight` while it is enqueued observes the new weight on the next dequeue and re-enqueue; the weight must not be cached in `virtual_finish_ns` across blocking. (A property sketch follows this list.)
- Validate: `make run-spawn`, `make run-smp2-smokes`, `make run-thread-scale` (single-iteration functional check; the milestone gate runs in Task 6).
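A property-style sketch of the weight-propagation rule: with the vruntime held fixed, recomputing the finish time after a `setWeight` must use the new weight, and nothing from the stale enqueue may survive. The constants mirror the Task 2 sketch and are likewise assumed values.

```rust
const REFERENCE_WEIGHT: u64 = 100; // assumed value
const SLICE_NS: u64 = 4_000_000;   // assumed value

/// Normal-class finish time, recomputed from scratch at every enqueue.
fn virtual_finish_ns(vruntime_ns: u64, weight: u16) -> u64 {
    vruntime_ns + SLICE_NS * REFERENCE_WEIGHT / weight as u64
}

#[test]
fn reenqueue_observes_new_weight() {
    let vruntime_ns = 1_000_000;
    let before = virtual_finish_ns(vruntime_ns, 100); // enqueued at old weight
    let after = virtual_finish_ns(vruntime_ns, 200);  // setWeight doubled it
    // Heavier weight => earlier finish time => picked sooner on re-enqueue.
    assert!(after < before);
}
```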
## Task 5: Test matrix smokes
- CPU hogs. Reuse `make run-thread-scale`. Add an assertion path for the equal-weight case (existing shape) and a focused assertion for differing weights (e.g., a `system-thread-fairness.cue` that spawns three worker threads with weights `2:1:1` and asserts that the observed runtime ratio after a fixed window is approximately `2:1:1` within bench tolerance). The thread-fairness manifest is a NEW focused-proof manifest under `cue/defaults/`-style scaffolding. (A sketch of the ratio assertion follows this list.)
- Short sleepers. Add a focused QEMU proof (`make run-thread-fairness-interactive` is one option) that spawns one CPU-hog worker plus one Timer-sleeper worker, both at default weight, with the sleeper at latency class `Interactive`. Assert that the sleeper's observed wake-to-run latency stays below a configured bound (one quantum's worth as a starting target; tighten with bench evidence) and that the latency does not regress under contention.
- Direct IPC server/client pairs. Reuse `make run-spawn` (which already exercises endpoint direct-IPC). Add an assertion that a server thread woken by an endpoint CALL keeps paired-call timing comparable to the recorded baseline. The direct-IPC preference slot must keep its existing generation-checked semantics under WFQ.
- Multi-process load. Reuse `make run-smp-process-scale`. The recorded `1.6x` 1-to-2 gate must hold under WFQ.
- Same-process sibling load. This is the same shape as the Task 6 `make run-thread-scale` gate; the milestone gate covers it.
- Validate: each new smoke under `make run-*` passes; existing smokes remain green.
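A minimal sketch of the `2:1:1` assertion the thread-fairness smoke would make, assuming the proof harness can read each worker's `runtime_ns` after a fixed window; the helper and the tolerance value are illustrative.

```rust
/// Assert each thread's observed share of the window matches its
/// weight-proportional fair share within `tolerance` (absolute share).
fn assert_weighted_share(runtimes_ns: [u64; 3], weights: [u64; 3], tolerance: f64) {
    let total_runtime: u64 = runtimes_ns.iter().sum();
    let total_weight: u64 = weights.iter().sum();
    for (&runtime_ns, &weight) in runtimes_ns.iter().zip(&weights) {
        let observed = runtime_ns as f64 / total_runtime as f64;
        let expected = weight as f64 / total_weight as f64;
        assert!(
            (observed - expected).abs() <= tolerance,
            "observed share {observed:.3}, expected {expected:.3}"
        );
    }
}

// For the 2:1:1 shape the weight-2 worker should hold ~50% of the window
// and each weight-1 worker ~25%, within bench tolerance:
// assert_weighted_share([50_200_000, 24_900_000, 24_900_000], [2, 1, 1], 0.05);
```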
## Task 6: Milestone gate — controlled `make run-thread-scale`
- Run a 5-run controlled `make run-thread-scale` on `capos-bench`, pinned to physical-core logical CPUs `0,1,2,3`, against the post-WFQ kernel. Use the same benchmark shape as the recorded `2026-05-02 21:38 UTC` pair: blocking parent join, 262,144 blocks (16 MiB), `work_rounds=64`, `CAPOS_THREAD_SCALE_REQUIRE_SPEEDUP=1`, `CAPOS_THREAD_SCALE_REQUIRE_TOTAL_SPEEDUP=1`.
- Required outcome: a capOS work speedup of at least `2.5x` at 1-to-4 (the recorded baseline is `1.566x`). The 1-to-2 row must keep the configured `1.6x` gate. Total speedup is reported as a diagnostic and must not regress below the recorded `1.538x` 1-to-4 baseline.
- Rerun the matching Linux pthread baseline (`make run-linux-thread-scale-baseline`) on the same pin set so the comparison stays apples-to-apples; the Linux number is informational, not a gate.
- Capture raw artifacts under `target/thread-scale/<timestamp>/` and `target/linux-thread-scale/<timestamp>/`. Record the pair in the `docs/changelog.md` Phase D entry.
- If the gate is not met: do not weaken the threshold. Diagnose the remaining bottleneck (scheduler-lock hold time, steal scan cost, weight-application latency, per-CPU queue contention) and submit a follow-up slice under this plan; the gate stays at `2.5x`.
## Task 7: Documentation and closeout
- Update `docs/architecture/scheduling.md` to describe the per-CPU runnable queue, the WFQ ordering rule, the migration/steal contract, the `SchedulingPolicyCap` cap surface, and the new runnable-ownership invariants. Mark the single-global-queue `WakePolicy::QueueAny` and the `CAPOS_SCHED_DISABLE_WFQ=1` fallback as historical once retired.
- Update `docs/proposals/scheduler-evolution-proposal.md` Stage 3 status to “first slice landed” with commit hash and minute-precision timestamp; keep the EEVDF deferred follow-on note.
- Update `docs/backlog/scheduler-evolution.md` Phase D bullets with closeout stamps for each item; add the new “Phase D follow-on: EEVDF migration” item under Phase D so the deferred work is tracked.
- Update `docs/roadmap.md` scheduler section to reflect Phase D landed; sequence Phase E next.
- Update `WORKPLAN.md` to remove the active “Scheduler Phase D” bullet and add a “Scheduler Phase D landed (closeout)” bullet referencing the commit and the `make run-thread-scale` evidence pair.
- Update `docs/plans/README.md` Track Map row to mark this plan completed and move it to `docs/plans/completed/` per the directory's lifecycle contract.
- Add a `docs/changelog.md` entry under “Scheduler Phase D landed” with the recorded `make run-thread-scale` evidence pair, the matching Linux baseline, and the commit hash.
- Validate: every command under “Validation Commands” above passes; the closeout commit lands clean.