# Plan: Scheduler Phase D — Weighted-Fair Best-Effort Scheduling

## Overview
Implementation track for the Phase D best-effort fair-share policy chosen in `docs/proposals/scheduler-evolution-proposal.md`, “Phase D first-policy decision (2026-05-05 19:00 UTC)”. The selected policy is weighted fair queueing (WFQ) on top of the existing per-thread `runtime_ns` / `virtual_runtime_ns` accounting, with reintroduced per-CPU runnable queues and a capability-authorized `SchedulingPolicyCap` for weight and latency-class mutation. EEVDF remains the deferred follow-on: it is revisited once the WFQ slice has accepted thread-scale evidence, provided the open Phase E SchedulingContext work has not displaced fair-share-only ordering by then.

The proposal section linked above is the design source of truth for the policy choice, the rejected alternative (EEVDF-first), the capability surface, the migration fairness sketch, the test matrix, and overload behavior. The tasks below decompose that design into implementation work; each task ends with its matching validation gate. The plan runs ad-hoc until selected, then proceeds as the selected scheduler milestone.
This plan replaces the bare `WORKPLAN.md` bullet “Scheduler Phase D – best-effort fair scheduling” once the design slice merges. Phase E (SchedulingContext) and Phase F (auto-nohz / SQPOLL) keep their own backlog/plan ownership; this plan does not extend into them.
## Conflict Surface

Owned by this plan:
- `kernel/src/sched.rs` (per-CPU `run_queue` reintroduction, WFQ ordering helpers, migration/steal path, capability-authorized weight/class mutation hooks).
- `kernel/src/process.rs` (`Thread.weight`, `Thread.latency_class`, `Thread.virtual_finish_ns` field additions; default values match the current single-class FIFO behavior).
- `kernel/src/cap/sched_policy.rs` (NEW kernel cap implementation for `SchedulingPolicyCap`).
- `kernel/src/cap/mod.rs` (cap registration only).
- `schema/capos.capnp` (new `LatencyClass` enum and `SchedulingPolicyCap` interface; queues on the shared serial surface per `docs/plans/README.md` Concurrency Notes).
- `tools/generated/` (regenerated capnp bindings via the existing `make generated-code-check` gate).
- `capos-rt/src/client.rs` (new `SchedulingPolicyClient` typed wrapper for the userspace runtime).
- `capos-config/src/manifest.rs` and the matching schema additions for manifest-granted `SchedulingPolicyCap` records.
- `tools/qemu-thread-scale-harness.sh` (only if the harness needs WFQ-specific assertions; if a separate fairness smoke is added, it gets its own `tools/qemu-*-smoke.sh`).
- `docs/proposals/scheduler-evolution-proposal.md` (status updates and Phase D closeout stamps only; the design content already landed with this plan).
- `docs/plans/scheduler-phase-d.md` (this file).
- `docs/plans/README.md` Track Map row for this plan.
- `docs/architecture/scheduling.md` (state-of-implementation updates as per-CPU queues and WFQ ordering land; mark “single global runnable queue” as historical when the per-CPU split returns).
- `docs/backlog/scheduler-evolution.md` Phase D bullets and the matching closeout stamps.
Coordinated overlap with sibling tracks:
- `schema/capos.capnp`: serialise on the shared serial surface per `docs/plans/README.md` Concurrency Notes. Phase D adds new interface entries and must not run concurrently with another schema-touching plan.
- `kernel/src/cap/`: this plan adds one new cap module (`sched_policy.rs`) and touches the cap registration list. Other active plans that touch `kernel/src/cap/` (Device Driver Foundation, POSIX P1.2/P1.3) are kernel-core-serial work; do not run them concurrently with this plan.
- `kernel/src/sched.rs`: this plan owns scheduler-core changes. Other plans must not modify the runnable queue, dispatch state, or weight/class fields while this plan is active.
- `kernel/src/process.rs`: this plan adds `Thread` fields. Other plans must not modify `Thread` state during the active slice.
Do not touch from this plan:
- `kernel/src/cap/sched_context.rs` (Phase E surface; not yet written, owned by the future Phase E plan).
- `kernel/src/cap/cpu_isolation_lease.rs` (Phase F surface; not yet written, owned by the future Phase F plan).
- `kernel/src/cap/realtime_island.rs` (Phase G surface).
- Userspace policy service (Phase H); the Phase D cap surface must be Phase H-consumable, but Phase D does not build the policy service itself.
- `tools/remote-session-client/` (owned by the remote-session plan).
- `docs/topics.md` (auto-regenerated; never edit manually).
- Any unrelated proposal/plan file.
## Validation Commands
- `make fmt-check`
- `make generated-code-check`
- `cargo build --features qemu`
- `cargo build --features qemu,measure`
- `cargo test-config`
- `cargo test-lib`
- `cargo test-ring-loom`
- `cargo build-demos-capos`
- `make capos-rt-check`
- `make run-smoke`
- `make run-spawn`
- `make run-smp2-smokes`
- `make run-thread-scale` (the milestone gate; must materially close the recorded 1-to-4 capOS-vs-Linux gap)
- `make run-smp-process-scale` (regression gate; must keep the recorded 1-to-2 `1.6x` speedup against the multi-process proof)
- `make run-measure` (regression gate; the new accounting fields must not break the existing measure-mode proof line)
## Success Criteria
Phase D is recorded done when:
- The `SchedulingPolicyCap` interface is in `schema/capos.capnp`, the kernel cap implementation is in `kernel/src/cap/sched_policy.rs`, the manifest grant path is wired through `capos-config`, and the userspace typed client is in `capos-rt/src/client.rs`. A focused QEMU smoke proves that a manifest-granted cap can mutate weight and latency class on a target `ThreadHandle` and that a stale or revoked cap fails closed.
- Per-CPU runnable queues are reintroduced under the WFQ ordering rule. The single-global-queue fallback remains selectable via `CAPOS_SCHED_DISABLE_WFQ=1` for one bisect cycle and is retired before Phase E.
- Migration preserves `virtual_runtime_ns` (already per-thread) and recomputes `virtual_finish_ns` at destination enqueue. The bounded steal path picks the source queue whose head has the lowest `virtual_finish_ns` (the most overdue work another CPU has not yet dispatched), matching the local pick rule (front of the ascending per-CPU queue).
- The 1-to-4 capOS-vs-Linux thread-scale gap is materially closed. Concretely: a 5-run controlled `make run-thread-scale` against the post-WFQ kernel, pinned to physical-core logical CPUs `0,1,2,3` on `capos-bench`, must record a capOS work speedup of at least `2.5x` at 1-to-4 (the recorded baseline is `1.566x`; Linux records `3.963x` against the same shape on the same pin set). The 1-to-2 row must keep the configured `1.6x` gate. Total speedup is reported as a diagnostic and must not regress below the recorded `1.538x` 1-to-4 baseline.
- `make run-spawn`, `make run-smp2-smokes`, `make run-smp-process-scale`, and `make run-measure` remain green. The recorded multi-process `1.6x` 1-to-2 gate from `2026-04-30` must hold.
- The `docs/proposals/scheduler-evolution-proposal.md` Phase D section, the `docs/backlog/scheduler-evolution.md` Phase D bullets, `docs/architecture/scheduling.md`, `docs/changelog.md`, `WORKPLAN.md`, and `docs/roadmap.md` carry the closeout stamp with commit hash and minute-precision timestamp.
The plan is not scoped to deliver Phase E (`SchedulingContext` budget/period authority), Phase F (`CpuIsolationLease` and SQPOLL nohz), Phase G (`RealtimeIsland`), or Phase H (userspace policy service). Those phases are sequenced after Phase D and own their own plan files.
## Task 1: Schema and capability surface
- Add the `invalidArgument` variant to the existing `ExceptionType` enum in `schema/capos.capnp`. The current enum has only `failed`/`overloaded`/`disconnected`/`unimplemented`; the `setWeight` policy denial below needs a distinct typed signal (caller-bug rejection vs. general failure vs. back-pressure). This addition is part of the Phase D schema-surface acquisition documented in the proposal's Phase D capability-surface section. Keep the variant ordering stable for ABI compatibility.
- Add the `LatencyClass` enum (`interactive`, `normal`, `batch`, `ipcServer`) to `schema/capos.capnp` and regenerate bindings via `make generated-code-check`.
- Add the `SchedulingPolicyCap` interface with `setWeight`, `setLatencyClass`, and `snapshot` methods. The snapshot return is narrow: `weight`, `class`, `runtimeNs`, `virtualRuntimeNs`. Those four fields are the ones Task 2 promotes out of `cfg(feature = "measure")` unconditionally. Do NOT add `contextSwitches`, `preemptions`, `voluntaryBlocks`, or `migrations` to the ABI in this slice; those counters stay benchmark-only and would either fail to compile in the normal `qemu` build or expose fields the kernel does not track. A future operator-observability slice may add them through a separate snapshot cap.
- Implement `setWeight` validation at the cap boundary (not the dispatch path) with the rule from the proposal: `weight = 0` and any nonzero value outside `[MIN_WEIGHT, MAX_WEIGHT]` (Phase D constants) are rejected with `CapException::InvalidArgument`. The kernel does NOT silently clamp out-of-range values; a future caller/test can rely on the rejection signal. This ensures no later divide-by-zero or overflow path is reachable through the cap. (A sketch of the rule follows this list.)
- Add a `KernelCapSource::SchedulingPolicy` variant under the manifest grant path so a manifest can grant the cap to a named process. Phase D grants the cap only to focused-proof manifests (`system-thread-fairness.cue` and similar Task 5 smokes); the default boot manifest does NOT grant the cap in this slice. Wider authority (cross-process weight/class mutation, default-grant to a userspace policy service) belongs to the future Phase H plan.
- Add a `capos-rt::client::SchedulingPolicyClient` typed wrapper that maps transport errors and the `CapException` decode shape consistently with the existing clients.
- Validate: `make fmt-check`, `make generated-code-check`, `cargo test-config`, `cargo test-lib`, `cargo build --features qemu` (warning-free).
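A minimal sketch of the cap-boundary rule, assuming illustrative constant values and a standalone `CapException` enum (the real constants and error plumbing live in the kernel cap module); the plan fixes only the behavior: reject zero and out-of-range weights, never clamp.

```rust
// Sketch only. The MIN_WEIGHT / MAX_WEIGHT values and this standalone
// CapException enum are illustrative assumptions, not the Phase D
// constants themselves.
#[derive(Debug, PartialEq)]
pub enum CapException {
    InvalidArgument,
}

const MIN_WEIGHT: u16 = 1;    // assumed value; real Phase D constant
const MAX_WEIGHT: u16 = 1024; // assumed value; real Phase D constant

/// Cap-boundary validation: rejecting weight = 0 (and anything outside
/// the window) here means no later REFERENCE_WEIGHT / weight division is
/// reachable with a zero divisor, so the dispatch path never re-checks.
pub fn validate_weight(weight: u16) -> Result<u16, CapException> {
    if !(MIN_WEIGHT..=MAX_WEIGHT).contains(&weight) {
        return Err(CapException::InvalidArgument);
    }
    Ok(weight)
}
```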
## Task 2: Per-thread weight and latency-class state
- Add `weight: u16` and `latency_class: LatencyClass` fields to `Thread` (in `kernel/src/process.rs`), with default values matching the current single-class behavior (`weight = DEFAULT_WEIGHT`, `latency_class = LatencyClass::Normal`). These fields must be unconditional (not behind `cfg(feature = "measure")`) because they participate in dispatch ordering.
- Promote `runtime_ns`, `virtual_runtime_ns`, and `last_started_ns` from `ThreadCpuAccounting` out of `cfg(feature = "measure")` so the WFQ ordering, the runtime-charge path, and the `snapshot` cap method work in the normal `qemu` build. The `context_switches`, `preemptions`, `voluntary_blocks`, and `migrations` counters stay behind the `measure` feature and are NOT exposed through `SchedulingPolicyCap.snapshot` in this slice. Document the choice in `docs/architecture/scheduling.md`.
- Change the `charge_runtime` step so `virtual_runtime_ns` advances by `elapsed_ns * REFERENCE_WEIGHT / weight` instead of 1:1 with `elapsed_ns`. `runtime_ns` continues to advance 1:1 with elapsed time so monotonic CPU accounting and `snapshot.runtimeNs` are unchanged. This is the actual fairness mechanism; without it, weights affect only enqueue-order ties rather than cumulative share.
- Add `virtual_finish_ns: u64`, derived per enqueue and not stored across blocking. The derivation rule depends on `latency_class` per the proposal's “Latency-class semantics for Phase D” subsection: `Normal` and `IpcServer` use `vruntime + slice_ns * REFERENCE_WEIGHT / weight`; `Interactive` uses `vruntime + (slice_ns / INTERACTIVE_SLICE_DIVISOR) * REFERENCE_WEIGHT / weight`; `Batch` uses `vruntime + (slice_ns * BATCH_SLICE_MULTIPLIER) * REFERENCE_WEIGHT / weight`. `slice_ns`, `REFERENCE_WEIGHT`, `MIN_WEIGHT`, `MAX_WEIGHT`, `DEFAULT_WEIGHT`, `INTERACTIVE_SLICE_DIVISOR`, and `BATCH_SLICE_MULTIPLIER` are Phase D constants. (A sketch of both formulas follows this list.)
- Add the kernel-side mutation entry points behind the `SchedulingPolicyCap` dispatch only. No ambient process field, no per-process default, no syscall path that bypasses the cap.
- Validate: `cargo build --features qemu`, `cargo build --features qemu,measure`, `cargo test-lib`, `make capos-rt-check`.
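A minimal sketch of the weighted charge and the per-enqueue finish-time derivation under stated assumptions: the constant values, the standalone enum, and the reduced struct shape are illustrative; the two formulas are the ones this task fixes.

```rust
// Sketch only: all constant values here are assumed, not the Phase D
// constants themselves.
#[derive(Clone, Copy)]
pub enum LatencyClass {
    Interactive,
    Normal,
    Batch,
    IpcServer,
}

const REFERENCE_WEIGHT: u64 = 100;        // assumed value
const SLICE_NS: u64 = 4_000_000;          // assumed value (4 ms)
const INTERACTIVE_SLICE_DIVISOR: u64 = 4; // assumed value
const BATCH_SLICE_MULTIPLIER: u64 = 4;    // assumed value

pub struct ThreadCpuAccounting {
    pub runtime_ns: u64,
    pub virtual_runtime_ns: u64,
}

impl ThreadCpuAccounting {
    /// Real CPU time advances 1:1 (snapshot.runtimeNs is unchanged);
    /// vruntime advances inversely to weight, so a double-weight thread
    /// accrues vruntime at half speed and earns twice the share.
    pub fn charge_runtime(&mut self, elapsed_ns: u64, weight: u16) {
        self.runtime_ns += elapsed_ns;
        self.virtual_runtime_ns += elapsed_ns * REFERENCE_WEIGHT / weight as u64;
    }

    /// Derived fresh at every enqueue, never cached across blocking.
    pub fn virtual_finish_ns(&self, class: LatencyClass, weight: u16) -> u64 {
        let slice_ns = match class {
            LatencyClass::Normal | LatencyClass::IpcServer => SLICE_NS,
            LatencyClass::Interactive => SLICE_NS / INTERACTIVE_SLICE_DIVISOR,
            LatencyClass::Batch => SLICE_NS * BATCH_SLICE_MULTIPLIER,
        };
        self.virtual_runtime_ns + slice_ns * REFERENCE_WEIGHT / weight as u64
    }
}
```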
## Task 3: Per-CPU run queues and WFQ ordering
- Reintroduce `SchedulerDispatch.run_queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS]` (the per-CPU bounded queues retired in the 2026-05-02 collapse). Reuse the documented runnable-ownership invariants in `docs/architecture/scheduling.md` (single dispatch owner per live `ThreadRef` across the per-CPU `current`/`handoff_current` slots, the per-CPU run queues, and the direct IPC target slot).
- Reintroduce the per-CPU live-reservation accounting that the pre-collapse design used: reserve all per-CPU queues to the live runnable-capable thread count before publication; release on process/thread exit or pre-publication rollback. Timer, unblock, direct-IPC fallback, and steal-requeue paths must remain allocation-free.
- Order each per-CPU `VecDeque` ascending by `virtual_finish_ns`: enqueue inserts at the ordered position, and selection picks the front (lowest `virtual_finish_ns`, i.e. most overdue against fair share). The exact ordering structure (sorted insert vs. small bucket array) is an implementation choice; document the decision in `docs/architecture/scheduling.md`.
- Restore the bounded steal path: a CPU whose local queue is empty walks sibling per-CPU queues, bounded by `SCHEDULER_CPUS`. The steal target is the queue whose head has the lowest `virtual_finish_ns` among candidate sibling queues — that is the most overdue thread another CPU has not yet dispatched, and the same selection rule the local pick uses. Ties break by lower CPU id. The steal pops one `ThreadRef` from the source queue's front, recomputes `virtual_finish_ns` at the destination, and inserts at the destination's ordered position. (A sketch of the ordered enqueue and steal selection follows this list.)
- Restore `WakePolicy::QueueCpu(usize)` (or the WFQ-equivalent placement variant) so endpoint, timer, park, process-wait, and thread-join completions can target a specific per-CPU queue. The single-global-queue `WakePolicy::QueueAny` remains as the fallback under `CAPOS_SCHED_DISABLE_WFQ=1`.
- Add `CAPOS_SCHED_DISABLE_WFQ=1` as a runtime opt-out for one bisect cycle; remove before Phase E.
- Validate: `cargo build --features qemu`, `cargo build --features qemu,measure`, `cargo test-lib`, `cargo test-ring-loom`, `make run-spawn`, `make run-smp2-smokes`.
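A minimal sketch of the ordered enqueue and the steal-target rule, with `ThreadRef` reduced to its `virtual_finish_ns` key and plain `std` collections standing in for the kernel's allocation-free queues; both simplifications are assumptions of the sketch.

```rust
use std::collections::VecDeque;

/// Insert keeping the queue ascending by virtual_finish_ns. Using `<=`
/// places equal keys after existing ones, preserving FIFO order on ties.
fn enqueue_ordered(queue: &mut VecDeque<u64>, virtual_finish_ns: u64) {
    let pos = queue.partition_point(|&vf| vf <= virtual_finish_ns);
    queue.insert(pos, virtual_finish_ns);
}

/// Steal-target selection: the sibling queue whose head is most overdue
/// (lowest virtual_finish_ns). Strict `<` on later candidates breaks
/// ties toward the lower CPU id, matching the plan's tie rule.
fn steal_target(run_queues: &[VecDeque<u64>], self_cpu: usize) -> Option<usize> {
    let mut best: Option<(usize, u64)> = None;
    for (cpu, queue) in run_queues.iter().enumerate() {
        if cpu == self_cpu {
            continue;
        }
        if let Some(&head) = queue.front() {
            if best.map_or(true, |(_, best_vf)| head < best_vf) {
                best = Some((cpu, head));
            }
        }
    }
    best.map(|(cpu, _)| cpu)
}
```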
## Task 4: Migration fairness and weight propagation
- Verify (and document) that `virtual_runtime_ns` travels with the thread on every migration. The accounting record already encodes this; the WFQ enqueue path must explicitly recompute `virtual_finish_ns` from the vruntime and weight at the destination, never carry it as committed state.
- Increment `ThreadCpuAccounting.migrations` on each cross-CPU enqueue, both for placement-time spread and for steal. Mirror the pre-collapse counter shape.
- Prove that a thread whose weight changes through `SchedulingPolicyCap.setWeight` while it is enqueued observes the new weight on the next dequeue and re-enqueue; the weight must not be cached in `virtual_finish_ns` across blocking. (A property sketch follows this list.)
- Validate: `make run-spawn`, `make run-smp2-smokes`, `make run-thread-scale` (single-iteration functional check; the milestone gate runs in Task 6).
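A property-style sketch of the weight-propagation rule: with the vruntime held fixed, recomputing the finish time after a `setWeight` must use the new weight, and nothing from the stale enqueue may survive. The constants mirror the Task 2 sketch and are likewise assumed values.

```rust
const REFERENCE_WEIGHT: u64 = 100; // assumed value
const SLICE_NS: u64 = 4_000_000;   // assumed value

/// Normal-class finish time, recomputed from scratch at every enqueue.
fn virtual_finish_ns(vruntime_ns: u64, weight: u16) -> u64 {
    vruntime_ns + SLICE_NS * REFERENCE_WEIGHT / weight as u64
}

#[test]
fn reenqueue_observes_new_weight() {
    let vruntime_ns = 1_000_000;
    let before = virtual_finish_ns(vruntime_ns, 100); // enqueued at old weight
    let after = virtual_finish_ns(vruntime_ns, 200);  // setWeight doubled it
    // Heavier weight => earlier finish time => picked sooner on re-enqueue.
    assert!(after < before);
}
```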
## Task 5: Test matrix smokes
- CPU hogs. Reuse `make run-thread-scale`. Add an assertion path for the equal-weight case (existing shape) and a focused assertion for differing weights (e.g., a `system-thread-fairness.cue` that spawns three worker threads with weights `2:1:1` and asserts that the observed runtime ratio after a fixed window is approximately `2:1:1` within bench tolerance). The thread-fairness manifest is a NEW focused-proof manifest under `cue/defaults/`-style scaffolding. (A sketch of the ratio assertion follows this list.)
- Short sleepers. Add a focused QEMU proof (`make run-thread-fairness-interactive` is one option) that spawns one CPU-hog worker plus one Timer-sleeper worker, both at default weight, with the sleeper at latency class `Interactive`. Assert that the sleeper's observed wake-to-run latency stays below a configured bound (one quantum's worth as a starting target; tighten with bench evidence) and that the latency does not regress under contention.
- Direct IPC server/client pairs. Reuse `make run-spawn` (which already exercises endpoint direct-IPC). Add an assertion that a server thread woken by an endpoint CALL keeps paired-call timing comparable to the recorded baseline. The direct-IPC preference slot must keep its existing generation-checked semantics under WFQ.
- Multi-process load. Reuse `make run-smp-process-scale`. The recorded `1.6x` 1-to-2 gate must hold under WFQ.
- Same-process sibling load. This is the same shape as the Task 6 `make run-thread-scale` gate; the milestone gate covers it.
- Validate: each new smoke under `make run-*` passes; existing smokes remain green.
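A minimal sketch of the `2:1:1` assertion the thread-fairness smoke would make, assuming the proof harness can read each worker's `runtime_ns` after a fixed window; the helper and the tolerance value are illustrative.

```rust
/// Assert each thread's observed share of the window matches its
/// weight-proportional fair share within `tolerance` (absolute share).
fn assert_weighted_share(runtimes_ns: [u64; 3], weights: [u64; 3], tolerance: f64) {
    let total_runtime: u64 = runtimes_ns.iter().sum();
    let total_weight: u64 = weights.iter().sum();
    for (&runtime_ns, &weight) in runtimes_ns.iter().zip(&weights) {
        let observed = runtime_ns as f64 / total_runtime as f64;
        let expected = weight as f64 / total_weight as f64;
        assert!(
            (observed - expected).abs() <= tolerance,
            "observed share {observed:.3}, expected {expected:.3}"
        );
    }
}

// For the 2:1:1 shape the weight-2 worker should hold ~50% of the window
// and each weight-1 worker ~25%, within bench tolerance:
// assert_weighted_share([50_200_000, 24_900_000, 24_900_000], [2, 1, 1], 0.05);
```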
## Task 6: Milestone gate — controlled `make run-thread-scale`
- Run a 5-run controlled `make run-thread-scale` on `capos-bench`, pinned to physical-core logical CPUs `0,1,2,3`, against the post-WFQ kernel. Use the same benchmark shape as the recorded `2026-05-02 21:38 UTC` pair: blocking parent join, 262,144 blocks (16 MiB), `work_rounds=64`, `CAPOS_THREAD_SCALE_REQUIRE_SPEEDUP=1`, `CAPOS_THREAD_SCALE_REQUIRE_TOTAL_SPEEDUP=1`.
- Required outcome: a capOS work speedup of at least `2.5x` at 1-to-4 (the recorded baseline is `1.566x`). The 1-to-2 row must keep the configured `1.6x` gate. Total speedup is reported as a diagnostic and must not regress below the recorded `1.538x` 1-to-4 baseline.
- Rerun the matching Linux pthread baseline (`make run-linux-thread-scale-baseline`) on the same pin set so the comparison stays apples-to-apples; the Linux number is informational, not a gate.
- Capture raw artifacts under `target/thread-scale/<timestamp>/` and `target/linux-thread-scale/<timestamp>/`. Record the pair in the `docs/changelog.md` Phase D entry.
- If the gate is not met: do not weaken the threshold. Diagnose the remaining bottleneck (scheduler-lock hold time, steal scan cost, weight-application latency, per-CPU queue contention) and submit a follow-up slice under this plan; the gate stays at `2.5x`.
## Task 7: Documentation and closeout
- Update `docs/architecture/scheduling.md` to describe the per-CPU runnable queue, the WFQ ordering rule, the migration/steal contract, the `SchedulingPolicyCap` cap surface, and the new runnable-ownership invariants. Mark the single-global-queue `WakePolicy::QueueAny` and the `CAPOS_SCHED_DISABLE_WFQ=1` fallback as historical once retired.
- Update `docs/proposals/scheduler-evolution-proposal.md` Stage 3 status to “first slice landed” with commit hash and minute-precision timestamp; keep the EEVDF deferred follow-on note.
- Update `docs/backlog/scheduler-evolution.md` Phase D bullets with closeout stamps for each item; add the new “Phase D follow-on: EEVDF migration” item under Phase D so the deferred work is tracked.
- Update `docs/roadmap.md` scheduler section to reflect Phase D landed; sequence Phase E next.
- Update `WORKPLAN.md` to remove the active “Scheduler Phase D” bullet and add a “Scheduler Phase D landed (closeout)” bullet referencing the commit and the `make run-thread-scale` evidence pair.
- Update `docs/plans/README.md` Track Map row to mark this plan completed and move it to `docs/plans/completed/` per the directory's lifecycle contract.
- Add a `docs/changelog.md` entry under “Scheduler Phase D landed” with the recorded `make run-thread-scale` evidence pair, the matching Linux baseline, and the commit hash.
- Validate: every command under “Validation Commands” above passes; the closeout commit lands clean.