Plan: Scheduler Phase D — Weighted-Fair Best-Effort Scheduling
Overview
Implementation track for the Phase D best-effort fair-share policy
chosen in docs/proposals/scheduler-evolution-proposal.md “Phase D
first-policy decision (2026-05-05 19:00 UTC)”. The selected policy is
weighted fair queueing (WFQ) on top of the existing per-thread
runtime_ns / virtual_runtime_ns accounting, with reintroduced
per-CPU runnable queues and a capability-authorized
SchedulingPolicyCap for weight and latency-class mutation. EEVDF
remains the deferred follow-on, to be revisited once the WFQ slice has
accepted thread-scale evidence, provided the open Phase E
SchedulingContext work has not by then displaced fair-share-only ordering.
The proposal section linked above is the design source of truth for the policy choice, the rejected alternative (EEVDF-first), the capability surface, the migration fairness sketch, the test matrix, and overload behavior. The tasks below decompose that design into implementation work; each task ends with the matching validation gate. The plan is ad-hoc until selected and then runs as the selected scheduler milestone.
This completed plan replaced the bare WORKPLAN bullet “Scheduler Phase D –
best-effort fair scheduling” while the WFQ slice was active. Phase E
(SchedulingContext) and Phase F (auto-nohz / SQPOLL) keep their own
backlog/plan ownership; this plan does not extend into them.
Conflict Surface
Historical ownership while this plan was active:
- `kernel/src/sched.rs` (per-CPU `run_queue` reintroduction, WFQ ordering helpers, migration/steal path, capability-authorized weight/class mutation hooks).
- `kernel/src/process.rs` (`Thread.weight`, `Thread.latency_class`, `Thread.virtual_finish_ns` field additions; default values match the current single-class FIFO behavior).
- `kernel/src/cap/sched_policy.rs` (NEW kernel cap implementation for `SchedulingPolicyCap`).
- `kernel/src/cap/mod.rs` (cap registration only).
- `schema/capos.capnp` (new `LatencyClass` enum and `SchedulingPolicyCap` interface; queues on the shared serial surface per `docs/plans/README.md` Concurrency Notes).
- `tools/generated/` (regenerated capnp bindings via the existing `make generated-code-check` gate).
- `capos-rt/src/client.rs` (new `SchedulingPolicyClient` typed wrapper for the userspace runtime).
- `capos-config/src/manifest.rs` and the matching schema additions for manifest-granted `SchedulingPolicyCap` records.
- `tools/qemu-thread-scale-harness.sh` (only if the harness needs WFQ-specific assertions; if a separate fairness smoke is added, it gets its own `tools/qemu-*-smoke.sh`).
- `docs/proposals/scheduler-evolution-proposal.md` (status updates and Phase D closeout stamps only; the design content already landed with this plan).
- `docs/plans/completed/scheduler-phase-d.md` (this file).
- `docs/plans/README.md` Track Map row for this plan.
- `docs/architecture/scheduling.md` (state-of-implementation updates as per-CPU queues and WFQ ordering land; mark “single global runnable queue” as historical when the per-CPU split returns).
- `docs/backlog/scheduler-evolution.md` Phase D bullets and the matching closeout stamps.
Historical coordinated overlap with sibling tracks:
- `schema/capos.capnp`: serialise on the shared serial surface per `docs/plans/README.md` Concurrency Notes. Phase D adds new interface entries and did not run concurrently with another schema-touching plan.
- `kernel/src/cap/`: this plan adds one new cap module (`sched_policy.rs`) and touches the cap registration list. Other active plans that touched `kernel/src/cap/` (Device Driver Foundation, POSIX P1.2/P1.3) were kernel-core-serial work.
- `kernel/src/sched.rs`: this plan owned scheduler-core changes. Other plans did not modify the runnable queue, dispatch state, or weight/class fields while this plan was active.
- `kernel/src/process.rs`: this plan added Thread fields. Other plans did not modify Thread state during the active slice.
Do not touch from this plan:
- `kernel/src/cap/sched_context.rs` (Phase E surface; not yet written, owned by the future Phase E plan).
- `kernel/src/cap/cpu_isolation_lease.rs` (Phase F surface; not yet written, owned by the future Phase F plan).
- `kernel/src/cap/realtime_island.rs` (Phase G surface).
- Userspace policy service (Phase H); the Phase D cap surface must be Phase H-consumable, but Phase D does not build the policy service itself.
- `tools/remote-session-client/` (owned by the remote-session plan).
- `docs/topics.md` (auto-regenerated; never edit manually).
- Any unrelated proposal/plan file.
Validation Commands
- `make fmt-check`
- `make generated-code-check`
- `cargo build --features qemu`
- `cargo build --features qemu,measure`
- `cargo test-config`
- `cargo test-lib`
- `cargo test-ring-loom`
- `cargo build-demos-capos`
- `make capos-rt-check`
- `make run-smoke`
- `make run-spawn`
- `make run-smp2-smokes`
- `make run-thread-scale` (the milestone gate; must materially close the recorded 1-to-4 capOS-vs-Linux gap)
- `make run-smp-process-scale` (regression gate; must keep the recorded 1-to-2 1.6x speedup against the multi-process proof)
- `make run-measure` (regression gate; the new accounting fields must not break the existing measure-mode proof line)
Success Criteria
Phase D is recorded done when:
- The `SchedulingPolicyCap` interface is in `schema/capos.capnp`, the kernel cap implementation is in `kernel/src/cap/sched_policy.rs`, the manifest grant path is wired through `capos-config`, and the userspace typed client is in `capos-rt/src/client.rs`. A focused QEMU smoke proves a manifest-granted cap can mutate weight and latency class on a target `ThreadHandle` and that a stale or revoked cap fails closed.
- Per-CPU runnable queues are reintroduced under the WFQ ordering rule. The single-global-queue fallback remained selectable via `CAPOS_SCHED_DISABLE_WFQ=1` for one bisect cycle and was retired by Phase E preflight.
- Migration preserves `virtual_runtime_ns` (already per-thread) and recomputes `virtual_finish_ns` at destination enqueue. The bounded steal path scans each sibling queue by index for that queue’s first Runnable-for-destination entry (because each queue is ordered ascending by `virtual_finish_ns`, the first Runnable hit per queue is the lowest candidate the destination can accept on that source) and picks the queue whose first-Runnable candidate has the lowest `virtual_finish_ns` globally (the most overdue work another CPU has not yet dispatched), with ties broken by lower CPU id, matching the fair-share rule the local pick uses.
- Materially close the 1-to-4 capOS-vs-Linux thread-scale gap. Concretely: a 5-run controlled `make run-thread-scale` against the post-WFQ kernel, pinned to physical-core logical CPUs 0,1,2,3 on capos-bench, must record capOS work speedup of at least 2.5x at 1-to-4 (the recorded baseline is 1.566x; Linux records 3.963x against the same shape on the same pin set). The 1-to-2 row must keep the configured 1.6x gate. Total speedup is reported as diagnostic and must not regress below the recorded 1.538x 1-to-4 baseline.
- `make run-spawn`, `make run-smp2-smokes`, `make run-smp-process-scale`, and `make run-measure` remain green. The recorded multi-process 1.6x 1-to-2 gate from 2026-04-30 must hold.
- `docs/proposals/scheduler-evolution-proposal.md` Phase D section, `docs/backlog/scheduler-evolution.md` Phase D bullets, `docs/architecture/scheduling.md`, `docs/changelog.md`, `WORKPLAN.md`, and `docs/roadmap.md` carry the closeout stamp with commit hash and minute-precision timestamp.
The plan is not scoped to deliver Phase E
(SchedulingContext budget/period authority), Phase F
(CpuIsolationLease and SQPOLL nohz), Phase G
(RealtimeIsland), or Phase H (userspace policy service). Those
are sequenced after Phase D and own their own plan files.
Task 1: Schema and capability surface
Status: landed 2026-05-07 21:59 UTC at commit cb8c58b1
(sched(phase-d-task1): schema + capability surface). Tasks 2-4
are unblocked. Tasks 4-6 are unblocked by Task 3 (2026-05-07 23:45
UTC).
- Add the `invalidArgument` variant to the existing `ExceptionType` enum in `schema/capos.capnp`. The current enum has only `failed`/`overloaded`/`disconnected`/`unimplemented`; `setWeight` policy denial below needs a distinct typed signal (caller-bug rejection vs general failure vs back-pressure). This addition is part of the Phase D schema-surface acquisition documented in the proposal Phase D capability surface section. Keep the variant ordering stable for ABI compatibility.
- Add the `LatencyClass` enum (`interactive`, `normal`, `batch`, `ipcServer`) to `schema/capos.capnp` and regenerate bindings via `make generated-code-check`.
- Add the `SchedulingPolicyCap` interface with `setWeight`, `setLatencyClass`, and `snapshot` methods. The snapshot return is narrow: `weight`, `class`, `runtimeNs`, `virtualRuntimeNs`. Those four fields are the ones Task 2 promotes out of `cfg(feature = "measure")` unconditionally. Do NOT add `contextSwitches`, `preemptions`, `voluntaryBlocks`, or `migrations` to the ABI in this slice; those counters stay benchmark-only and would either fail to compile in the normal `qemu` build or expose fields the kernel does not track. A future operator-observability slice may add them through a separate snapshot cap.
- Implement `setWeight` validation at the cap boundary (not the dispatch path) with the rule from the proposal: `weight = 0` and any nonzero value outside `[MIN_WEIGHT, MAX_WEIGHT]` (Phase D constants) are rejected with `CapException::InvalidArgument`. The kernel does NOT silently clamp out-of-range values; a future caller/test can rely on the rejection signal. This ensures no later divide-by-zero or overflow path is reachable through the cap. Implementation lives in `kernel/src/cap/sched_policy.rs`; the typed exception flows through the existing `kernel/src/cap/ring.rs` dispatcher via a sentinel-prefix channel because capnp 0.25 has no `ErrorKind::InvalidArgument` variant and the enum is `#[non_exhaustive]`. (See the sketch after this list.)
- Add a `KernelCapSource::SchedulingPolicy` variant under the manifest grant path so a manifest can grant the cap to a named process. Phase D grants the cap only to focused-proof manifests (`system-thread-fairness.cue` and similar Task 5 smokes); the default boot manifest does NOT grant the cap in this slice. Wider authority (cross-process weight/class mutation, default-grant to a userspace policy service) belongs to the future Phase H plan.
- Add a `capos-rt::client::SchedulingPolicyClient` typed wrapper that maps transport errors and `CapException` decode shape consistently with the existing clients.
- Validate: `make fmt-check`, `make generated-code-check`, `cargo test-config`, `cargo test-lib`, `cargo build --features qemu` (warning-free).
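For illustration, a minimal self-contained sketch of the boundary rule named above. The error type, bound values, and helper name here are assumptions, not the landed `sched_policy.rs` code:

```rust
// Sketch only: the Phase D setWeight rule rejects (never clamps) out-of-range
// weights at the cap boundary, so no divide-by-zero or overflow path is
// reachable through the cap. Names and bound values below are illustrative.
const MIN_WEIGHT: u16 = 1; // assumed placeholder
const MAX_WEIGHT: u16 = 1024; // assumed placeholder

#[derive(Debug, PartialEq)]
enum CapError {
    InvalidArgument, // surfaces as the new invalidArgument ExceptionType variant
}

fn validate_weight(weight: u16) -> Result<u16, CapError> {
    if weight == 0 || !(MIN_WEIGHT..=MAX_WEIGHT).contains(&weight) {
        return Err(CapError::InvalidArgument);
    }
    Ok(weight)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn rejects_zero_and_out_of_range_without_clamping() {
        assert_eq!(validate_weight(0), Err(CapError::InvalidArgument));
        assert_eq!(validate_weight(MAX_WEIGHT + 1), Err(CapError::InvalidArgument));
        assert!(validate_weight(MIN_WEIGHT).is_ok());
        assert!(validate_weight(MAX_WEIGHT).is_ok());
    }
}
```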
Task 2: Per-thread weight and latency-class state
Closeout: 2026-05-07 22:51 UTC. Cap-state binding decision:
context-derived caller-thread fallback. The cap routes every
method (setWeight, setLatencyClass, snapshot) to
CapCallContext::caller_thread; cross-thread/cross-process
mutation is deferred to Phase H. Phase D constants moved from
kernel/src/cap/sched_policy.rs into capos-abi/src/scheduler.rs
(MIN_WEIGHT, MAX_WEIGHT, DEFAULT_WEIGHT, REFERENCE_WEIGHT,
plus the new INTERACTIVE_SLICE_DIVISOR = 2 and
BATCH_SLICE_MULTIPLIER = 4).
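As a reader aid, a sketch of what that `capos-abi/src/scheduler.rs` constant surface plausibly looks like. Only `DEFAULT_WEIGHT = 64` (quoted in the Task 5 fairness demo), `INTERACTIVE_SLICE_DIVISOR = 2`, and `BATCH_SLICE_MULTIPLIER = 4` are recorded in this plan; `REFERENCE_WEIGHT = 64` is inferred from the vruntime/runtime ratios quoted in the Task 5 weight-change smoke, and the `MIN_WEIGHT`/`MAX_WEIGHT` values are illustrative placeholders:

```rust
// Sketch only: assumed shape of the Phase D constant surface in capos-abi.
// Values marked "assumed" are placeholders, not the landed numbers.
pub const MIN_WEIGHT: u16 = 1; // assumed (nonzero so vruntime scaling never divides by zero)
pub const MAX_WEIGHT: u16 = 1024; // assumed upper bound
pub const DEFAULT_WEIGHT: u16 = 64; // recorded in the Task 5 fairness demo (128:64:64)
pub const REFERENCE_WEIGHT: u64 = 64; // inferred: weight=128 yields vruntime/runtime ~= 0.5
pub const INTERACTIVE_SLICE_DIVISOR: u64 = 2; // recorded in the Task 2 closeout
pub const BATCH_SLICE_MULTIPLIER: u64 = 4; // recorded in the Task 2 closeout
```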
- Add `weight: u16` and `latency_class: LatencyClass` fields to `Thread` (in `kernel/src/process.rs`), with default values matching the current single-class behavior (`weight = DEFAULT_WEIGHT`, `latency_class = LatencyClass::Normal`). These fields must be unconditional (not behind `cfg(feature = "measure")`) because they participate in dispatch ordering.
- Promote `runtime_ns`, `virtual_runtime_ns`, and `last_started_ns` from `ThreadCpuAccounting` out of `cfg(feature = "measure")` so the WFQ ordering, the runtime-charge path, and the `snapshot` cap method work in the normal `qemu` build. The `context_switches`, `preemptions`, `voluntary_blocks`, and `migrations` counters stay behind the `measure` feature and are NOT exposed through `SchedulingPolicyCap.snapshot` in this slice. Documented in `docs/architecture/scheduling.md`.
- Change the `charge_runtime` step so `virtual_runtime_ns` advances by `elapsed_ns * REFERENCE_WEIGHT / weight` instead of 1:1 with `elapsed_ns`. `runtime_ns` continues to advance 1:1 with elapsed time so monotonic CPU accounting and `snapshot.runtimeNs` are unchanged. This is the actual fairness mechanism; without it, weights affect only enqueue-order ties rather than cumulative share.
- Add `virtual_finish_ns: u64` derived per enqueue and not stored across blocking. The derivation rule depends on `latency_class` per the proposal’s “Latency-class semantics for Phase D” subsection: `Normal` and `IpcServer` use `vruntime + slice_ns * REFERENCE_WEIGHT / weight`; `Interactive` uses `vruntime + (slice_ns / INTERACTIVE_SLICE_DIVISOR) * REFERENCE_WEIGHT / weight`; `Batch` uses `vruntime + (slice_ns * BATCH_SLICE_MULTIPLIER) * REFERENCE_WEIGHT / weight`. `slice_ns` is the existing `crate::arch::context::TICK_NS` quantum; `REFERENCE_WEIGHT`, `MIN_WEIGHT`, `MAX_WEIGHT`, `DEFAULT_WEIGHT`, `INTERACTIVE_SLICE_DIVISOR`, and `BATCH_SLICE_MULTIPLIER` live in `capos-abi/src/scheduler.rs`. (A sketch of the charge and derivation formulas follows this list.)
- Add the kernel-side mutation entry points behind the `SchedulingPolicyCap` dispatch only. No ambient process field, no per-process default, no syscall path that bypasses the cap. Caller-thread binding through `CapCallContext::caller_thread`; idle thread and stale thread refs return the standard `CallerNotLive` failure that surfaces to userspace as the disconnected-class `CapException` taxonomy entry.
- Validate: `cargo build --features qemu`, `cargo build --features qemu,measure`, `cargo test-lib`, `make capos-rt-check`. Regression: `make run-smoke`, `make run-spawn`, `make run-measure`.
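A compact sketch of the two derivations above. The formulas are taken from the bullets; the struct and method shapes are illustrative, the constant values follow the assumed sketch under the Task 2 closeout, and `TICK_NS = 10 ms` matches the `4 * TICK_NS` = 40 ms bound quoted in Task 5:

```rust
// Illustrative only: weighted vruntime charge and the per-enqueue WFQ tag.
// Overflow handling and locking are elided; weight is nonzero by cap validation.
const REFERENCE_WEIGHT: u64 = 64; // inferred, see the constants sketch above
const INTERACTIVE_SLICE_DIVISOR: u64 = 2;
const BATCH_SLICE_MULTIPLIER: u64 = 4;
const TICK_NS: u64 = 10_000_000; // 10 ms quantum, consistent with 4 * TICK_NS = 40 ms

#[derive(Clone, Copy)]
enum LatencyClass { Interactive, Normal, Batch, IpcServer }

struct ThreadSched {
    weight: u16,
    latency_class: LatencyClass,
    runtime_ns: u64,
    virtual_runtime_ns: u64,
    virtual_finish_ns: u64,
}

impl ThreadSched {
    /// charge_runtime shape: runtime_ns advances 1:1 with wall time, while
    /// virtual_runtime_ns advances inversely to weight (the fairness mechanism).
    fn charge_runtime(&mut self, elapsed_ns: u64) {
        self.runtime_ns += elapsed_ns;
        self.virtual_runtime_ns += elapsed_ns * REFERENCE_WEIGHT / self.weight as u64;
    }

    /// Recomputed at every enqueue; never carried as committed state across blocking.
    fn refresh_virtual_finish_ns(&mut self) {
        let slice_ns = match self.latency_class {
            LatencyClass::Interactive => TICK_NS / INTERACTIVE_SLICE_DIVISOR,
            LatencyClass::Normal | LatencyClass::IpcServer => TICK_NS,
            LatencyClass::Batch => TICK_NS * BATCH_SLICE_MULTIPLIER,
        };
        self.virtual_finish_ns =
            self.virtual_runtime_ns + slice_ns * REFERENCE_WEIGHT / self.weight as u64;
    }
}
```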
Task 3: Per-CPU run queues and WFQ ordering
Closeout: 2026-05-07 23:45 UTC. Per-CPU run_queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS] reintroduced ordered ascending
by Thread.virtual_finish_ns via linear-scan insert (chosen for
simplicity at SCHEDULER_CPUS = 4; promotion to a smarter structure
deferred until benchmark evidence requires it). Each per-CPU queue is
reserved to the live runnable-capable thread count before publication
so the bounded steal path can migrate every live thread into a single
sibling queue without allocating in timer, unblock, direct-IPC
fallback, or steal-requeue paths. Local selection scans the local
queue by index for the first destination-Runnable entry, leaving
RetryLater entries in place so the dispatch pass cannot starve
runnable entries behind a non-runnable head whose
virtual_finish_ns has not changed. The bounded steal path scans
each sibling queue by index for that queue’s first
Runnable-for-destination entry, then picks the queue whose first-Runnable
candidate has the lowest virtual_finish_ns globally (ties by lower
CPU id), which prevents stranded runnable work behind a sibling-head
RetryLater or single-CPU-owner constraint. The chosen entry is
removed from its actual position on the source queue, the WFQ tag
is recomputed at the destination, and the entry is inserted at the
destination’s ordered position. WakePolicy::QueueCpu(u32) is
reinstated for endpoint, timer, park, process-wait, thread-join, and
process-spawn completions; Phase D initially kept the fallback
WakePolicy::QueueAny as the build-time opt-out under
CAPOS_SCHED_DISABLE_WFQ=1 (option_env!). The steal path remained
active under the opt-out so siblings drained queue 0, restoring the
pre-Task-3 single-global-queue behaviour on SMP. Migration-counter
increments and weight-change-after-block proof remain in Task 4
scope; the milestone gate (make run-thread-scale 2.5x 1-to-4)
remains in Task 6 scope.
- Reintroduce `SchedulerDispatch.run_queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS]` (the per-CPU bounded queues retired in the 2026-05-02 collapse). Reuse the documented runnable-ownership invariants in `docs/architecture/scheduling.md` (single dispatch owner per live `ThreadRef` across per-CPU `current`/`handoff_current` slots, the per-CPU run queues, and the direct IPC target slot).
- Reintroduce the per-CPU live-reservation accounting that the pre-collapse design used: reserve all per-CPU queues to the live runnable-capable thread count before publication; release on process/thread exit or pre-publication rollback. Timer, unblock, direct-IPC fallback, and steal-requeue paths must remain allocation-free.
- Order each per-CPU `VecDeque` ascending by `virtual_finish_ns`: enqueue inserts at the ordered position, selection picks the front (lowest `virtual_finish_ns` = most overdue against fair share). The exact ordering structure (sorted insert vs. small bucket array) is an implementation choice; document the decision in `docs/architecture/scheduling.md`. Linear-scan insert chosen.
- Restore the bounded steal path: a CPU whose local queue has no runnable entry walks sibling per-CPU queues bounded by `SCHEDULER_CPUS`. The scan walks each sibling queue’s indices ascending for that queue’s first Runnable-for-destination entry; because the queue is ordered ascending by `virtual_finish_ns`, the first hit is the lowest `virtual_finish_ns` candidate the destination can accept on that source. The steal target is the queue whose first-Runnable candidate has the lowest `virtual_finish_ns` globally: that is the most overdue thread another CPU has not yet dispatched, and the same selection rule the local pick uses. Ties break by lower CPU id. The steal removes that entry from its actual position on the source queue (not necessarily the head: a RetryLater or single-CPU-owner thread may sit at the front and stay there), recomputes `virtual_finish_ns` at the destination, and inserts at the destination ordered position. The first-Runnable-per-queue scan is required so a non-runnable sibling head does not strand later runnable entries behind it. (A sketch of the ordered insert and steal selection follows this list.)
- Restore `WakePolicy::QueueCpu(u32)` (or the WFQ equivalent placement variant) so endpoint, timer, park, process-wait, and thread-join completions can target a specific per-CPU queue. The single-global-queue `WakePolicy::QueueAny` was retained as the one-bisect fallback under `CAPOS_SCHED_DISABLE_WFQ=1`.
- Add `CAPOS_SCHED_DISABLE_WFQ=1` as a build-time opt-out (`option_env!`) for one bisect cycle; remove before Phase E.
- Validate: `cargo build --features qemu`, `cargo build --features qemu,measure`, `cargo test-lib`, `cargo test-ring-loom`, `make run-spawn`, `make run-smp2-smokes`. Plus regression: `make run-smoke`, `make run-measure`, `make capos-rt-check`, `make fmt-check`, `make generated-code-check`, `cargo test-config`.
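A sketch of the ordering and steal-selection rules above, using a plain `VecDeque` and a boolean stand-in for the Runnable-for-destination check. This is illustrative pseudocode of the rule only, not the kernel's lock-held, allocation-free implementation:

```rust
// Sketch only: per-CPU ordered insert and the bounded steal pick. In the kernel
// the entries are ThreadRef values and the scan runs under the scheduler lock.
use std::collections::VecDeque;

#[derive(Clone, Copy)]
struct Entry {
    virtual_finish_ns: u64,
    runnable_for_dest: bool, // stands in for the Runnable-for-destination check
}

/// Linear-scan insert keeping each per-CPU queue ascending by virtual_finish_ns.
fn ordered_insert(queue: &mut VecDeque<Entry>, entry: Entry) {
    let pos = queue
        .iter()
        .position(|e| e.virtual_finish_ns > entry.virtual_finish_ns)
        .unwrap_or(queue.len());
    queue.insert(pos, entry);
}

/// Bounded steal: per sibling queue, take that queue's first Runnable-for-
/// destination entry (its lowest acceptable tag), then pick the globally lowest
/// tag; ascending CPU iteration with a strict `<` breaks ties by lower CPU id.
fn pick_steal(queues: &[VecDeque<Entry>], dest: usize) -> Option<(usize, usize)> {
    let mut best: Option<(u64, usize, usize)> = None;
    for (cpu, queue) in queues.iter().enumerate() {
        if cpu == dest {
            continue;
        }
        if let Some((idx, e)) = queue.iter().enumerate().find(|(_, e)| e.runnable_for_dest) {
            let key = e.virtual_finish_ns;
            if best.map_or(true, |(k, _, _)| key < k) {
                best = Some((key, cpu, idx));
            }
        }
    }
    best.map(|(_, cpu, idx)| (cpu, idx))
}
```

The chosen `(cpu, idx)` entry would then be removed from its position on the source queue, its tag recomputed at the destination, and re-inserted via the ordered insert, matching the rule described in the steal bullet.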
Task 4: Migration fairness and weight propagation
Closeout: 2026-05-08 00:53 UTC.
- Verify (and document) that `virtual_runtime_ns` travels with the thread on every migration. The accounting record already encodes this; the WFQ enqueue path must explicitly recompute `virtual_finish_ns` from the vruntime and weight at the destination, never carry it as committed state. Verified by tracing every enqueue site (`push_reserved_run_queue_locked` for the initial-publish and post-block arms, plus the steal-insert at `steal_from_sibling_queues_locked`); each routes through `refresh_virtual_finish_ns_locked`, which reads `thread.weight`, `thread.latency_class`, and `thread.cpu_accounting.virtual_runtime_ns` fresh and writes `Thread.virtual_finish_ns`. The function bears an explicit doc-comment asserting the invariant; the steal site bears a matching block comment.
- Increment `ThreadCpuAccounting.migrations` on each cross-CPU enqueue, both for placement-time spread and for steal. Mirror the pre-collapse counter shape. Implemented as `record_placement_spread_migration_locked` (called from `push_reserved_run_queue_locked` when the target slot differs from `ThreadCpuAccounting.last_cpu`) and `record_steal_migration_locked` (called from the steal arm unconditionally, since the scan skips the destination slot). The counter remains `cfg(feature = "measure")`-gated; the dispatch-time `scheduled_measure` path no longer increments migrations and now only updates `last_cpu` so the enqueue-time check has the previous CPU available. Steady-state shape mirrors the pre-collapse counter (a thread that runs on a different CPU than its previous run still records exactly one migration); the increment is now attributed to the enqueue decision rather than the dispatch that follows.
- Prove that a thread whose weight changes through `SchedulingPolicyCap.setWeight` while it is enqueued observes the new weight on the next dequeue and re-enqueue; the weight must not be cached in `virtual_finish_ns` across blocking. Proved by construction: `setWeight` writes `Thread.weight` directly without touching `virtual_finish_ns`, and every enqueue site refreshes the WFQ tag from the current weight/latency-class/vruntime triple. Reinforced by an inline `debug_assert!` in `Process::refresh_thread_virtual_finish_ns` that the recomputed `virtual_finish_ns` is at or beyond the current `virtual_runtime_ns` (a future deadline, never a past one). A focused QEMU smoke that drives `setWeight` against an enqueued thread and verifies the post-block dispatch ordering is recorded as a Task 5 follow-up; see `docs/architecture/scheduling.md` Phase D Task 4 section. (A sketch of the enqueue-time refresh and invariant follows this list.)
- Validate: `make run-spawn`, `make run-smp2-smokes`, `make run-thread-scale` (single-iteration functional check; the milestone gate runs in Task 6).
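A small sketch of the enqueue-time rules above: the future-deadline invariant and the enqueue-attributed migration count. The types are minimal stand-ins; in the kernel these fields live on `Thread` / `ThreadCpuAccounting` under the scheduler lock:

```rust
// Sketch only: Task 4 rules at an enqueue site. Names are illustrative.
struct Tag {
    virtual_runtime_ns: u64,
    virtual_finish_ns: u64,
}

struct Acct {
    last_cpu: u32,   // kept current by the dispatch path
    migrations: u64, // cfg(feature = "measure")-gated in the kernel
}

/// `fresh_finish_ns` is assumed freshly derived from the *current*
/// weight/latency_class/vruntime triple (see the Task 2 sketch); a setWeight
/// issued while the thread was blocked is therefore observed at this enqueue.
fn enqueue_on_cpu(tag: &mut Tag, acct: &mut Acct, fresh_finish_ns: u64, target_cpu: u32) {
    tag.virtual_finish_ns = fresh_finish_ns;
    debug_assert!(
        tag.virtual_finish_ns >= tag.virtual_runtime_ns,
        "virtual_finish_ns must be a future deadline, never a past one"
    );
    // Placement-time spread: one migration when the target CPU differs from the
    // CPU the thread last ran on. The steal arm increments unconditionally
    // because its scan already skips the destination slot.
    if target_cpu != acct.last_cpu {
        acct.migrations += 1;
    }
}
```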
Task 5: Test matrix smokes
- CPU hogs. Reuse `make run-thread-scale` for the equal-weight functional path. The existing harness already records per-case `work` and `total` cycles and the 2026-05-02 21:38 UTC 1-to-2 baseline still gates the 1.6x evidence under WFQ. The differing-weights focused proof lives in the NEW `make run-thread-fairness` target driven by `system-thread-fairness.cue` and the demo at `demos/thread-fairness/`. The demo spawns three worker threads at WFQ weights 128:64:64 (the abi `DEFAULT_WEIGHT=64` doubled for the heavy worker keeps the assertion 2:1:1 while staying inside `MIN_WEIGHT..=MAX_WEIGHT`), runs each worker as a CPU-hog spinner under a fixed wallclock window, asks the kernel for each thread’s `runtime_ns` via `SchedulingPolicyCap.snapshot`, and asserts that the observed ratio falls inside a ±20% tolerance window around the weight-proportional target (see the sketch after this list). The harness `tools/qemu-thread-fairness-smoke.sh assert fairness` checks the demo emitted its `[thread-fairness] window_ns=` summary line with three nonzero `runtime_ns` values and that the demo’s per-worker tolerance pass `fairness ratio ok within 20%` succeeded.
- Short sleepers. Closed by the `make run-thread-fairness-interactive` target driven by `system-thread-fairness-interactive.cue`. The same demo binary spawns one CPU-hog worker (default weight, Normal class) plus one Timer-sleeper worker (default weight, Interactive latency class). The sleeper repeatedly calls `Timer.sleep` for a known short interval, computes observed wake-to-run latency as `now_after - now_before - sleep_ns`, drops the first four “settle” rounds, and asserts that the maximum observed latency stays below `4 * TICK_NS` (40 ms). The harness `tools/qemu-thread-fairness-smoke.sh assert interactive` verifies the demo’s bound check passed and the `interactive latency ok max=` summary line was printed. The bound is intentionally generous, starting from `4 * TICK_NS`, so the flake rate is acceptable on KVM-less QEMU; tighten with bench evidence in a follow-up if needed.
- Direct IPC server/client pairs. `make run-spawn` (and its qemu-spawn-smoke harness) remains the regression gate. The direct-IPC preference slot’s generation-checked semantics are unchanged under WFQ (Task 3 review confirmed: the `WakePolicy::QueueCpu` placement intent travels through the same direct-IPC handoff, and the per-CPU dispatch still polls the preference slot before the run-queue front), so the Task 5 contribution here is a regression assertion via the existing harness rather than new explicit paired-call timing. A timing-delta assertion against a historical baseline is recorded as a Task 5 follow-up (it would require recording paired-call medians per build and a per-host noise window; out of scope for this slice).
- Multi-process load. Reuse `make run-smp-process-scale`. The recorded 1.6x 1-to-2 gate continues to hold under WFQ. If a future run trips the gate, the failure blocks Task 6 progression and indicates a WFQ regression that must be diagnosed (steal-scan cost, weight-application latency, per-CPU queue contention) rather than relaxed.
- Weight-change-while-enqueued QEMU smoke (Task 4 deferral). Closed by the `make run-thread-fairness-weight-change` target driven by `system-thread-fairness-weight-change.cue`. Two competing child threads run a fixed wallclock window: the baseline worker stays at `DEFAULT_WEIGHT`, while the heavy worker self-calls `SchedulingPolicyCap.setWeight(weight=128)` and then blocks on `Timer.sleep` so it leaves the run queue before the contention window opens. Each worker snapshots its scheduler state at wake and at window end via `SchedulingPolicyCap.snapshot`, and the parent asserts three independent things: (1) the heavy snapshot reads `weight == 128` and the baseline reads `weight == DEFAULT_WEIGHT`; (2) the observed `runtime_ns` ratio under contention matches the weight ratio (target 2:1) within ±25%; (3) the heavy worker’s `virtual_runtime_ns` advances at half the rate of its `runtime_ns` (vruntime/runtime ~= 0.5 for weight=128 vs ~= 1.0 for the `DEFAULT_WEIGHT` baseline) within ±30%. The third check is the smoking gun for a stale-weight regression: a scheduler that kept the pre-`setWeight` weight inside `charge_runtime` would yield heavy vruntime/runtime ~= 1.0 instead of ~= 0.5, and the assertion would trip even if WFQ ordering self-corrected after the first dispatch. Together with the runtime-ratio assertion this exercises the Task 4 invariant that every enqueue site (and dispatch charging path) reads the current `weight`/`latency_class`/`virtual_runtime_ns` triple rather than reusing a cached value. Note that `SchedulingPolicyCap` is bound to `CapCallContext::caller_thread` per Task 2, so the thread mutating its own weight is the only authorized shape for the proof; cross-thread weight mutation is a Phase H privileged scheduler-policy-service concern.
- Same-process sibling load. This is the same shape as `make run-thread-scale` from Task 6; the milestone gate covers it.
- Validate: each new smoke under `make run-*` passes; existing smokes remain green.
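As an illustration of the demo-side assertion shape described in the CPU-hog bullet, a hedged sketch of the ±tolerance check; the actual demo's arithmetic, snapshot plumbing, and log lines may differ:

```rust
// Sketch only: check that observed runtimes match the weight-proportional split
// of the total observed runtime within a percentage tolerance.
fn fairness_ok(observed_ns: &[u64], weights: &[u64], tol_pct: u64) -> bool {
    let total: u64 = observed_ns.iter().sum();
    let weight_sum: u64 = weights.iter().sum();
    observed_ns.iter().zip(weights).all(|(&got, &w)| {
        let expected = total * w / weight_sum;
        let tol = expected * tol_pct / 100;
        got >= expected.saturating_sub(tol) && got <= expected + tol
    })
}

// e.g. three workers at weights 128:64:64 expect roughly a 2:1:1 runtime split
// (the runtime_* values here are hypothetical snapshot readings):
// fairness_ok(&[runtime_heavy, runtime_a, runtime_b], &[128, 64, 64], 20)
```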
Closeout: 2026-05-08 02:00 UTC. Test infrastructure only; no
kernel scheduler logic changes.
Task 6: Milestone gate — controlled make run-thread-scale
- Run a 5-run controlled `make run-thread-scale` on capos-bench, pinned to physical-core logical CPUs 0,1,2,3, against the post-WFQ kernel. Use the same benchmark shape as the recorded 2026-05-02 21:38 UTC pair: blocking parent join, 262,144 blocks (16 MiB), `work_rounds=64`, `CAPOS_THREAD_SCALE_REQUIRE_SPEEDUP=1`, `CAPOS_THREAD_SCALE_REQUIRE_TOTAL_SPEEDUP=1`.
- Required outcome: capOS work speedup of at least 2.5x at 1-to-4 (the recorded baseline is 1.566x). The 1-to-2 row must keep the configured 1.6x gate. Total speedup is reported as diagnostic and must not regress below the recorded 1.538x 1-to-4 baseline.
- Rerun the matching Linux pthread baseline (`make run-linux-thread-scale-baseline`) on the same pin set so the comparison stays apples-to-apples; the Linux number is informational, not a gate.
- Capture raw artifacts under `target/thread-scale/<timestamp>/` and `target/linux-thread-scale/<timestamp>/`. Record the pair in the `docs/changelog.md` Phase D entry.
- If the gate is not met: do not weaken the threshold. Diagnose the remaining bottleneck (scheduler-lock hold time, steal scan cost, weight-application latency, per-CPU queue contention) and submit a follow-up slice under this plan; the gate stays at 2.5x.
Closeout: 2026-05-10 19:46 UTC. On the benchmark VM at
branch commit 76025f0963a4, the 5-run controlled capOS
thread-scale gate passed with 1-to-4 work speedup 3.088x
and total speedup 2.700x; the 1-to-2 row kept the accepted
1.6x work/total gate at 1.809x / 1.774x. The matching
Linux pthread baseline on the same physical-core logical CPU
set recorded 1-to-4 work/total speedups 3.974x / 3.850x.
Raw artifacts are under target/thread-scale/20260510T193200Z/
and target/linux-thread-scale/20260510T194600Z/.
Task 7: Documentation and closeout
Closeout: Phase D passed its Task 6 evidence gate at commit 77caafc0
(2026-05-10 19:39 UTC, docs(scheduler): record phase d thread-scale gate)
and closed in docs commit 1a08ec23 (2026-05-10 21:47 UTC,
docs(scheduler): close phase d). The evidence commit records the controlled
benchmark pair from Task 6. The completed
plan moved to docs/plans/completed/scheduler-phase-d.md; Phase E
SchedulingContext is sequenced next, Phase F auto-nohz / SQPOLL /
tickless idle follows Phase E, and generic full-nohz remains deferred
behind those prerequisites. EEVDF is retained as a follow-on policy
evaluation, not a Phase D blocker. Phase D closeout left the
transitional CAPOS_SCHED_DISABLE_WFQ=1 / WakePolicy::QueueAny
fallback as a one-bisect-cycle source cleanup; the Phase E preflight
cleanup has since removed it before SchedulingContext work claims
the scheduler surface.
- Update `docs/architecture/scheduling.md` to describe the per-CPU runnable queue, the WFQ ordering rule, the migration/steal contract, the `SchedulingPolicyCap` cap surface, and the new runnable-ownership invariants. At Phase D closeout, record the transitional single-global-queue `WakePolicy::QueueAny` and `CAPOS_SCHED_DISABLE_WFQ=1` fallback as still present and scheduled for Phase E preflight removal rather than claiming it was retired by that docs-only slice.
- Update `docs/proposals/scheduler-evolution-proposal.md` Stage 3 status to “first slice landed” with commit hash and minute-precision timestamp; keep the EEVDF deferred follow-on note.
- Update `docs/backlog/scheduler-evolution.md` Phase D bullets with closeout stamps for each item; add the new “Phase D follow-on: EEVDF migration” item under Phase D so the deferred work is tracked.
- Update the `docs/roadmap.md` scheduler section to reflect Phase D landed; sequence Phase E next.
- Update `WORKPLAN.md` to remove the active “Scheduler Phase D” bullet and add a “Scheduler Phase D landed (closeout)” bullet referencing the commit and the `make run-thread-scale` evidence pair.
- Update the `docs/plans/README.md` Track Map row to mark this plan completed and move it to `docs/plans/completed/` per the directory’s lifecycle contract.
- Add a `docs/changelog.md` entry under “Scheduler Phase D landed” with the recorded `make run-thread-scale` evidence pair, the matching Linux baseline, and the commit hash.
- Validate: docs-closeout checks are `git diff --check`, `make workflow-check`, and `make docs`. The behavior, generated-code, and QEMU gates above were already satisfied by the implementation and benchmark slices and were intentionally not rerun for this docs-status closeout.