Plan: Scheduler Phase D — Weighted-Fair Best-Effort Scheduling

Overview

Implementation track for the Phase D best-effort fair-share policy chosen in docs/proposals/scheduler-evolution-proposal.md “Phase D first-policy decision (2026-05-05 19:00 UTC)”. The selected policy is weighted fair queueing (WFQ) on top of the existing per-thread runtime_ns / virtual_runtime_ns accounting, with reintroduced per-CPU runnable queues and a capability-authorized SchedulingPolicyCap for weight and latency-class mutation. EEVDF remains the deferred follow-on, to be revisited once the WFQ slice has accepted thread-scale evidence and only if the open Phase E SchedulingContext work has not displaced fair-share-only ordering by then.

The proposal section linked above is the design source of truth for the policy choice, the rejected alternative (EEVDF-first), the capability surface, the migration fairness sketch, the test matrix, and overload behavior. The tasks below decompose that design into implementation work; each task ends with the matching validation gate. The plan remains ad-hoc until selected, after which it runs as the selected scheduler milestone.

This completed plan replaced the bare WORKPLAN bullet “Scheduler Phase D – best-effort fair scheduling” while the WFQ slice was active. Phase E (SchedulingContext) and Phase F (auto-nohz / SQPOLL) keep their own backlog/plan ownership; this plan does not extend into them.

Conflict Surface

Historical ownership while this plan was active:

  • kernel/src/sched.rs (per-CPU run_queue reintroduction, WFQ ordering helpers, migration/steal path, capability-authorized weight/class mutation hooks).
  • kernel/src/process.rs (Thread.weight, Thread.latency_class, Thread.virtual_finish_ns field additions; default values match the current single-class FIFO behavior).
  • kernel/src/cap/sched_policy.rs (NEW kernel cap implementation for SchedulingPolicyCap).
  • kernel/src/cap/mod.rs (cap registration only).
  • schema/capos.capnp (new LatencyClass enum and SchedulingPolicyCap interface; queues on the shared serial surface per docs/plans/README.md Concurrency Notes).
  • tools/generated/ (regenerated capnp bindings via the existing make generated-code-check gate).
  • capos-rt/src/client.rs (new SchedulingPolicyClient typed wrapper for the userspace runtime).
  • capos-config/src/manifest.rs and the matching schema additions for manifest-granted SchedulingPolicyCap records.
  • tools/qemu-thread-scale-harness.sh (only if the harness needs WFQ-specific assertions; if a separate fairness smoke is added, it gets its own tools/qemu-*-smoke.sh).
  • docs/proposals/scheduler-evolution-proposal.md (status updates and Phase D closeout stamps only; the design content already landed with this plan).
  • docs/plans/completed/scheduler-phase-d.md (this file).
  • docs/plans/README.md Track Map row for this plan.
  • docs/architecture/scheduling.md (state-of-implementation updates as per-CPU queues and WFQ ordering land; mark “single global runnable queue” as historical when the per-CPU split returns).
  • docs/backlog/scheduler-evolution.md Phase D bullets and the matching closeout stamps.

Historical coordinated overlap with sibling tracks:

  • schema/capos.capnp: serialize on the shared serial surface per docs/plans/README.md Concurrency Notes. Phase D adds new interface entries and did not run concurrently with another schema-touching plan.
  • kernel/src/cap/: this plan adds one new cap module (sched_policy.rs) and touches the cap registration list. Other active plans that touched kernel/src/cap/ (Device Driver Foundation, POSIX P1.2/P1.3) were kernel-core-serial work.
  • kernel/src/sched.rs: this plan owned scheduler-core changes. Other plans did not modify the runnable queue, dispatch state, or weight/class fields while this plan was active.
  • kernel/src/process.rs: this plan added Thread fields. Other plans did not modify Thread state during the active slice.

Do not touch from this plan:

  • kernel/src/cap/sched_context.rs (Phase E surface; not yet written, owned by the future Phase E plan).
  • kernel/src/cap/cpu_isolation_lease.rs (Phase F surface; not yet written, owned by the future Phase F plan).
  • kernel/src/cap/realtime_island.rs (Phase G surface).
  • Userspace policy service (Phase H); the Phase D cap surface must be Phase H-consumable but Phase D does not build the policy service itself.
  • tools/remote-session-client/ (owned by remote-session plan).
  • docs/topics.md (auto-regenerated; never edit manually).
  • Any unrelated proposal/plan file.

Validation Commands

  • make fmt-check
  • make generated-code-check
  • cargo build --features qemu
  • cargo build --features qemu,measure
  • cargo test-config
  • cargo test-lib
  • cargo test-ring-loom
  • cargo build-demos-capos
  • make capos-rt-check
  • make run-smoke
  • make run-spawn
  • make run-smp2-smokes
  • make run-thread-scale (the milestone gate; must materially close the recorded 1-to-4 capOS-vs-Linux gap)
  • make run-smp-process-scale (regression gate; must keep the recorded 1-to-2 1.6x speedup against the multi-process proof)
  • make run-measure (regression gate; the new accounting fields must not break the existing measure-mode proof line)

Success Criteria

Phase D is recorded done when:

  1. The SchedulingPolicyCap interface is in schema/capos.capnp, the kernel cap implementation is in kernel/src/cap/sched_policy.rs, the manifest grant path is wired through capos-config, and the userspace typed client is in capos-rt/src/client.rs. A focused QEMU smoke proves a manifest-granted cap can mutate weight and latency class on a target ThreadHandle and that a stale or revoked cap fails closed.
  2. Per-CPU runnable queues are reintroduced under the WFQ ordering rule. The single-global-queue fallback remained selectable via CAPOS_SCHED_DISABLE_WFQ=1 for one bisect cycle and was retired by Phase E preflight.
  3. Migration preserves virtual_runtime_ns (already per-thread) and recomputes virtual_finish_ns at destination enqueue. The bounded steal path scans each sibling queue by index for that queue’s first Runnable-for-destination entry (because each queue is ordered ascending by virtual_finish_ns, the first Runnable hit per queue is the lowest candidate the destination can accept on that source) and picks the queue whose first-Runnable candidate has the lowest virtual_finish_ns globally (the most overdue work another CPU has not yet dispatched), with ties broken by lower CPU id, matching the fair-share rule the local pick uses.
  4. Materially close the 1-to-4 capOS-vs-Linux thread-scale gap. Concretely: a 5-run controlled make run-thread-scale against the post-WFQ kernel, pinned to physical-core logical CPUs 0,1,2,3 on capos-bench, must record capOS work speedup of at least 2.5x at 1-to-4 (the recorded baseline is 1.566x; Linux records 3.963x against the same shape on the same pin set). The 1-to-2 row must keep the configured 1.6x gate. Total speedup is reported as diagnostic and must not regress below the recorded 1.538x 1-to-4 baseline.
  5. make run-spawn, make run-smp2-smokes, make run-smp-process-scale, and make run-measure remain green. The recorded multi-process 1.6x 1-to-2 gate from 2026-04-30 must hold.
  6. docs/proposals/scheduler-evolution-proposal.md Phase D section, docs/backlog/scheduler-evolution.md Phase D bullets, docs/architecture/scheduling.md, docs/changelog.md, WORKPLAN.md, and docs/roadmap.md carry the closeout stamp with commit hash and minute-precision timestamp.

The plan is not scoped to deliver Phase E (SchedulingContext budget/period authority), Phase F (CpuIsolationLease and SQPOLL nohz), Phase G (RealtimeIsland), or Phase H (userspace policy service). Those are sequenced after Phase D and own their own plan files.

Task 1: Schema and capability surface

Status: landed 2026-05-07 21:59 UTC at commit cb8c58b1 (sched(phase-d-task1): schema + capability surface). Tasks 2-4 are unblocked. Tasks 4-6 are unblocked by Task 3 (2026-05-07 23:45 UTC).

  • Add the invalidArgument variant to the existing ExceptionType enum in schema/capos.capnp. The current enum has only failed/overloaded/disconnected/unimplemented; setWeight policy denial below needs a distinct typed signal (caller bug rejection vs general failure vs back-pressure). This addition is part of the Phase D schema-surface acquisition documented in the proposal Phase D capability surface section. Keep the variant ordering stable for ABI compatibility.
  • Add the LatencyClass enum (interactive, normal, batch, ipcServer) to schema/capos.capnp and regenerate bindings via make generated-code-check.
  • Add the SchedulingPolicyCap interface with setWeight, setLatencyClass, and snapshot methods. The snapshot return is narrow: weight, class, runtimeNs, virtualRuntimeNs. Those four fields are the ones Task 2 promotes out of cfg(feature = "measure") unconditionally. Do NOT add contextSwitches, preemptions, voluntaryBlocks, or migrations to the ABI in this slice; those counters stay benchmark-only and would either fail to compile in the normal qemu build or expose fields the kernel does not track. A future operator-observability slice may add them through a separate snapshot cap.
  • Implement setWeight validation at the cap boundary (not the dispatch path) with the rule from the proposal: weight = 0 and any nonzero value outside [MIN_WEIGHT, MAX_WEIGHT] (Phase D constants) are rejected with CapException::InvalidArgument. The kernel does NOT silently clamp out-of-range values; a future caller/test can rely on the rejection signal. This ensures no later divide-by-zero or overflow path is reachable through the cap. Implementation lives in kernel/src/cap/sched_policy.rs; the typed exception flows through the existing kernel/src/cap/ring.rs dispatcher via a sentinel-prefix channel because capnp 0.25 has no ErrorKind::InvalidArgument variant and the enum is #[non_exhaustive].
  • Add a KernelCapSource::SchedulingPolicy variant under the manifest grant path so a manifest can grant the cap to a named process. Phase D grants the cap only to focused-proof manifests (system-thread-fairness.cue and similar Task 5 smokes); the default boot manifest does NOT grant the cap in this slice. Wider authority (cross-process weight/class mutation, default-grant to a userspace policy service) belongs to the future Phase H plan.
  • Add a capos-rt::client::SchedulingPolicyClient typed wrapper that maps transport errors and CapException decode shape consistently with the existing clients.
  • Validate: make fmt-check, make generated-code-check, cargo test-config, cargo test-lib, cargo build --features qemu (warning-free).
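As a standalone illustration of the setWeight boundary rule above, the sketch below mirrors the reject-don't-clamp contract. The constant values and the error-type shape here are placeholders for illustration only, not the real kernel/src/cap/sched_policy.rs implementation or the Phase D constants in capos-abi.

```rust
// Illustrative placeholders; the real bounds live in capos-abi/src/scheduler.rs
// and the real typed signal is the CapException invalidArgument variant.
const MIN_WEIGHT: u16 = 1;
const MAX_WEIGHT: u16 = 1024;

#[derive(Debug, PartialEq)]
enum CapError {
    InvalidArgument,
}

/// Reject weight = 0 and any value outside [MIN_WEIGHT, MAX_WEIGHT] at the
/// cap boundary. The kernel never silently clamps, so callers get a typed
/// signal, and no divide-by-zero path is reachable through the cap.
fn validate_weight(weight: u16) -> Result<u16, CapError> {
    if weight < MIN_WEIGHT || weight > MAX_WEIGHT {
        return Err(CapError::InvalidArgument);
    }
    Ok(weight)
}

fn main() {
    assert_eq!(validate_weight(64), Ok(64));
    assert_eq!(validate_weight(0), Err(CapError::InvalidArgument));
    assert_eq!(validate_weight(2048), Err(CapError::InvalidArgument));
}
```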

Task 2: Per-thread weight and latency-class state

Closeout: 2026-05-07 22:51 UTC. Cap-state binding decision: context-derived caller-thread fallback. The cap routes every method (setWeight, setLatencyClass, snapshot) to CapCallContext::caller_thread; cross-thread/cross-process mutation is deferred to Phase H. Phase D constants moved from kernel/src/cap/sched_policy.rs into capos-abi/src/scheduler.rs (MIN_WEIGHT, MAX_WEIGHT, DEFAULT_WEIGHT, REFERENCE_WEIGHT, plus the new INTERACTIVE_SLICE_DIVISOR = 2 and BATCH_SLICE_MULTIPLIER = 4).

  • Add weight: u16 and latency_class: LatencyClass fields to Thread (in kernel/src/process.rs), with default values matching the current single-class behavior (weight = DEFAULT_WEIGHT, latency_class = LatencyClass::Normal). These fields must be unconditional (not behind cfg(feature = "measure")) because they participate in dispatch ordering.
  • Promote runtime_ns, virtual_runtime_ns, and last_started_ns from ThreadCpuAccounting out of cfg(feature = "measure") so the WFQ ordering, the runtime-charge path, and the snapshot cap method work in the normal qemu build. The context_switches, preemptions, voluntary_blocks, and migrations counters stay behind the measure feature and are NOT exposed through SchedulingPolicyCap.snapshot in this slice. Documented in docs/architecture/scheduling.md.
  • Change the charge_runtime step so virtual_runtime_ns advances by elapsed_ns * REFERENCE_WEIGHT / weight instead of 1:1 with elapsed_ns. runtime_ns continues to advance 1:1 with elapsed time so monotonic CPU accounting and snapshot.runtimeNs are unchanged. This is the actual fairness mechanism; without it, weights affect only enqueue-order ties rather than cumulative share.
  • Add virtual_finish_ns: u64 derived per enqueue and not stored across blocking. The derivation rule depends on latency_class per the proposal’s “Latency-class semantics for Phase D” subsection: Normal and IpcServer use vruntime + slice_ns * REFERENCE_WEIGHT / weight; Interactive uses vruntime + (slice_ns / INTERACTIVE_SLICE_DIVISOR) * REFERENCE_WEIGHT / weight; Batch uses vruntime + (slice_ns * BATCH_SLICE_MULTIPLIER) * REFERENCE_WEIGHT / weight. slice_ns is the existing crate::arch::context::TICK_NS quantum; REFERENCE_WEIGHT, MIN_WEIGHT, MAX_WEIGHT, DEFAULT_WEIGHT, INTERACTIVE_SLICE_DIVISOR, and BATCH_SLICE_MULTIPLIER live in capos-abi/src/scheduler.rs.
  • Add the kernel-side mutation entry points behind the SchedulingPolicyCap dispatch only. No ambient process field, no per-process default, no syscall path that bypasses the cap. Caller-thread binding through CapCallContext::caller_thread; idle thread and stale thread refs return the standard CallerNotLive failure that surfaces to userspace as the disconnected-class CapException taxonomy entry.
  • Validate: cargo build --features qemu, cargo build --features qemu,measure, cargo test-lib, make capos-rt-check. Regression: make run-smoke, make run-spawn, make run-measure.
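The weighted charge step above can be sketched in isolation. REFERENCE_WEIGHT = 64 is inferred from the Task 5 evidence (a weight-128 thread accrues vruntime at half rate against the DEFAULT_WEIGHT = 64 baseline); the struct is a minimal stand-in for ThreadCpuAccounting, not the kernel type.

```rust
// Assumed value, inferred from the recorded weight-128 half-rate evidence;
// the real constant lives in capos-abi/src/scheduler.rs.
const REFERENCE_WEIGHT: u64 = 64;

struct Accounting {
    runtime_ns: u64,
    virtual_runtime_ns: u64,
}

/// runtime_ns stays 1:1 with elapsed wall time (monotonic CPU accounting,
/// unchanged snapshot.runtimeNs); virtual_runtime_ns advances inversely
/// with weight, which is the actual fairness mechanism.
fn charge_runtime(acct: &mut Accounting, elapsed_ns: u64, weight: u16) {
    acct.runtime_ns += elapsed_ns;
    acct.virtual_runtime_ns += elapsed_ns * REFERENCE_WEIGHT / weight as u64;
}

fn main() {
    let mut heavy = Accounting { runtime_ns: 0, virtual_runtime_ns: 0 };
    charge_runtime(&mut heavy, 1_000, 128);
    // A weight-128 thread accrues vruntime at half the reference rate.
    assert_eq!((heavy.runtime_ns, heavy.virtual_runtime_ns), (1_000, 500));
}
```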

Task 3: Per-CPU run queues and WFQ ordering

Closeout: 2026-05-07 23:45 UTC. Per-CPU run_queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS] reintroduced ordered ascending by Thread.virtual_finish_ns via linear-scan insert (chosen for simplicity at SCHEDULER_CPUS = 4; promotion to a smarter structure deferred until benchmark evidence requires it). Each per-CPU queue is reserved to the live runnable-capable thread count before publication so the bounded steal path can migrate every live thread into a single sibling queue without allocating in timer, unblock, direct-IPC fallback, or steal-requeue paths. Local selection scans the local queue by index for the first destination-Runnable entry, leaving RetryLater entries in place so the dispatch pass cannot starve runnable entries behind a non-runnable head whose virtual_finish_ns has not changed. The bounded steal path scans each sibling queue by index for that queue’s first Runnable-for-destination entry, then picks the queue whose first-Runnable candidate has the lowest virtual_finish_ns globally (ties by lower CPU id), which prevents stranded runnable work behind a sibling-head RetryLater or single-CPU-owner constraint. The chosen entry is removed from its actual position on the source queue, the WFQ tag is recomputed at the destination, and the entry is inserted at the destination’s ordered position. WakePolicy::QueueCpu(u32) is reinstated for endpoint, timer, park, process-wait, thread-join, and process-spawn completions; Phase D initially kept the fallback WakePolicy::QueueAny as the build-time opt-out under CAPOS_SCHED_DISABLE_WFQ=1 (option_env!). The steal path remained active under the opt-out so siblings drained queue 0, restoring the pre-Task-3 single-global-queue behavior on SMP. Migration-counter increments and weight-change-after-block proof remain in Task 4 scope; the milestone gate (make run-thread-scale 2.5x 1-to-4) remains in Task 6 scope.

  • Reintroduce SchedulerDispatch.run_queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS] (the per-CPU bounded queues retired in the 2026-05-02 collapse). Reuse the documented runnable-ownership invariants in docs/architecture/scheduling.md (single dispatch owner per live ThreadRef across per-CPU current/ handoff_current slots, the per-CPU run queues, and the direct IPC target slot).
  • Reintroduce the per-CPU live-reservation accounting that the pre-collapse design used: reserve all per-CPU queues to the live runnable-capable thread count before publication; release on process/thread exit or pre-publication rollback. Timer, unblock, direct-IPC fallback, and steal-requeue paths must remain allocation-free.
  • Order each per-CPU VecDeque ascending by virtual_finish_ns: enqueue inserts at the ordered position, selection picks the front (lowest virtual_finish_ns = most overdue against fair share). The exact ordering structure (sorted insert vs. small bucket array) is an implementation choice; document the decision in docs/architecture/scheduling.md. Linear-scan insert chosen.
  • Restore the bounded steal path: a CPU whose local queue has no runnable entry walks sibling per-CPU queues bounded by SCHEDULER_CPUS. The scan walks each sibling queue’s indices ascending for that queue’s first Runnable-for-destination entry; because the queue is ordered ascending by virtual_finish_ns, the first hit is the lowest virtual_finish_ns candidate the destination can accept on that source. The steal target is the queue whose first-Runnable candidate has the lowest virtual_finish_ns globally — that is the most overdue thread another CPU has not yet dispatched, and the same selection rule the local pick uses. Ties break by lower CPU id. The steal removes that entry from its actual position on the source queue (not necessarily the head: a RetryLater or single-CPU-owner thread may sit at the front and stay there), recomputes virtual_finish_ns at the destination, and inserts at the destination ordered position. The first-Runnable-per-queue scan is required so a non-runnable sibling head does not strand later runnable entries behind it.
  • Restore WakePolicy::QueueCpu(u32) (or the WFQ equivalent placement variant) so endpoint, timer, park, process-wait, and thread-join completions can target a specific per-CPU queue. The single-global-queue WakePolicy::QueueAny was retained as the one-bisect fallback under CAPOS_SCHED_DISABLE_WFQ=1.
  • Add CAPOS_SCHED_DISABLE_WFQ=1 as a build-time opt-out (option_env!) for one bisect cycle; remove before Phase E.
  • Validate: cargo build --features qemu, cargo build --features qemu,measure, cargo test-lib, cargo test-ring-loom, make run-spawn, make run-smp2-smokes. Plus regression: make run-smoke, make run-measure, make capos-rt-check, make fmt-check, make generated-code-check, cargo test-config.
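The ordering and steal-selection rules above can be sketched against a stub entry type. This is a minimal standalone model, not the kernel's ThreadRef state; the boolean stands in for the Runnable-for-destination check.

```rust
use std::collections::VecDeque;

struct Entry {
    virtual_finish_ns: u64,
    runnable_for_dest: bool, // stand-in for the Runnable-for-destination check
}

/// Linear-scan ordered insert: the queue stays ascending by virtual_finish_ns,
/// so the front is the most overdue entry against fair share.
fn enqueue(queue: &mut VecDeque<Entry>, entry: Entry) {
    let pos = queue
        .iter()
        .position(|e| e.virtual_finish_ns > entry.virtual_finish_ns)
        .unwrap_or(queue.len());
    queue.insert(pos, entry);
}

/// Bounded steal selection: per sibling queue, the first Runnable hit is that
/// queue's lowest acceptable candidate; pick the queue whose candidate has
/// the globally lowest virtual_finish_ns, ties broken by lower CPU id.
/// Returns (source cpu, index within that queue).
fn pick_steal_source(queues: &[VecDeque<Entry>]) -> Option<(usize, usize)> {
    let mut best: Option<(usize, usize, u64)> = None;
    for (cpu, queue) in queues.iter().enumerate() {
        if let Some(idx) = queue.iter().position(|e| e.runnable_for_dest) {
            let vf = queue[idx].virtual_finish_ns;
            // Strict < keeps the earlier (lower) CPU id on ties.
            if best.map_or(true, |(_, _, b)| vf < b) {
                best = Some((cpu, idx, vf));
            }
        }
    }
    best.map(|(cpu, idx, _)| (cpu, idx))
}

fn main() {
    let mut q0 = VecDeque::new();
    for vf in [30, 10, 20] {
        enqueue(&mut q0, Entry { virtual_finish_ns: vf, runnable_for_dest: true });
    }
    let order: Vec<u64> = q0.iter().map(|e| e.virtual_finish_ns).collect();
    assert_eq!(order, vec![10, 20, 30]);

    // Sibling 0 has a non-runnable head (vf 10); its first Runnable hit is vf 20.
    let q_a = VecDeque::from(vec![
        Entry { virtual_finish_ns: 10, runnable_for_dest: false },
        Entry { virtual_finish_ns: 20, runnable_for_dest: true },
    ]);
    let q_b = VecDeque::from(vec![Entry { virtual_finish_ns: 15, runnable_for_dest: true }]);
    // Queue 1's candidate (vf 15) beats queue 0's (vf 20).
    assert_eq!(pick_steal_source(&[q_a, q_b]), Some((1, 0)));
}
```

The non-runnable head in the example is exactly the stranding case the first-Runnable-per-queue scan exists to avoid: a head-of-queue RetryLater entry must not hide the runnable vf-20 entry behind it.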

Task 4: Migration fairness and weight propagation

Closeout: 2026-05-08 00:53 UTC.

  • Verify (and document) that virtual_runtime_ns travels with the thread on every migration. The accounting record already encodes this; the WFQ enqueue path must explicitly recompute virtual_finish_ns from the vruntime and weight at the destination, never carry it as committed state. Verified by tracing every enqueue site (push_reserved_run_queue_locked for the initial-publish and post-block arms, plus the steal-insert at steal_from_sibling_queues_locked); each routes through refresh_virtual_finish_ns_locked, which reads thread.weight, thread.latency_class, and thread.cpu_accounting.virtual_runtime_ns fresh and writes Thread.virtual_finish_ns. The function bears an explicit doc-comment asserting the invariant; the steal site bears a matching block comment.
  • Increment ThreadCpuAccounting.migrations on each cross-CPU enqueue, both for placement-time spread and for steal. Mirror the pre-collapse counter shape. Implemented as record_placement_spread_migration_locked (called from push_reserved_run_queue_locked when target slot differs from ThreadCpuAccounting.last_cpu) and record_steal_migration_locked (called from the steal arm unconditionally, since the scan skips the destination slot). The counter remains cfg(feature = "measure")-gated; the dispatch-time scheduled_measure path no longer increments migrations and now only updates last_cpu so the enqueue-time check has the previous CPU available. Steady-state shape mirrors the pre-collapse counter (a thread that runs on a different CPU than its previous run still records exactly one migration); the increment is now attributed to the enqueue decision rather than the dispatch that follows.
  • Prove that a thread whose weight changes through SchedulingPolicyCap.setWeight while it is enqueued observes the new weight on the next dequeue and re-enqueue; the weight must not be cached in virtual_finish_ns across blocking. Proved by construction: setWeight writes Thread.weight directly without touching virtual_finish_ns, and every enqueue site refreshes the WFQ tag from the current weight/latency-class/vruntime triple. Reinforced by an inline debug_assert! in Process::refresh_thread_virtual_finish_ns that the recomputed virtual_finish_ns is at or beyond the current virtual_runtime_ns (a future deadline, never a past one). A focused QEMU smoke that drives setWeight against an enqueued thread and verifies the post-block dispatch ordering is recorded as a Task 5 follow-up; see docs/architecture/scheduling.md Phase D Task 4 section.
  • Validate: make run-spawn, make run-smp2-smokes, make run-thread-scale (single-iteration functional check; the milestone gate runs in Task 6).
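The recompute-fresh invariant can be illustrated standalone. TICK_NS = 10 ms is inferred from Task 5's 4 * TICK_NS = 40 ms bound; the divisor and multiplier values come from the Task 2 closeout; the function shape is a sketch, not the kernel's refresh_virtual_finish_ns_locked.

```rust
// Constants per the Task 2 closeout (INTERACTIVE_SLICE_DIVISOR = 2,
// BATCH_SLICE_MULTIPLIER = 4); TICK_NS inferred from the Task 5 latency bound.
const REFERENCE_WEIGHT: u64 = 64;
const TICK_NS: u64 = 10_000_000;
const INTERACTIVE_SLICE_DIVISOR: u64 = 2;
const BATCH_SLICE_MULTIPLIER: u64 = 4;

#[derive(Clone, Copy)]
enum LatencyClass { Interactive, Normal, Batch, IpcServer }

/// Recomputed fresh at every enqueue from the current vruntime, weight, and
/// latency class; never carried as committed state across blocking, so a
/// setWeight while enqueued takes effect on the next re-enqueue.
fn virtual_finish_ns(vruntime: u64, weight: u16, class: LatencyClass) -> u64 {
    let slice = match class {
        LatencyClass::Interactive => TICK_NS / INTERACTIVE_SLICE_DIVISOR,
        LatencyClass::Normal | LatencyClass::IpcServer => TICK_NS,
        LatencyClass::Batch => TICK_NS * BATCH_SLICE_MULTIPLIER,
    };
    vruntime + slice * REFERENCE_WEIGHT / weight as u64
}

fn main() {
    // Normal class at reference weight: finish one full slice ahead.
    assert_eq!(virtual_finish_ns(0, 64, LatencyClass::Normal), 10_000_000);
    // Doubling the weight halves the finish horizon.
    assert_eq!(virtual_finish_ns(0, 128, LatencyClass::Normal), 5_000_000);
    // The tag is always at or beyond vruntime (the debug_assert! invariant).
    assert!(virtual_finish_ns(7, 64, LatencyClass::Batch) >= 7);
}
```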

Task 5: Test matrix smokes

  • CPU hogs. Reuse make run-thread-scale for the equal-weight functional path. The existing harness already records per-case work and total cycles, and the 2026-05-02 21:38 UTC 1-to-2 baseline still gates the 1.6x evidence under WFQ. The differing-weights focused proof lives in the NEW make run-thread-fairness target driven by system-thread-fairness.cue and the demo at demos/thread-fairness/. The demo spawns three worker threads at WFQ weights 128:64:64 (the abi DEFAULT_WEIGHT=64 doubled for the heavy worker keeps the assertion 2:1:1 while staying inside MIN_WEIGHT..=MAX_WEIGHT), runs each worker as a CPU-hog spinner under a fixed wallclock window, asks the kernel for each thread’s runtime_ns via SchedulingPolicyCap.snapshot, and asserts that the observed ratio falls inside a ±20% tolerance window around the weight-proportional target. The harness tools/qemu-thread-fairness-smoke.sh assert fairness checks that the demo emitted its “[thread-fairness] window_ns=” summary line with three nonzero runtime_ns values and that the demo’s per-worker tolerance pass (“fairness ratio ok within 20%”) succeeded.
  • Short sleepers. Closed by the make run-thread-fairness-interactive target driven by system-thread-fairness-interactive.cue. The same demo binary spawns one CPU-hog worker (default weight, Normal class) plus one Timer-sleeper worker (default weight, Interactive latency class). The sleeper repeatedly calls Timer.sleep for a known short interval, computes observed wake-to-run latency as now_after - now_before - sleep_ns, drops the first four “settle” rounds, and asserts that the maximum observed latency stays below 4 * TICK_NS (40 ms). The harness tools/qemu-thread-fairness-smoke.sh assert interactive verifies that the demo’s bound check passed and that the “interactive latency ok max=” summary line was printed. The bound intentionally starts generous at 4 * TICK_NS so the flake rate is acceptable on KVM-less QEMU; tighten with bench evidence in a follow-up if needed.
  • Direct IPC server/client pairs. make run-spawn (and its qemu-spawn-smoke harness) remains the regression gate. The direct-IPC preference slot’s generation-checked semantics are unchanged under WFQ (Task 3 review confirmed: the WakePolicy::QueueCpu placement intent travels through the same direct-IPC handoff, and the per-CPU dispatch still polls the preference slot before the run-queue front), so the Task 5 contribution here is a regression assertion via the existing harness rather than new explicit paired-call timing. A timing-delta assertion against a historical baseline is recorded as a Task 5 follow-up (would require recording paired-call medians per build and a per-host noise window; out of scope for this slice).
  • Multi-process load. Reuse make run-smp-process-scale. The recorded 1.6x 1-to-2 gate continues to hold under WFQ. If a future run trips the gate, the failure blocks Task 6 progression and indicates a WFQ regression that must be diagnosed (steal-scan cost, weight-application latency, per-CPU queue contention) rather than relaxed.
  • Weight-change-while-enqueued QEMU smoke (Task 4 deferral). Closed by the make run-thread-fairness-weight-change target driven by system-thread-fairness-weight-change.cue. Two competing child threads run a fixed wallclock window: the baseline worker stays at DEFAULT_WEIGHT, while the heavy worker self-calls SchedulingPolicyCap.setWeight(weight=128) and then blocks on Timer.sleep so it leaves the run queue before the contention window opens. Each worker snapshots its scheduler state at wake and at window end via SchedulingPolicyCap.snapshot, and the parent asserts three independent things: (1) the heavy snapshot reads weight == 128 and the baseline reads weight == DEFAULT_WEIGHT; (2) the observed runtime_ns ratio under contention matches the weight ratio (target 2:1) within ±25%; (3) the heavy worker’s virtual_runtime_ns advances at half the rate of its runtime_ns (vruntime/runtime ~= 0.5 for weight=128 vs ~= 1.0 for the DEFAULT_WEIGHT baseline) within ±30%. The third check is the smoking gun for a stale-weight regression: a scheduler that kept the pre-setWeight weight inside charge_runtime would yield heavy vruntime/runtime ~= 1.0 instead of ~= 0.5, and the assertion would trip even if WFQ ordering self-corrected after the first dispatch. Together with the runtime-ratio assertion this exercises the Task 4 invariant that every enqueue site (and dispatch charging path) reads the current weight/latency_class/virtual_runtime_ns triple rather than reusing a cached value. Note that SchedulingPolicyCap is bound to CapCallContext::caller_thread per Task 2, so the thread mutating its own weight is the only authorized shape for the proof; cross-thread weight mutation is a Phase H privileged scheduler-policy service concern.
  • Same-process sibling load. This is the same shape as make run-thread-scale from Task 6; the milestone gate covers it.
  • Validate: each new smoke under make run-* passes; existing smokes remain green.

Closeout: 2026-05-08 02:00 UTC. Test infrastructure only; no kernel scheduler logic changes.
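The weight-proportional tolerance check the fairness smokes apply reduces to simple integer arithmetic. This sketch assumes the demo compares per-thread runtime shares against weight shares in fixed point; the demo's actual internals may differ.

```rust
/// Returns true when every thread's observed runtime share falls within
/// `tol_pct` percent of its weight-proportional target share.
/// Shares are compared in per-mille fixed point to stay integer-only,
/// matching a no-float kernel-adjacent environment (an assumption here).
fn fairness_ratio_ok(runtime_ns: &[u64], weights: &[u64], tol_pct: u64) -> bool {
    let total_runtime: u64 = runtime_ns.iter().sum();
    let total_weight: u64 = weights.iter().sum();
    runtime_ns.iter().zip(weights).all(|(&rt, &w)| {
        let observed = rt * 1000 / total_runtime; // observed share, per mille
        let target = w * 1000 / total_weight;     // fair share, per mille
        let tol = target * tol_pct / 100;
        observed.abs_diff(target) <= tol
    })
}

fn main() {
    // 128:64:64 weights with an exactly proportional 2:1:1 runtime split.
    assert!(fairness_ratio_ok(&[500, 250, 250], &[128, 64, 64], 20));
    // Heavy worker got 350/1000 against a 500/1000 target: outside 20%.
    assert!(!fairness_ratio_ok(&[350, 325, 325], &[128, 64, 64], 20));
}
```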

Task 6: Milestone gate — controlled make run-thread-scale

  • Run a 5-run controlled make run-thread-scale on capos-bench, pinned to physical-core logical CPUs 0,1,2,3, against the post-WFQ kernel. Use the same benchmark shape as the recorded 2026-05-02 21:38 UTC pair: blocking parent join, 262,144 blocks (16 MiB), work_rounds=64, CAPOS_THREAD_SCALE_REQUIRE_SPEEDUP=1, CAPOS_THREAD_SCALE_REQUIRE_TOTAL_SPEEDUP=1.
  • Required outcome: capOS work speedup of at least 2.5x at 1-to-4 (the recorded baseline is 1.566x). The 1-to-2 row must keep the configured 1.6x gate. Total speedup is reported as diagnostic and must not regress below the recorded 1.538x 1-to-4 baseline.
  • Rerun the matching Linux pthread baseline (make run-linux-thread-scale-baseline) on the same pin set so the comparison stays apples-to-apples; the Linux number is informational, not a gate.
  • Capture raw artifacts under target/thread-scale/<timestamp>/ and target/linux-thread-scale/<timestamp>/. Record the pair in docs/changelog.md Phase D entry.
  • If the gate is not met: do not weaken the threshold. Diagnose the remaining bottleneck (scheduler-lock hold time, steal scan cost, weight-application latency, per-CPU queue contention) and submit a follow-up slice under this plan; the gate stays at 2.5x.

Closeout: 2026-05-10 19:46 UTC. On the benchmark VM at branch commit 76025f0963a4, the 5-run controlled capOS thread-scale gate passed with 1-to-4 work speedup 3.088x and total speedup 2.700x; the 1-to-2 row kept the accepted 1.6x work/total gate at 1.809x / 1.774x. The matching Linux pthread baseline on the same physical-core logical CPU set recorded 1-to-4 work/total speedups 3.974x / 3.850x. Raw artifacts are under target/thread-scale/20260510T193200Z/ and target/linux-thread-scale/20260510T194600Z/.

Task 7: Documentation and closeout

Closeout: Phase D passed its Task 6 evidence gate at commit 77caafc0 (2026-05-10 19:39 UTC, docs(scheduler): record phase d thread-scale gate) and closed in docs commit 1a08ec23 (2026-05-10 21:47 UTC, docs(scheduler): close phase d). The evidence commit records the controlled benchmark pair from Task 6. The completed plan moved to docs/plans/completed/scheduler-phase-d.md; Phase E SchedulingContext is sequenced next, Phase F auto-nohz / SQPOLL / tickless idle follows Phase E, and generic full-nohz remains deferred behind those prerequisites. EEVDF is retained as a follow-on policy evaluation, not a Phase D blocker. Phase D closeout left the transitional CAPOS_SCHED_DISABLE_WFQ=1 / WakePolicy::QueueAny fallback as a one-bisect-cycle source cleanup; the Phase E preflight cleanup has since removed it before SchedulingContext work claims the scheduler surface.

  • Update docs/architecture/scheduling.md to describe the per-CPU runnable queue, the WFQ ordering rule, the migration/steal contract, the SchedulingPolicyCap cap surface, and the new runnable-ownership invariants. At Phase D closeout, record the transitional single-global-queue WakePolicy::QueueAny and CAPOS_SCHED_DISABLE_WFQ=1 fallback as still present and scheduled for Phase E preflight removal rather than claiming it was retired by that docs-only slice.
  • Update docs/proposals/scheduler-evolution-proposal.md Stage 3 status to “first slice landed” with commit hash and minute-precision timestamp; keep the EEVDF deferred follow-on note.
  • Update docs/backlog/scheduler-evolution.md Phase D bullets with closeout stamps for each item; add the new “Phase D follow-on: EEVDF migration” item under Phase D so the deferred work is tracked.
  • Update docs/roadmap.md scheduler section to reflect Phase D landed; sequence Phase E next.
  • Update WORKPLAN.md to remove the active “Scheduler Phase D” bullet and add a “Scheduler Phase D landed (closeout)” bullet referencing the commit and the make run-thread-scale evidence pair.
  • Update docs/plans/README.md Track Map row to mark this plan completed and move it to docs/plans/completed/ per the directory’s lifecycle contract.
  • Add a docs/changelog.md entry under “Scheduler Phase D landed” with the recorded make run-thread-scale evidence pair, the matching Linux baseline, and the commit hash.
  • Validate: docs-closeout checks are git diff --check, make workflow-check, and make docs. The behavior, generated-code, and QEMU gates above were already satisfied by the implementation and benchmark slices and were intentionally not rerun for this docs-status closeout.