# Plan: Scheduler Phase D — Weighted-Fair Best-Effort Scheduling

## Overview

Implementation track for the Phase D best-effort fair-share policy
chosen in `docs/proposals/scheduler-evolution-proposal.md` "Phase D
first-policy decision (2026-05-05 19:00 UTC)". The selected policy is
weighted fair queueing (WFQ) on top of the existing per-thread
`runtime_ns` / `virtual_runtime_ns` accounting, with reintroduced
per-CPU runnable queues and a capability-authorized
`SchedulingPolicyCap` for weight and latency-class mutation. EEVDF
remains the deferred follow-on: it is revisited once the WFQ slice
has accepted thread-scale evidence, provided the open Phase E
`SchedulingContext` work has not displaced fair-share-only ordering
by then.

The proposal section linked above is the design source of truth for
the policy choice, the rejected alternative (EEVDF-first), the
capability surface, the migration fairness sketch, the test matrix,
and overload behavior. The tasks below decompose that design into
implementation work; each task ends with the matching validation
gate. The plan stays ad-hoc until it is selected, then runs as
the selected scheduler milestone.

This plan replaces the bare WORKPLAN bullet "Scheduler Phase D --
best-effort fair scheduling" once the design slice merges. Phase E
(`SchedulingContext`) and Phase F (auto-nohz / SQPOLL) keep their own
backlog/plan ownership; this plan does not extend into them.

## Conflict Surface

Owned by this plan:

- `kernel/src/sched.rs` (per-CPU `run_queue` reintroduction,
  WFQ ordering helpers, migration/steal path, capability-authorized
  weight/class mutation hooks).
- `kernel/src/process.rs` (`Thread.weight`, `Thread.latency_class`,
  `Thread.virtual_finish_ns` field additions; default values match
  the current single-class FIFO behavior).
- `kernel/src/cap/sched_policy.rs` (NEW kernel cap implementation
  for `SchedulingPolicyCap`).
- `kernel/src/cap/mod.rs` (cap registration only).
- `schema/capos.capnp` (new `LatencyClass` enum and
  `SchedulingPolicyCap` interface; queues on the shared serial
  surface per `docs/plans/README.md` Concurrency Notes).
- `tools/generated/` (regenerated capnp bindings via the existing
  `make generated-code-check` gate).
- `capos-rt/src/client.rs` (new `SchedulingPolicyClient` typed
  wrapper for the userspace runtime).
- `capos-config/src/manifest.rs` and the matching schema additions
  for manifest-granted `SchedulingPolicyCap` records.
- `tools/qemu-thread-scale-harness.sh` (only if the harness needs
  WFQ-specific assertions; if a separate fairness smoke is added,
  it gets its own `tools/qemu-*-smoke.sh`).
- `docs/proposals/scheduler-evolution-proposal.md` (status updates
  and Phase D closeout stamps only; the design content already
  landed with this plan).
- `docs/plans/scheduler-phase-d.md` (this file).
- `docs/plans/README.md` Track Map row for this plan.
- `docs/architecture/scheduling.md` (state-of-implementation
  updates as per-CPU queues and WFQ ordering land; mark
  "single global runnable queue" as historical when the per-CPU
  split returns).
- `docs/backlog/scheduler-evolution.md` Phase D bullets and the
  matching closeout stamps.

Coordinated overlap with sibling tracks:

- `schema/capos.capnp`: serialise on the shared serial surface
  per `docs/plans/README.md` Concurrency Notes. Phase D adds new
  interface entries and must not run concurrently with another
  schema-touching plan.
- `kernel/src/cap/`: this plan adds one new cap module
  (`sched_policy.rs`) and touches the cap registration list. Other
  active plans that touch `kernel/src/cap/` (Device Driver
  Foundation, POSIX P1.2/P1.3) are kernel-core-serial work; do
  not run them concurrently with this plan.
- `kernel/src/sched.rs`: this plan owns scheduler-core changes.
  Other plans must not modify the runnable queue, dispatch state,
  or weight/class fields while this plan is active.
- `kernel/src/process.rs`: this plan adds Thread fields. Other
  plans must not modify Thread state during the active slice.

Do not touch from this plan:

- `kernel/src/cap/sched_context.rs` (Phase E surface; not yet
  written, owned by the future Phase E plan).
- `kernel/src/cap/cpu_isolation_lease.rs` (Phase F surface; not
  yet written, owned by the future Phase F plan).
- `kernel/src/cap/realtime_island.rs` (Phase G surface).
- Userspace policy service (Phase H); the Phase D cap surface
  must be Phase H-consumable but Phase D does not build the
  policy service itself.
- `tools/remote-session-client/` (owned by remote-session plan).
- `docs/topics.md` (auto-regenerated; never edit manually).
- Any unrelated proposal/plan file.

## Validation Commands

- `make fmt-check`
- `make generated-code-check`
- `cargo build --features qemu`
- `cargo build --features qemu,measure`
- `cargo test-config`
- `cargo test-lib`
- `cargo test-ring-loom`
- `cargo build-demos-capos`
- `make capos-rt-check`
- `make run-smoke`
- `make run-spawn`
- `make run-smp2-smokes`
- `make run-thread-scale` (the milestone gate; must materially
  close the recorded 1-to-4 capOS-vs-Linux gap)
- `make run-smp-process-scale` (regression gate; must keep the
  recorded 1-to-2 1.6x speedup against the multi-process
  proof)
- `make run-measure` (regression gate; the new accounting
  fields must not break the existing measure-mode proof line)

## Success Criteria

Phase D is recorded done when:

1. The `SchedulingPolicyCap` interface is in `schema/capos.capnp`,
   the kernel cap implementation is in `kernel/src/cap/sched_policy.rs`,
   the manifest grant path is wired through `capos-config`, and
   the userspace typed client is in `capos-rt/src/client.rs`. A
   focused QEMU smoke proves a manifest-granted cap can mutate
   weight and latency class on a target `ThreadHandle` and that
   a stale or revoked cap fails closed.
2. Per-CPU runnable queues are reintroduced under the WFQ
   ordering rule. The single-global-queue fallback remains
   selectable via `CAPOS_SCHED_DISABLE_WFQ=1` for one bisect
   cycle and is retired before Phase E.
3. Migration preserves `virtual_runtime_ns` (already per-thread)
   and recomputes `virtual_finish_ns` at destination enqueue.
   The bounded steal path scans each sibling queue by index for
   that queue's first Runnable-for-destination entry (because each
   queue is ordered ascending by `virtual_finish_ns`, the first
   Runnable hit per queue is the lowest candidate the destination
   can accept on that source) and picks the queue whose first-
   Runnable candidate has the **lowest** `virtual_finish_ns`
   globally (the most overdue work another CPU has not yet
   dispatched), with ties broken by lower CPU id, matching the
   fair-share rule the local pick uses.
4. **Materially close the 1-to-4 capOS-vs-Linux thread-scale gap.**
   Concretely: a 5-run controlled `make run-thread-scale` against
   the post-WFQ kernel, pinned to physical-core logical CPUs
   `0,1,2,3` on `capos-bench`, must record capOS work speedup of
   at least `2.5x` at 1-to-4 (the recorded baseline is `1.566x`;
   Linux records `3.963x` against the same shape on the same
   pin set). The 1-to-2 row must keep the configured `1.6x`
   gate. Total speedup is reported as diagnostic and must not
   regress below the recorded `1.538x` 1-to-4 baseline.
5. `make run-spawn`, `make run-smp2-smokes`,
   `make run-smp-process-scale`, and `make run-measure` remain
   green. The recorded multi-process `1.6x` 1-to-2 gate from
   `2026-04-30` must hold.
6. `docs/proposals/scheduler-evolution-proposal.md` Phase D
   section, `docs/backlog/scheduler-evolution.md` Phase D
   bullets, `docs/architecture/scheduling.md`,
   `docs/changelog.md`, `WORKPLAN.md`, and `docs/roadmap.md`
   carry the closeout stamp with commit hash and minute-precision
   timestamp.

The plan is **not** scoped to deliver Phase E
(`SchedulingContext` budget/period authority), Phase F
(`CpuIsolationLease` and SQPOLL nohz), Phase G
(`RealtimeIsland`), or Phase H (userspace policy service). Those
are sequenced after Phase D and own their own plan files.

### Task 1: Schema and capability surface

Status: landed 2026-05-07 at commit cb8c58b1
(`sched(phase-d-task1): schema + capability surface`). Tasks 2-3
are unblocked. Tasks 4-6 are unblocked by Task 3's closeout
(2026-05-07 23:45 UTC).

- [x] Add the `invalidArgument` variant to the existing
      `ExceptionType` enum in `schema/capos.capnp`. The current
      enum has only `failed`/`overloaded`/`disconnected`/
      `unimplemented`; `setWeight` policy denial below needs a
      distinct typed signal (caller bug rejection vs general
      failure vs back-pressure). This addition is part of the
      Phase D schema-surface acquisition documented in the
      proposal Phase D capability surface section. Keep the
      variant ordering stable for ABI compatibility.
- [x] Add the `LatencyClass` enum (`interactive`, `normal`,
      `batch`, `ipcServer`) to `schema/capos.capnp` and
      regenerate bindings via `make generated-code-check`.
- [x] Add the `SchedulingPolicyCap` interface with `setWeight`,
      `setLatencyClass`, and `snapshot` methods. The snapshot
      return is narrow: `weight`, `class`, `runtimeNs`,
      `virtualRuntimeNs`. Those four fields are the ones Task 2
      promotes out of `cfg(feature = "measure")` unconditionally.
      Do NOT add `contextSwitches`, `preemptions`,
      `voluntaryBlocks`, or `migrations` to the ABI in this slice;
      those counters stay benchmark-only and would either fail to
      compile in the normal `qemu` build or expose fields the
      kernel does not track. A future operator-observability slice
      may add them through a separate snapshot cap.
- [x] Implement `setWeight` validation at the cap boundary
      (not the dispatch path) with the rule from the proposal:
      `weight = 0` and any nonzero value outside
      `[MIN_WEIGHT, MAX_WEIGHT]` (Phase D constants) are
      rejected with `CapException::InvalidArgument`. The
      kernel does NOT silently clamp out-of-range values; a
      future caller/test can rely on the rejection signal.
      This ensures no later divide-by-zero or overflow path
      is reachable through the cap. Implementation lives in
      `kernel/src/cap/sched_policy.rs`; the typed exception
      flows through the existing `kernel/src/cap/ring.rs`
      dispatcher via a sentinel-prefix channel because capnp
      0.25 has no `ErrorKind::InvalidArgument` variant and the
      enum is `#[non_exhaustive]`.
- [x] Add a `KernelCapSource::SchedulingPolicy` variant under
      the manifest grant path so a manifest can grant the cap to
      a named process. Phase D grants the cap only to focused-
      proof manifests (`system-thread-fairness.cue` and similar
      Task 5 smokes); the default boot manifest does NOT grant
      the cap in this slice. Wider authority (cross-process
      weight/class mutation, default-grant to a userspace policy
      service) belongs to the future Phase H plan.
- [x] Add a `capos-rt::client::SchedulingPolicyClient` typed
      wrapper that maps transport errors and `CapException`
      decode shape consistently with the existing clients.
- [x] Validate: `make fmt-check`, `make generated-code-check`,
      `cargo test-config`, `cargo test-lib`,
      `cargo build --features qemu` (warning-free).
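
The boundary rule above can be sketched as follows (a minimal
sketch: `CapError` here is an illustrative stand-in for the typed
`CapException::InvalidArgument` signal, and the `MIN_WEIGHT` /
`MAX_WEIGHT` values are assumptions, not the actual Phase D
constants):

```rust
// Hypothetical stand-ins: the real constants live in capos-abi and
// the typed rejection surfaces as CapException::InvalidArgument.
const MIN_WEIGHT: u16 = 1;
const MAX_WEIGHT: u16 = 1024;

#[derive(Debug, PartialEq)]
enum CapError {
    InvalidArgument,
}

/// Validate at the cap boundary, never in the dispatch path:
/// weight 0 and any nonzero value outside [MIN_WEIGHT, MAX_WEIGHT]
/// are rejected rather than clamped, so no divide-by-zero path is
/// reachable through the cap.
fn validate_weight(weight: u16) -> Result<u16, CapError> {
    if weight == 0 || !(MIN_WEIGHT..=MAX_WEIGHT).contains(&weight) {
        return Err(CapError::InvalidArgument);
    }
    Ok(weight)
}
```

Rejecting instead of clamping keeps the caller-bug signal distinct:
a future caller or test can rely on the typed failure rather than
silently observing a clamped value.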

### Task 2: Per-thread weight and latency-class state

Closeout: 2026-05-07 22:51 UTC. Cap-state binding decision:
context-derived caller-thread fallback. The cap routes every
method (`setWeight`, `setLatencyClass`, `snapshot`) to
`CapCallContext::caller_thread`; cross-thread/cross-process
mutation is deferred to Phase H. Phase D constants moved from
`kernel/src/cap/sched_policy.rs` into `capos-abi/src/scheduler.rs`
(`MIN_WEIGHT`, `MAX_WEIGHT`, `DEFAULT_WEIGHT`, `REFERENCE_WEIGHT`,
plus the new `INTERACTIVE_SLICE_DIVISOR = 2` and
`BATCH_SLICE_MULTIPLIER = 4`).

- [x] Add `weight: u16` and `latency_class: LatencyClass` fields
      to `Thread` (in `kernel/src/process.rs`), with default
      values matching the current single-class behavior
      (`weight = DEFAULT_WEIGHT`, `latency_class =
      LatencyClass::Normal`). These fields must be unconditional
      (not behind `cfg(feature = "measure")`) because they
      participate in dispatch ordering.
- [x] Promote `runtime_ns`, `virtual_runtime_ns`, and
      `last_started_ns` from `ThreadCpuAccounting` out of
      `cfg(feature = "measure")` so the WFQ ordering, the
      runtime-charge path, and the `snapshot` cap method work in
      the normal `qemu` build. The `context_switches`,
      `preemptions`, `voluntary_blocks`, and `migrations`
      counters stay behind the `measure` feature and are NOT
      exposed through `SchedulingPolicyCap.snapshot` in this
      slice. Documented in `docs/architecture/scheduling.md`.
- [x] Change the `charge_runtime` step so `virtual_runtime_ns`
      advances by `elapsed_ns * REFERENCE_WEIGHT / weight`
      instead of 1:1 with `elapsed_ns`. `runtime_ns` continues to
      advance 1:1 with elapsed time so monotonic CPU accounting
      and `snapshot.runtimeNs` are unchanged. This is the actual
      fairness mechanism; without it, weights affect only enqueue-
      order ties rather than cumulative share.
- [x] Add `virtual_finish_ns: u64` derived per enqueue and not
      stored across blocking. The derivation rule depends on
      `latency_class` per the proposal's "Latency-class semantics
      for Phase D" subsection: `Normal` and `IpcServer` use
      `vruntime + slice_ns * REFERENCE_WEIGHT / weight`;
      `Interactive` uses
      `vruntime + (slice_ns / INTERACTIVE_SLICE_DIVISOR) *
      REFERENCE_WEIGHT / weight`; `Batch` uses
      `vruntime + (slice_ns * BATCH_SLICE_MULTIPLIER) *
      REFERENCE_WEIGHT / weight`. `slice_ns` is the existing
      `crate::arch::context::TICK_NS` quantum;
      `REFERENCE_WEIGHT`, `MIN_WEIGHT`, `MAX_WEIGHT`,
      `DEFAULT_WEIGHT`, `INTERACTIVE_SLICE_DIVISOR`, and
      `BATCH_SLICE_MULTIPLIER` live in `capos-abi/src/scheduler.rs`.
- [x] Add the kernel-side mutation entry points behind the
      `SchedulingPolicyCap` dispatch only. No ambient process
      field, no per-process default, no syscall path that
      bypasses the cap. Caller-thread binding through
      `CapCallContext::caller_thread`; idle thread and stale
      thread refs return the standard `CallerNotLive` failure
      that surfaces to userspace as the disconnected-class
      `CapException` taxonomy entry.
- [x] Validate: `cargo build --features qemu`,
      `cargo build --features qemu,measure`, `cargo test-lib`,
      `make capos-rt-check`. Regression: `make run-smoke`,
      `make run-spawn`, `make run-measure`.
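
The weighted-charge and finish-tag arithmetic above can be
sketched as follows (assumed constant values; the real ones live in
`capos-abi/src/scheduler.rs` and `crate::arch::context::TICK_NS`,
and only the shape of the arithmetic is illustrated):

```rust
// Illustrative values; only the arithmetic shape is asserted here.
const REFERENCE_WEIGHT: u64 = 64;
const INTERACTIVE_SLICE_DIVISOR: u64 = 2;
const BATCH_SLICE_MULTIPLIER: u64 = 4;
const TICK_NS: u64 = 10_000_000; // assumed 10 ms quantum

#[derive(Clone, Copy)]
enum LatencyClass { Interactive, Normal, Batch, IpcServer }

/// Weighted charge: vruntime advances slower for heavier threads,
/// so a weight-128 thread accrues virtual time at half the rate of
/// weight-64. Wall-clock runtime_ns stays 1:1 and is charged
/// separately.
fn charge_vruntime(vruntime_ns: u64, elapsed_ns: u64, weight: u16) -> u64 {
    vruntime_ns + elapsed_ns * REFERENCE_WEIGHT / weight as u64
}

/// Finish tag derived at every enqueue from the current
/// class/weight/vruntime triple; never stored across blocking.
fn virtual_finish_ns(vruntime_ns: u64, weight: u16, class: LatencyClass) -> u64 {
    let slice_ns = match class {
        LatencyClass::Interactive => TICK_NS / INTERACTIVE_SLICE_DIVISOR,
        LatencyClass::Normal | LatencyClass::IpcServer => TICK_NS,
        LatencyClass::Batch => TICK_NS * BATCH_SLICE_MULTIPLIER,
    };
    vruntime_ns + slice_ns * REFERENCE_WEIGHT / weight as u64
}
```

Without the weighted charge, weights would affect only
enqueue-order ties rather than cumulative share, which is why the
`charge_runtime` change is the actual fairness mechanism.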

### Task 3: Per-CPU run queues and WFQ ordering

Closeout: 2026-05-07 23:45 UTC. Per-CPU `run_queues:
[VecDeque<ThreadRef>; SCHEDULER_CPUS]` reintroduced, ordered
ascending by `Thread.virtual_finish_ns` via linear-scan insert
(chosen for simplicity at `SCHEDULER_CPUS = 4`; promotion to a
smarter structure deferred until benchmark evidence requires it).
Each per-CPU queue is reserved to the live runnable-capable thread
count before publication so the bounded steal path can migrate
every live thread into a single sibling queue without allocating in
timer, unblock, direct-IPC fallback, or steal-requeue paths.

Local selection scans the local queue by index for the first
destination-Runnable entry, leaving RetryLater entries in place so
the dispatch pass cannot starve runnable entries behind a
non-runnable head whose `virtual_finish_ns` has not changed. The
bounded steal path scans each sibling queue by index for that
queue's first Runnable-for-destination entry, then picks the queue
whose first-Runnable candidate has the lowest `virtual_finish_ns`
globally (ties by lower CPU id), which prevents stranded runnable
work behind a sibling-head RetryLater or single-CPU-owner
constraint. The chosen entry is removed from its actual position on
the source queue, the WFQ tag is recomputed at the destination, and
the entry is inserted at the destination's ordered position.

`WakePolicy::QueueCpu(u32)` is reinstated for endpoint, timer,
park, process-wait, thread-join, and process-spawn completions; the
fallback `WakePolicy::QueueAny` survives only as the build-time
opt-out under `CAPOS_SCHED_DISABLE_WFQ=1` (`option_env!`). The
steal path remains active under the opt-out so siblings drain
queue 0, restoring the pre-Task-3 single-global-queue behaviour on
SMP. Migration-counter increments and the weight-change-after-block
proof remain in Task 4 scope; the milestone gate
(`make run-thread-scale` at `2.5x` 1-to-4) remains in Task 6 scope.

- [x] Reintroduce `SchedulerDispatch.run_queues:
      [VecDeque<ThreadRef>; SCHEDULER_CPUS]` (the per-CPU
      bounded queues retired in the 2026-05-02 collapse). Reuse
      the documented runnable-ownership invariants in
      `docs/architecture/scheduling.md` (single dispatch owner
      per live `ThreadRef` across per-CPU `current`/
      `handoff_current` slots, the per-CPU run queues, and the
      direct IPC target slot).
- [x] Reintroduce the per-CPU live-reservation accounting that
      the pre-collapse design used: reserve all per-CPU queues
      to the live runnable-capable thread count before
      publication; release on process/thread exit or
      pre-publication rollback. Timer, unblock, direct-IPC
      fallback, and steal-requeue paths must remain
      allocation-free.
- [x] Order each per-CPU `VecDeque` ascending by
      `virtual_finish_ns`: enqueue inserts at the ordered
      position, selection picks the front (lowest
      `virtual_finish_ns` = most overdue against fair share).
      The exact ordering structure (sorted insert vs. small
      bucket array) is an implementation choice; document the
      decision in `docs/architecture/scheduling.md`. Linear-
      scan insert chosen.
- [x] Restore the bounded steal path: a CPU whose local queue
      has no runnable entry walks sibling per-CPU queues
      bounded by `SCHEDULER_CPUS`. The scan walks each sibling
      queue's indices ascending for that queue's first
      Runnable-for-destination entry; because the queue is
      ordered ascending by `virtual_finish_ns`, the first hit
      is the lowest `virtual_finish_ns` candidate the
      destination can accept on that source. The steal target
      is the queue whose first-Runnable candidate has the
      **lowest** `virtual_finish_ns` globally — that is the
      most overdue thread another CPU has not yet dispatched,
      and the same selection rule the local pick uses. Ties
      break by lower CPU id. The steal removes that entry
      from its actual position on the source queue (not
      necessarily the head: a RetryLater or single-CPU-owner
      thread may sit at the front and stay there), recomputes
      `virtual_finish_ns` at the destination, and inserts at
      the destination ordered position. The first-Runnable-
      per-queue scan is required so a non-runnable sibling
      head does not strand later runnable entries behind it.
- [x] Restore `WakePolicy::QueueCpu(u32)` (or the WFQ
      equivalent placement variant) so endpoint, timer, park,
      process-wait, and thread-join completions can target a
      specific per-CPU queue. The single-global-queue
      `WakePolicy::QueueAny` remains as the fallback under
      `CAPOS_SCHED_DISABLE_WFQ=1`.
- [x] Add `CAPOS_SCHED_DISABLE_WFQ=1` as a build-time opt-out
      (`option_env!`) for one bisect cycle; remove before
      Phase E.
- [x] Validate: `cargo build --features qemu`,
      `cargo build --features qemu,measure`, `cargo test-lib`,
      `cargo test-ring-loom`, `make run-spawn`,
      `make run-smp2-smokes`. Plus regression: `make run-smoke`,
      `make run-measure`, `make capos-rt-check`,
      `make fmt-check`, `make generated-code-check`,
      `cargo test-config`.
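
The ordered-insert and bounded-steal selection rules can be
sketched as follows (a simplified model: `Entry.runnable` stands in
for the full Runnable-for-destination check, and the real queues
hold `ThreadRef`s under the dispatch lock):

```rust
use std::collections::VecDeque;

// Minimal stand-in for a queued thread; `runnable` models the
// Runnable-for-destination check (RetryLater => false).
struct Entry {
    virtual_finish_ns: u64,
    runnable: bool,
}

/// Ordered insert: each per-CPU queue stays ascending by
/// virtual_finish_ns. Linear scan keeps it simple at
/// SCHEDULER_CPUS = 4 and small queue depths.
fn enqueue_ordered(queue: &mut VecDeque<Entry>, entry: Entry) {
    let pos = queue
        .iter()
        .position(|e| e.virtual_finish_ns > entry.virtual_finish_ns)
        .unwrap_or(queue.len());
    queue.insert(pos, entry);
}

/// Bounded steal: across sibling queues, take the first-Runnable
/// entry with the globally lowest virtual_finish_ns. Ties break by
/// lower CPU id because the scan visits queues in ascending id
/// order and uses strict `<`. Returns (queue index, position).
fn pick_steal(queues: &[VecDeque<Entry>]) -> Option<(usize, usize)> {
    let mut best: Option<(usize, usize, u64)> = None;
    for (cpu, queue) in queues.iter().enumerate() {
        // The first Runnable hit per queue is that queue's lowest
        // candidate, since the queue is ordered ascending.
        if let Some(pos) = queue.iter().position(|e| e.runnable) {
            let tag = queue[pos].virtual_finish_ns;
            if best.map_or(true, |(_, _, t)| tag < t) {
                best = Some((cpu, pos, tag));
            }
        }
    }
    best.map(|(cpu, pos, _)| (cpu, pos))
}
```

Scanning past a non-runnable head is the point of the first-Runnable
per-queue rule: a RetryLater entry at a sibling's front must not
strand the runnable entries queued behind it.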

### Task 4: Migration fairness and weight propagation

Closeout: 2026-05-08 00:53 UTC.

- [x] Verify (and document) that `virtual_runtime_ns` travels
      with the thread on every migration. The accounting record
      already encodes this; the WFQ enqueue path must
      explicitly recompute `virtual_finish_ns` from the
      vruntime and weight at the destination, never carry it as
      committed state. Verified by tracing every enqueue site
      (`push_reserved_run_queue_locked` for the initial-publish
      and post-block arms, plus the steal-insert at
      `steal_from_sibling_queues_locked`); each routes through
      `refresh_virtual_finish_ns_locked`, which reads
      `thread.weight`, `thread.latency_class`, and
      `thread.cpu_accounting.virtual_runtime_ns` fresh and writes
      `Thread.virtual_finish_ns`. The function bears an explicit
      doc-comment asserting the invariant; the steal site bears a
      matching block comment.
- [x] Increment `ThreadCpuAccounting.migrations` on each
      cross-CPU enqueue, both for placement-time spread and
      for steal. Mirror the pre-collapse counter shape.
      Implemented as
      `record_placement_spread_migration_locked` (called from
      `push_reserved_run_queue_locked` when target slot
      differs from `ThreadCpuAccounting.last_cpu`) and
      `record_steal_migration_locked` (called from the steal
      arm unconditionally, since the scan skips the destination
      slot). The counter remains
      `cfg(feature = "measure")`-gated; the dispatch-time
      `scheduled_measure` path no longer increments migrations
      and now only updates `last_cpu` so the enqueue-time check
      has the previous CPU available. Steady-state shape mirrors
      the pre-collapse counter (a thread that runs on a
      different CPU than its previous run still records exactly
      one migration); the increment is now attributed to the
      enqueue decision rather than the dispatch that follows.
- [x] Prove that a thread whose weight changes through
      `SchedulingPolicyCap.setWeight` while it is enqueued
      observes the new weight on the next dequeue and
      re-enqueue; the weight must not be cached in
      `virtual_finish_ns` across blocking. Proved by
      construction: `setWeight` writes `Thread.weight` directly
      without touching `virtual_finish_ns`, and every enqueue
      site refreshes the WFQ tag from the current weight/
      latency-class/vruntime triple. Reinforced by an inline
      `debug_assert!` in
      `Process::refresh_thread_virtual_finish_ns` that the
      recomputed `virtual_finish_ns` is at or beyond the
      current `virtual_runtime_ns` (a future deadline, never a
      past one). A focused QEMU smoke that drives `setWeight`
      against an enqueued thread and verifies the post-block
      dispatch ordering is recorded as a Task 5 follow-up; see
      `docs/architecture/scheduling.md` Phase D Task 4 section.
- [x] Validate: `make run-spawn`, `make run-smp2-smokes`,
      `make run-thread-scale` (single-iteration functional
      check; the milestone gate runs in Task 6).
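
The never-cache-the-tag invariant can be sketched as follows
(illustrative `Thread` shape and constants; the real refresh is
`refresh_virtual_finish_ns_locked` reading `ThreadCpuAccounting`,
and the latency-class slice factor is omitted here for brevity):

```rust
// Assumed values mirroring the capos-abi constants.
const REFERENCE_WEIGHT: u64 = 64;
const SLICE_NS: u64 = 10_000_000;

struct Thread {
    weight: u16,
    virtual_runtime_ns: u64,
    virtual_finish_ns: u64,
}

/// Every enqueue site routes through this refresh: the WFQ tag is
/// derived fresh from the current weight and vruntime, never
/// carried across blocking, so a setWeight issued while the thread
/// was enqueued or blocked is observed on the very next enqueue.
fn refresh_virtual_finish_ns(thread: &mut Thread) {
    thread.virtual_finish_ns = thread.virtual_runtime_ns
        + SLICE_NS * REFERENCE_WEIGHT / thread.weight as u64;
    // The recomputed tag is a future deadline, never a past one.
    debug_assert!(thread.virtual_finish_ns >= thread.virtual_runtime_ns);
}
```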

### Task 5: Test matrix smokes

- [x] **CPU hogs.** Reuse `make run-thread-scale` for the
      equal-weight functional path. The existing harness already
      records per-case `work` and `total` cycles and the
      `2026-05-02 21:38 UTC` 1-to-2 baseline still gates the
      `1.6x` evidence under WFQ. The differing-weights focused
      proof lives in the NEW `make run-thread-fairness` target
      driven by `system-thread-fairness.cue` and the demo at
      `demos/thread-fairness/`. The demo spawns three worker
      threads at WFQ weights `128:64:64` (the `capos-abi`
      `DEFAULT_WEIGHT=64` doubled for the heavy worker keeps
      the assertion `2:1:1` while staying inside
      `MIN_WEIGHT..=MAX_WEIGHT`), runs each worker as a
      CPU-hog spinner under a fixed wallclock window, asks the
      kernel for each thread's `runtime_ns` via
      `SchedulingPolicyCap.snapshot`, and asserts that the
      observed ratio falls inside a `±20%` tolerance window
      around the weight-proportional target. The harness
      `tools/qemu-thread-fairness-smoke.sh assert fairness`
      checks the demo emitted its `[thread-fairness] window_ns=`
      summary line with three nonzero `runtime_ns` values and
      that the demo's per-worker tolerance pass `fairness ratio
      ok within 20%` succeeded.
- [x] **Short sleepers.** Closed by the
      `make run-thread-fairness-interactive` target driven by
      `system-thread-fairness-interactive.cue`. The same demo
      binary spawns one CPU-hog worker (default weight, Normal
      class) plus one Timer-sleeper worker (default weight,
      Interactive latency class). The sleeper repeatedly calls
      `Timer.sleep` for a known short interval, computes
      observed wake-to-run latency as
      `now_after - now_before - sleep_ns`, drops the first four
      "settle" rounds, and asserts that the maximum observed
      latency stays below `4 * TICK_NS` (40 ms). The harness
      `tools/qemu-thread-fairness-smoke.sh assert interactive`
      verifies the demo's bound check passed and the
      `interactive latency ok max=` summary line was printed.
      The bound is intentionally generous at `4 * TICK_NS` so
      the flake rate stays acceptable on KVM-less QEMU; tighten
      it with bench evidence in a follow-up if needed.
- [x] **Direct IPC server/client pairs.** `make run-spawn`
      (and its qemu-spawn-smoke harness) remains the
      regression gate. The direct-IPC preference slot's
      generation-checked semantics are unchanged under WFQ
      (Task 3 review confirmed: the `WakePolicy::QueueCpu`
      placement intent travels through the same direct-IPC
      handoff, and the per-CPU dispatch still polls the
      preference slot before the run-queue front), so the
      Task 5 contribution here is a regression assertion
      via the existing harness rather than new explicit
      paired-call timing. A timing-delta assertion against a
      historical baseline is recorded as a Task 5 follow-up
      (would require recording paired-call medians per build
      and a per-host noise window; out of scope for this
      slice).
- [x] **Multi-process load.** Reuse `make run-smp-process-scale`.
      The recorded `1.6x` 1-to-2 gate continues to hold under
      WFQ. If a future run trips the gate, the failure blocks
      Task 6 progression and indicates a WFQ regression that
      must be diagnosed (steal-scan cost, weight-application
      latency, per-CPU queue contention) rather than relaxed.
- [x] **Weight-change-while-enqueued QEMU smoke** (Task 4
      deferral). Closed by the
      `make run-thread-fairness-weight-change` target driven
      by `system-thread-fairness-weight-change.cue`. Two
      competing child threads run a fixed wallclock window:
      the baseline worker stays at `DEFAULT_WEIGHT`, while the
      heavy worker self-calls
      `SchedulingPolicyCap.setWeight(weight=128)` and then
      blocks on `Timer.sleep` so it leaves the run queue
      before the contention window opens. Each worker
      snapshots its scheduler state at wake and at window end
      via `SchedulingPolicyCap.snapshot`, and the parent
      asserts three independent things: (1) the heavy
      snapshot reads `weight == 128` and the baseline reads
      `weight == DEFAULT_WEIGHT`; (2) the observed
      `runtime_ns` ratio under contention matches the weight
      ratio (target 2:1) within `±25%`; (3) the heavy
      worker's `virtual_runtime_ns` advances at half the rate
      of its `runtime_ns` (vruntime/runtime ~= 0.5 for
      weight=128 vs ~= 1.0 for the DEFAULT_WEIGHT baseline)
      within `±30%`. The third check is the smoking gun for
      a stale-weight regression: a scheduler that kept the
      pre-`setWeight` weight inside `charge_runtime` would
      yield heavy vruntime/runtime ~= 1.0 instead of ~= 0.5,
      and the assertion would trip even if WFQ ordering self-
      corrected after the first dispatch. Together with the
      runtime-ratio assertion this exercises the Task 4
      invariant that every enqueue site (and dispatch
      charging path) reads the current
      `weight`/`latency_class`/`virtual_runtime_ns` triple
      rather than reusing a cached value. Note that
      `SchedulingPolicyCap` is bound to
      `CapCallContext::caller_thread` per Task 2, so the
      thread mutating its own weight is the only authorized
      shape for the proof; cross-thread weight mutation is a
      Phase H privileged scheduler-policy service concern.
- [x] **Same-process sibling load.** This is the same shape
      as `make run-thread-scale` from Task 6; the milestone
      gate covers it.
- [x] Validate: each new smoke under `make run-*` passes;
      existing smokes remain green.
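
The weight-proportional tolerance check the fairness demos perform
can be sketched as follows (a sketch of the assertion shape only;
the demos' actual summary-line format and field names are not
reproduced here):

```rust
/// Weight-proportional fairness check: given per-worker runtime_ns
/// and WFQ weights, require each worker's observed share of total
/// runtime to fall within `tolerance` (relative) of its
/// weight-proportional target; the CPU-hog demo uses 0.20 (20%).
fn fairness_ok(runtimes_ns: &[u64], weights: &[u16], tolerance: f64) -> bool {
    let total_runtime: u64 = runtimes_ns.iter().sum();
    let total_weight: u64 = weights.iter().map(|&w| w as u64).sum();
    if total_runtime == 0 || total_weight == 0 {
        return false;
    }
    runtimes_ns.iter().zip(weights).all(|(&rt, &w)| {
        let observed = rt as f64 / total_runtime as f64;
        let target = w as f64 / total_weight as f64;
        (observed - target).abs() <= tolerance * target
    })
}
```

At weights `128:64:64` the targets are `0.5 : 0.25 : 0.25`; an
equal three-way runtime split (each worker at one third) falls
outside the 20% window around the heavy worker's 0.5 target and
fails the check.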

Closeout: 2026-05-08 02:00 UTC. Test infrastructure only; no
kernel scheduler logic changes.
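
The sleeper-side latency bound from the interactive smoke can be
sketched as follows (assumed `TICK_NS` value; the `(before, after)`
pairs stand in for the demo's timestamp reads around each
`Timer.sleep` call):

```rust
const TICK_NS: u64 = 10_000_000; // assumed 10 ms tick

/// Wake-to-run latency check for the Timer-sleeper worker: each
/// round's observed latency is (after - before) - sleep_ns. The
/// first `settle` rounds are dropped, and the maximum of the rest
/// must stay below 4 * TICK_NS.
fn interactive_latency_ok(rounds: &[(u64, u64)], sleep_ns: u64, settle: usize) -> bool {
    rounds
        .iter()
        .skip(settle)
        .map(|&(before, after)| (after - before).saturating_sub(sleep_ns))
        .all(|latency_ns| latency_ns < 4 * TICK_NS)
}
```

Dropping the settle rounds keeps cold-start scheduling noise out of
the bound, which is what makes the generous `4 * TICK_NS` ceiling
hold on KVM-less QEMU.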

### Task 6: Milestone gate — controlled `make run-thread-scale`

- [ ] Run a 5-run controlled `make run-thread-scale` on
      `capos-bench`, pinned to physical-core logical CPUs
      `0,1,2,3`, against the post-WFQ kernel. Use the same
      benchmark shape as the recorded `2026-05-02 21:38 UTC`
      pair: blocking parent join, 262,144 blocks (16 MiB),
      `work_rounds=64`,
      `CAPOS_THREAD_SCALE_REQUIRE_SPEEDUP=1`,
      `CAPOS_THREAD_SCALE_REQUIRE_TOTAL_SPEEDUP=1`.
- [ ] Required outcome: capOS work speedup of at least
      `2.5x` at 1-to-4 (the recorded baseline is `1.566x`).
      The 1-to-2 row must keep the configured `1.6x` gate.
      Total speedup is reported as diagnostic and must not
      regress below the recorded `1.538x` 1-to-4 baseline.
- [ ] Rerun the matching Linux pthread baseline
      (`make run-linux-thread-scale-baseline`) on the same
      pin set so the comparison stays apples-to-apples; the
      Linux number is informational, not a gate.
- [ ] Capture raw artifacts under
      `target/thread-scale/<timestamp>/` and
      `target/linux-thread-scale/<timestamp>/`. Record the
      pair in `docs/changelog.md` Phase D entry.
- [ ] If the gate is not met: do not weaken the threshold.
      Diagnose the remaining bottleneck (scheduler-lock hold
      time, steal scan cost, weight-application latency,
      per-CPU queue contention) and submit a follow-up slice
      under this plan; the gate stays at `2.5x`.

### Task 7: Documentation and closeout

- [ ] Update `docs/architecture/scheduling.md` to describe
      the per-CPU runnable queue, the WFQ ordering rule,
      the migration/steal contract, the
      `SchedulingPolicyCap` cap surface, and the new
      runnable-ownership invariants. Mark the
      single-global-queue `WakePolicy::QueueAny` and
      `CAPOS_SCHED_DISABLE_WFQ=1` fallback as historical
      once retired.
- [ ] Update `docs/proposals/scheduler-evolution-proposal.md`
      Stage 3 status to "first slice landed" with commit hash
      and minute-precision timestamp; keep the EEVDF deferred
      follow-on note.
- [ ] Update `docs/backlog/scheduler-evolution.md` Phase D
      bullets with closeout stamps for each item; add the new
      "Phase D follow-on: EEVDF migration" item under Phase D
      so the deferred work is tracked.
- [ ] Update `docs/roadmap.md` scheduler section to reflect
      Phase D landed; sequence Phase E next.
- [ ] Update `WORKPLAN.md` to remove the active "Scheduler
      Phase D" bullet and add a "Scheduler Phase D landed
      (closeout)" bullet referencing the commit and the
      `make run-thread-scale` evidence pair.
- [ ] Update `docs/plans/README.md` Track Map row to mark
      this plan completed and move it to
      `docs/plans/completed/` per the directory's lifecycle
      contract.
- [ ] Add a `docs/changelog.md` entry under "Scheduler Phase
      D landed" with the recorded `make run-thread-scale`
      evidence pair, the matching Linux baseline, and the
      commit hash.
- [ ] Validate: every command under "Validation Commands"
      above passes; the closeout commit lands clean.
