Scheduler Evolution Backlog

This backlog decomposes future scheduler architecture from Scheduler Evolution. It also retains the completed attribution and placement history that closed the In-Process Threading Scalability milestone; new selected-milestone work now continues from the loopyard board (https://tasks.cap-os.dev/p/capos/board).

Design Grounding Checklist

Before implementation slices, read:

docs/architecture/scheduling.md
docs/backlog/smp-phase-c.md
docs/proposals/smp-proposal.md
docs/proposals/ring-v2-smp-proposal.md
docs/proposals/tickless-realtime-scheduling-proposal.md
docs/proposals/stateful-task-job-graphs-proposal.md
docs/proposals/scheduler-evolution-proposal.md
docs/proposals/system-performance-benchmarks-proposal.md
docs/proposals/hpc-parallel-patterns-proposal.md
docs/research/future-scheduler-architecture.md
docs/research/nohz-sqpoll-realtime.md
docs/research/out-of-kernel-scheduling.md
docs/research/completion-ring-threading.md
docs/research/hpc-parallel-patterns.md

For realtime or isolation slices, also read:

docs/research/multimedia-pipeline-latency.md
docs/research/robotics-realtime-control.md
docs/research/x2apic-and-virtualization.md

Phase A: Attribution and Guardrails

Finish first-pass thread-scale attribution guardrails. Scheduler candidate/outcome, reschedule-IPI, serial-byte, scheduler-lock, timer interrupt, CR3/TLB, raw guest-PC sample, logging-suppression A/B, exact Linux pthread baseline, compact-versus-padded result-slot diagnostic, and larger-workload/Amdahl evidence now exist. The evidence does not identify the primary remaining non-scaling cause; it keeps per-CPU runnable ownership, accepted threshold-passing work/total evidence, and optional symbolic attribution as follow-on work.
Add bounded scheduler-lock site attribution before a structural lock split. As of 2026-05-01 09:52 UTC, measure builds keep the compatible aggregate scheduler_lock line and also emit aggregate plus per-phase scheduler_lock_site counters for generic, timer pre-ring, timer select, blocking, process exit, thread exit, start/idle selection, wake/unblock, and metadata classes. This is split-prep attribution only; it does not accept the in-process thread-scale milestone.
Add timer-fast-path attribution for the bounded continuation path. As of 2026-05-01 10:58 UTC, measure builds extend the aggregate and per-phase timer counter lines with fast-path attempts, continues, and fallback reasons for slow-required/dirty summaries, skip-budget exhaustion, pending reschedule IPIs, no-current/non-idle CPUs, and inactive/invalid scheduler CPUs. The thread-scale harness requires those fields only for CAPOS_THREAD_SCALE_GUEST_MEASURE=1. This is attribution only; it does not change scheduler behavior and does not close the current accepted=false work or total gates. Local one-run evidence in target/thread-scale/20260501T110157Z/ passed with the new fields present in every 1/2/4-thread measure.log; the timed work phase recorded fast_path_continues=0 for all three rows.
Add timer slow-summary reason attribution for dirty fast-path summaries. As of 2026-05-01 11:28 UTC, measure builds emit aggregate and per-phase timer_slow_summary lines with required/clean counts plus reason fields for nonempty run queues, direct IPC targets, handoff-current state, pending process termination/drop/stack release, timer sleeps, and timed cap-enter versus park waiters. The harness requires those lines only for CAPOS_THREAD_SCALE_GUEST_MEASURE=1. Local one-run evidence in target/thread-scale/20260501T112359Z/ passed with the new lines present in every 1/2/4-thread measure.log; the timed work phase reported dirty summaries attributable to run_queue_nonempty and handoff_current only, with required=2/4/8, clean=0, and timer sleeps/timed waiters at zero for the 1/2/4-thread rows. The subsequent fairness-only behavior slice keeps the same fields, but required now means direct IPC, deferred cleanup, timer sleeps, or timed waiter work still force the next locked timer pass.
Complete thread-scale shared-kernel-state contention attribution guardrails beyond the first measure-only lock-counter slice. As of 2026-05-01 08:07 UTC, CAPOS_THREAD_SCALE_GUEST_MEASURE=1 emits aggregate and per-phase shared_kernel_lock counters for frame allocator alloc/free locks, ring-dispatch cap-table and ring-scratch locks before cap::ring::process_ring, endpoint inner/cancellation scratch locks, direct per-process address-space locks, and heap allocator locks. As of 2026-05-01 08:29 UTC, fresh thread-scale rows also carry explicit benchmark-class fields and the harness requires, validates, and exports those fields to results.csv; local one-run evidence is retained in target/thread-scale/20260501T083254Z/. As of 2026-05-01 08:49 UTC, guest-measure runs also emit and require aggregate and per-phase network_poll counters for initialized virtio-net scheduler/runtime/interface polling, the built-in TCP HTTP proof poll, virtqueue poll spins and completions, and pending network waiter scans. Local one-run evidence in target/thread-scale/20260501T093505Z/ passed and retained zero aggregate and per-phase network_poll counters for the 1/2/4-thread rows. The default thread-scale manifest has no virtio-net device, and the scheduler poll entry returns before the driver mutex in that no-device case, so zero counters are the expected evidence for the CPU-bound thread-scale benchmark. They do not prove service throughput; future service/network benchmarks still need their own hot-section attribution and acceptance evidence.
Add a benchmark-kernel mode that suppresses per-context-switch logging during measured cases so serial MMIO cannot masquerade as scheduler cost. Completed with CAPOS_THREAD_SCALE_SUPPRESS_SWITCH_LOGS=1; benchmark proof/error output and measure lines remain enabled.
Decide which counters are permanent observability and which stay behind measure. Completed 2026-05-01 04:55 UTC in docs/architecture/scheduling.md: all existing kernel/src/measure.rs counters remain benchmark-only behind the measure feature. Permanent scheduler observability should be added later through a separate low-overhead operator snapshot surface after the Phase C runtime accounting ledger exists, starting with runtime, context-switch, preemption, voluntary-block, migration, queue-depth, reschedule-IPI, TLB-shootdown, and policy admission/denial counts. Phase/cycle attribution, scheduler-lock wait/hold cycles, serial byte attribution, timer/TLB benchmark totals, raw user-PC samples, and thread-scale phase checkpoints stay behind CAPOS_THREAD_SCALE_GUEST_MEASURE=1. Grounding read: docs/architecture/scheduling.md, docs/proposals/scheduler-evolution-proposal.md, docs/research/future-scheduler-architecture.md, docs/research/out-of-kernel-scheduling.md, docs/research/nohz-sqpoll-realtime.md, and docs/research/completion-ring-threading.md.
Record controlled benchmark-VM evidence before and after each scheduler structure change. Latest follow-up after the first Phase C runtime-accounting slice reran the in-process thread-scale diagnostic at main commit a88e7906 with QEMU pinned to physical-core logical CPUs 0-3 and SMT logical CPUs 0-7. All rows remained accepted=false: physical 1/2/4 work speedups were 1.000x and 0.999x, and SMT 1/2/4/8 work speedups were 1.000x, 1.001x, and 0.333x. Follow-up after the total-speedup host-summary gate landed reran current main commit f198b099 on the benchmark VM with QEMU pinned to 0-3 and 0-7. The harness now reports total-speedup diagnostics explicitly: physical 1/2/4 work speedups were 1.002x and 1.002x, total speedups were 0.911x and 0.601x; SMT diagnostic 1/2/4/8 work speedups were 1.001x, 0.998x, and 0.333x, total speedups were 0.913x, 0.621x, and 0.200x. Both host-summary gates remain unsatisfied.

Phase B: Per-CPU Runnable Ownership

Land the first bounded per-CPU runnable queue slice. Commit 1a8bf909 replaces the single global scheduler VecDeque with four per-scheduler-CPU FIFO queues under the existing global scheduler lock, centralizes enqueue/requeue/removal helpers, keeps single-owner capability processes on CPU0, prefers local work before bounded stealing, preserves direct IPC preference, and removes stale runnable entries for process/thread exit. Review fixes track live run-queue reservations, reserve all per-CPU queues to that count before publishing a new runnable thread, and release reservations on process/thread exit or pre-publication rollback, keeping timer and unblock requeue paths allocation-free after cross-CPU steals. Verification covered run-spawn, test-smp2-smokes, and controlled benchmark-VM 1/2/4/8-thread diagnostics. The default workload and total-case 64 MiB rows remain accepted=false, so this is structure evidence, not milestone closeout.
Finish PerCpuRunQueue ownership invariants as a documented contract. Completed 2026-05-01 02:13 UTC in docs/architecture/scheduling.md: a live generation-checked ThreadRef has at most one runnable dispatch owner across current slots, per-CPU run queues, and the direct IPC target; migration is a scheduler-lock-contained remove-before-publish transfer; local-first stealing is bounded by the scheduler CPU slots; and live run-queue reservations keep timer, unblock, direct-IPC fallback, steal retry, and steal requeue paths allocation-free.
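The local-first, bounded-steal rule and the single-runnable-owner invariant described above can be illustrated with a minimal sketch. The `ThreadRef`, `RunQueues`, and `SCHED_CPUS` names and shapes here are hypothetical stand-ins, not the kernel's types, and the sketch omits the scheduler lock, reservations, current slots, and the direct IPC target.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a generation-checked thread reference.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef {
    pub id: u32,
    pub generation: u32,
}

/// Assumed number of scheduler CPU slots (the text describes four queues).
pub const SCHED_CPUS: usize = 4;

pub struct RunQueues {
    queues: [VecDeque<ThreadRef>; SCHED_CPUS],
}

impl RunQueues {
    pub fn new() -> Self {
        Self { queues: std::array::from_fn(|_| VecDeque::new()) }
    }

    /// True if any per-CPU queue currently owns `t`.
    pub fn contains(&self, t: ThreadRef) -> bool {
        self.queues.iter().any(|q| q.contains(&t))
    }

    /// Publish `t` on `cpu`. A thread may have at most one runnable owner.
    pub fn enqueue(&mut self, cpu: usize, t: ThreadRef) {
        debug_assert!(!self.contains(t), "thread already has a runnable owner");
        self.queues[cpu].push_back(t);
    }

    /// Prefer local work, then steal from at most SCHED_CPUS - 1 other queues.
    /// The thread is removed from its old queue before the caller owns it.
    pub fn pick(&mut self, cpu: usize) -> Option<ThreadRef> {
        if let Some(t) = self.queues[cpu].pop_front() {
            return Some(t);
        }
        for step in 1..SCHED_CPUS {
            let victim = (cpu + step) % SCHED_CPUS;
            if let Some(t) = self.queues[victim].pop_front() {
                return Some(t);
            }
        }
        None
    }
}
```

The steal loop visits each other slot at most once, so the search cost is bounded by the slot count rather than by the number of runnable threads.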
Split current-thread and runnable ownership from shared process/thread metadata without widening emergency-path allocation. Completed 2026-05-01 04:22 UTC in commit d7221648: Scheduler::processes remains the shared process/thread metadata table, while SchedulerDispatch now owns per-CPU run queues, current and handoff slots, idle slots, the direct IPC target, run-queue reservation count, pending process drops, and pending thread stack releases. The existing global scheduler lock and generation checks are unchanged, and the dispatch split keeps the pre-reserved run-queue capacity model for timer, unblock, direct-IPC fallback, steal retry, and steal requeue paths. Verification passed make fmt-check, cargo build --features qemu, a cached make run-spawn rerun, and make test-smp2-smokes in target/smp2-smokes/20260501T042343Z/. Controlled benchmark-VM timing after merge 56458b12 stayed accepted=false:
| Pinning | Workers | Work Median | Total Median | Work Speedup | Total Speedup |
| --- | ---: | ---: | ---: | ---: | ---: |
| physical `0-3` | 1 | `56275842` | `140953762` | `1.000x` | `1.000x` |
| physical `0-3` | 2 | `56290542` | `153327094` | `1.000x` | `0.919x` |
| physical `0-3` | 4 | `56315094` | `237018874` | `0.999x` | `0.595x` |
| SMT `0-7` | 1 | `56258010` | `140620194` | `1.000x` | `1.000x` |
| SMT `0-7` | 2 | `56313324` | `153367860` | `0.999x` | `0.917x` |
| SMT `0-7` | 4 | `56352472` | `237971426` | `0.998x` | `0.591x` |
| SMT `0-7` | 8 | `169006414` | `727393630` | `0.333x` | `0.193x` |
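The speedup columns in the table above are consistent with dividing the one-worker median by each row's median within the same pinning group. The sketch below reproduces that arithmetic; the function names are hypothetical, and the formula is inferred from the table values rather than taken from the harness source.

```rust
/// Speedup of a row relative to the one-worker baseline: baseline / median.
/// (Inferred from the table values; the harness may compute it differently.)
pub fn speedup(baseline_median: u64, median: u64) -> f64 {
    baseline_median as f64 / median as f64
}

/// Format a speedup the way the table prints it, for example "0.595x".
pub fn format_speedup(s: f64) -> String {
    format!("{:.3}x", s)
}
```

For example, the physical four-worker total row gives `140953762 / 237018874`, which rounds to `0.595x`. A flat work median with a growing total median, as in these rows, means neither phase scaled with added workers.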
Add a bounded timer continuation fast path before a broader scheduler lock split. Completed 2026-05-01 10:29 UTC: user-mode LAPIC timer ticks can continue the current non-idle thread without calling sched::schedule() only when a previous locked timer slow path published a clean hard-work summary, the current CPU is a valid active scheduler slot, no reschedule IPI is pending for that CPU, and the per-CPU one-skip budget is not exhausted. Dirty producers still force at least one locked pass before bypass, but the 2026-05-01 11:40 UTC follow-up lets that pass classify remaining nonempty run queues and handoff-current markers as fairness/protection-only state. Direct IPC targets, deferred termination/drop/stack cleanup, Timer sleeps, and timed cap-enter/Park waiters still keep the hard slow-path bit set; ordinary ring SQEs and indefinite cap wait scans are still serviced by forced slow-path ticks. This is a correctness-first split-prep slice, not a replacement for narrower scheduler metadata locks or accepted thread-scale evidence. Controlled benchmark-VM physical-core 0-3 before/after runs for the initial strict-clean version retained accepted=false: baseline target/thread-scale/timer-fastpath-baseline-main-physical-20260501T102938/ recorded work speedups 0.998x and 0.998x plus total speedups 0.907x and 0.620x; after-change target/thread-scale/timer-fastpath-after-physical-20260501T104700/ recorded work speedups 1.001x and 0.999x plus total speedups 0.909x and 0.602x. Controlled benchmark-VM physical-core 0-3 before/after runs for the fairness-only follow-up stayed accepted=false: baseline target/thread-scale/20260501T120224Z/ recorded work speedups 1.001x and 0.999x plus total speedups 0.913x and 0.587x; after-change target/thread-scale/20260501T120709Z/ recorded work speedups 1.001x and 1.000x plus total speedups 1.125x and 0.828x.
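The fast-path conditions listed above can be read as a single per-tick predicate. The following is a minimal sketch under stated assumptions: the type and field names are hypothetical, the order of the checks is chosen for illustration, and refilling the one-skip budget on the forced slow pass is an assumption about how the budget resets.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SlowReason {
    InvalidCpu,
    NoCurrentOrIdle,
    DirtySummary,
    PendingRescheduleIpi,
    SkipBudgetExhausted,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TimerPath {
    /// Continue the current thread without taking the scheduler lock.
    FastContinue,
    /// Fall back to the locked timer pass, with an attribution reason.
    Slow(SlowReason),
}

/// Hypothetical per-CPU state consulted on each user-mode timer tick.
pub struct CpuTimerState {
    pub active_scheduler_slot: bool,
    pub current_is_non_idle: bool,
    pub hard_work_summary_clean: bool,
    pub reschedule_ipi_pending: bool,
    pub skips_left: u8,
}

pub fn classify_tick(s: &mut CpuTimerState) -> TimerPath {
    if !s.active_scheduler_slot {
        return TimerPath::Slow(SlowReason::InvalidCpu);
    }
    if !s.current_is_non_idle {
        return TimerPath::Slow(SlowReason::NoCurrentOrIdle);
    }
    if !s.hard_work_summary_clean {
        return TimerPath::Slow(SlowReason::DirtySummary);
    }
    if s.reschedule_ipi_pending {
        return TimerPath::Slow(SlowReason::PendingRescheduleIpi);
    }
    if s.skips_left == 0 {
        // Assumption: the forced locked pass refills the one-skip budget.
        s.skips_left = 1;
        return TimerPath::Slow(SlowReason::SkipBudgetExhausted);
    }
    s.skips_left -= 1;
    TimerPath::FastContinue
}
```

With a one-skip budget, a clean CPU alternates between one bypassed tick and one locked pass, which bounds how long queued work can wait behind the fast path.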
Add cross-CPU wake policy for endpoint, timer, park, process wait, and thread join completions. Completed 2026-05-01 03:06 UTC: queued wakeups now target the selected per-scheduler-CPU FIFO owner instead of scanning all idle scheduler CPUs.
Add explicit placement evidence and placement policy for newly runnable same-process worker threads. Completed 2026-05-01 12:37 UTC, refined 2026-05-01 13:20 UTC, and repaired 2026-05-01 14:58 UTC after the blocking-parent benchmark exposed a placement regression. Measure builds emit aggregate and per-phase thread_placement lines with single-owner publish buckets, normal publish buckets, caller-current publish buckets, caller-aware avoid, fallback, and strict-load fallback counts, selected CPU buckets, first-selected CPU buckets, and migration totals/targets for CPU slots 0-3. publish_created_thread() receives the caller thread from ThreadSpawner.create, keeps single-owner processes on CPU0, and avoids the caller’s current CPU only when another active ready scheduler CPU has a strictly lower non-idle dispatch load. On equal load, an active-ready caller CPU wins the tie instead of falling through to CPU0-biased least-loaded scanning; if the caller slot is unknown or ineligible, publication falls back to the least-loaded active scheduler CPU behavior. Timer, unblock, direct-IPC fallback, steal retry, and steal requeue paths keep their existing allocation-free targeting behavior.
The earlier avoid-caller policy passed the old spinning-parent 1-to-2 gate but failed the repaired blocking-parent shape: before the strict-load fix, controlled capOS evidence regressed to 1-to-2 work/total speedups `0.886x`/`0.928x` because children were biased onto the non-caller queue even when the caller CPU had equal load. The repaired benchmark shape uses blocking parent join, 262,144 blocks (16 MiB), and `work_rounds=64`. The matching Linux baseline scales on the selected physical CPU set with 1-to-4 work/total speedups `3.958x`/`3.834x`. Controlled capOS evidence on the same CPU set passed the enforced 1-to-2 work/total gates with `1.828x`/`1.687x`; the unsuppressed 1-to-4 diagnostic recorded `3.029x`/`2.386x`, and scheduler-switch-log-suppressed diagnostics recorded `3.272x`/`2.303x`. Remaining four-worker limits are now scheduler implementation issues, not benchmark-shape excuses: serial switch logging, global `Scheduler` lock contention, total-time exit/join/block/schedule overhead, and the temporary four-owner CPU mask.
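The strict-load placement rule for newly created worker threads can be sketched as follows. This is an illustration only: `CpuSlot`, `choose_cpu`, and the `load` field are hypothetical names, and the real `publish_created_thread()` path also handles reservations, measure counters, and the scheduler lock.

```rust
pub const SCHED_CPUS: usize = 4;

/// Hypothetical view of one scheduler CPU slot at publication time.
#[derive(Clone, Copy, Debug)]
pub struct CpuSlot {
    pub active_ready: bool,
    /// Non-idle dispatch load (current thread plus queued runnable threads).
    pub load: u32,
}

/// Pick the CPU queue that should own a newly runnable thread.
pub fn choose_cpu(
    slots: &[CpuSlot; SCHED_CPUS],
    caller_cpu: Option<usize>,
    single_owner: bool,
) -> Option<usize> {
    // Single-owner capability processes stay on CPU0.
    if single_owner {
        return Some(0);
    }
    // Least-loaded active CPU; ties resolve to the lowest index (CPU0-biased).
    let least_loaded = slots
        .iter()
        .enumerate()
        .filter(|(_, s)| s.active_ready)
        .min_by_key(|(_, s)| s.load)
        .map(|(cpu, _)| cpu);
    match caller_cpu {
        Some(c) if c < SCHED_CPUS && slots[c].active_ready => match least_loaded {
            // Leave the caller's CPU only for a strictly lower load.
            Some(best) if slots[best].load < slots[c].load => Some(best),
            // On equal load the caller's CPU wins the tie.
            _ => Some(c),
        },
        // Unknown or ineligible caller slot: least-loaded fallback.
        _ => least_loaded,
    }
}
```

The equal-load tie going to the caller is what distinguishes this from the earlier avoid-caller policy, which always pushed children to another queue.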
Add bounded reschedule IPI behavior for idle-to-runnable transitions. Completed 2026-05-01 03:06 UTC: queued wakeups target at most one queue-owner CPU, direct IPC targets at most one eligible idle scheduler CPU, and measure builds emit wake scan, eligible idle CPU, target, sent, pending-skip, not-ready-skip, missing-target, and failure counters.
Preserve direct IPC handoff as a scheduling preference without bypassing per-CPU ownership or generation checks. Completed 2026-05-01 03:06 UTC: direct IPC still uses the single preference slot when available and falls back to the normal queued owner path when the target cannot run directly.
Restore eligibility-aware placement for constrained runnable owners. Completed 2026-07-19 02:15 UTC in scheduler-pinned-cap-enter-wake-placement: the common reserved-queue publication path resolves the preferred local slot to CPU0 for processes that retain process-wide endpoint or launch authority, then derives the sleeper-credit floor and WFQ tag against that actual destination. Initial publish, post-block wake, preemption requeue, direct-IPC fallback, and direct-target RetryLater requeue all use the same resolver; wake policy carries the committed slot. Ordered insertion asserts the eligibility invariant without changing ordinary local-first placement, runnable-owner uniqueness, or the bounded steal policy. The legacy WebUI proof is fixed at two CPUs, asserts CPU1 is online with its timer, and requires a bounded proof-scoped marker tying the WebUI pid’s CPU1-preferred publication to resolved CPU0 ownership. Delayed repeated external requests must also progress across multiple five-second idle cap_enter cycles before the proof accepts the full bundle result.
Prove process/thread exit cleanup cannot leave a stale runnable entry on any CPU queue. Completed 2026-05-01 03:14 UTC: process termination, current-process exit, and ThreadControl.exitThread cleanup now assert under the scheduler lock that the exiting process or thread no longer appears in any per-scheduler-CPU FIFO or in the direct IPC target slot. The focused spawn smoke asserts the serial proof markers emitted by the exercised process/thread exit paths.
Rerun make test-thread-scale, make test-smp2-smokes, ordinary smoke, spawn/thread, park, ring, and process-exit focused proofs. Completed 2026-05-01 04:18 UTC: local serial reruns passed normal make test-thread-scale in target/thread-scale/scheduler-phaseb-rerun-local-normal-20260501T034800Z/ and make test-smp2-smokes in target/smp2-smokes/20260501T034414Z/. Controlled benchmark-VM reruns at main commit 87be6e25 pinned QEMU to physical-core logical CPUs 0-3 and SMT logical CPUs 0-7; all rows remained accepted=false, so this closes the Phase B rerun-evidence gate but not the selected in-process speedup milestone.

Phase C: CPU Accounting

Add monotonic runtime charge points when a running thread leaves the CPU at context switch, preemption, blocking syscall, direct IPC handoff, and thread exit. Completed 2026-05-01 05:08 UTC: running intervals are charged with crate::arch::context::monotonic_ns() when a current thread stops running through timer preemption, blocking cap_enter/ParkSpace, thread/process exit, and direct switch or handoff paths that select the next current thread.
Observe blocked runtime stability at unblock without charging non-running time. Completed 2026-05-01 05:08 UTC: unblock paths check the blocked runtime snapshot before making the thread ready.
Track per-thread runtime, virtual runtime seed, context switches, preemptions, voluntary blocks, and migrations. Completed 2026-05-01 05:08 UTC: ThreadCpuAccounting is stored on each Thread record and updated under the scheduler/process lock. Context switch counters increment when a thread is selected, preemptions increment only for timer-driven running-to-ready requeue, voluntary blocks increment for blocking cap_enter and ParkSpace waits, and migrations increment when a thread runs on a different scheduler CPU than its previous run.
Add process/session/service aggregation only after the per-thread record has a single ledger of record. Completed 2026-05-22 13:50 UTC: a per-Process ProcessCpuAccounting ledger sums runtime_ns and a process-level context_switches dispatch count incrementally at the same scheduler/process-lock charge points that update ThreadCpuAccounting, so it captures exited threads’ contributions. Only the always-present (non-measure) per-thread quantities are rolled up; the measure-gated preemptions/voluntary_blocks/migrations counters are intentionally not aggregated so the default-build proof stays meaningful. The kernel emits a sched: process_cpu_accounting pid=... runtime_ns=... context_switches=... line at per-process exit and make run-spawn asserts a nonzero aggregate. Session/service aggregation remains a stretch follow-on.
Add tests or QEMU diagnostics proving runtime increases while running and stops while blocked. Completed 2026-05-01 05:08 UTC: make run-spawn now asserts a compact scheduler proof line that requires nonzero runtime, context switches, preemptions, and voluntary blocks, plus stable blocked and exited runtime observations.
Keep runtime accounting independent of tickless idle by using the monotonic clocksource layer. Completed 2026-05-01 05:08 UTC: normal accounting uses monotonic_ns() and does not read kernel/src/measure.rs cycle counters.
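The Phase C charge points can be summarized in a small model: runtime accrues only between selection and leaving the CPU, so blocked time is never charged. The sketch below uses the field names mentioned above, but the method names, the `LeaveReason` enum, and the private bookkeeping fields are hypothetical, and it ignores the `measure` feature gating and locking.

```rust
#[derive(Clone, Copy, Debug, Default)]
pub struct ThreadCpuAccounting {
    pub runtime_ns: u64,
    pub context_switches: u64,
    pub preemptions: u64,
    pub voluntary_blocks: u64,
    pub migrations: u64,
    // Hypothetical bookkeeping for the sketch.
    last_cpu: Option<usize>,
    running_since_ns: Option<u64>,
}

#[derive(Clone, Copy, Debug)]
pub enum LeaveReason {
    TimerPreemption,
    VoluntaryBlock,
    Exit,
    Handoff,
}

impl ThreadCpuAccounting {
    /// Called when the dispatcher selects this thread on `cpu`.
    pub fn on_selected(&mut self, cpu: usize, now_ns: u64) {
        self.context_switches += 1;
        if matches!(self.last_cpu, Some(prev) if prev != cpu) {
            self.migrations += 1;
        }
        self.last_cpu = Some(cpu);
        self.running_since_ns = Some(now_ns);
    }

    /// Called when the thread stops running; charges only the running interval.
    pub fn on_leave(&mut self, reason: LeaveReason, now_ns: u64) {
        if let Some(start) = self.running_since_ns.take() {
            self.runtime_ns += now_ns.saturating_sub(start);
        }
        match reason {
            LeaveReason::TimerPreemption => self.preemptions += 1,
            LeaveReason::VoluntaryBlock => self.voluntary_blocks += 1,
            LeaveReason::Exit | LeaveReason::Handoff => {}
        }
    }
}
```

Here `now_ns` stands in for a monotonic clocksource read such as `monotonic_ns()`; passing it as a parameter keeps the sketch testable without a kernel clock.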

Phase D: Best-Effort Fair Scheduling

Phase D accepted its Task 6 diagnostic closeout at commit 77caafc0 (2026-05-10 19:39 UTC, docs(scheduler): record phase d thread-scale gate) and closed in docs commit 1a08ec23 (2026-05-10 21:47 UTC, docs(scheduler): close phase d). The first Phase D policy is weighted fair queueing on top of the existing per-thread runtime_ns / virtual_runtime_ns accounting, with a capability-authorized SchedulingPolicyCap for weight and latency-class mutation. The controlled Task 6 benchmark pair passed the harness-enforced 1-to-2 work/total gates; capOS recorded 1-to-4 work/total diagnostics 3.088x / 2.700x at 4 workers versus the prior single-global-queue baseline 1.566x / 1.538x, and that 1-to-4 row was manually accepted for Phase D closeout. The matching Linux pthread baseline on the same host and physical-core logical CPUs 0,1,2,3 recorded 3.974x / 3.850x. EEVDF is now a follow-on policy evaluation, not a Phase D blocker. The design content is in docs/proposals/scheduler-evolution-proposal.md “Phase D first-policy decision”, “Phase D capability surface”, “Phase D migration fairness sketch”, “Phase D test matrix”, and “Phase D overload behavior” sections. The completed implementation plan is archived at docs/backlog/scheduler-evolution.md.

The bullets below retain the closed acceptance gates and the Phase D follow-ons that should be selected explicitly. Phase E SchedulingContext is the next scheduler authority phase, followed by Phase F auto-nohz / SQPOLL / tickless idle; generic full-nohz remains deferred behind those prerequisites.

Choose initial weighted-fair or EEVDF-like policy based on accounting and queue data. Resolved 2026-05-05 19:00 UTC: WFQ first; EEVDF deferred. See docs/proposals/scheduler-evolution-proposal.md “Phase D first-policy decision”.
Add scheduler entity weights and latency class metadata through a capability-authorized policy path, not ambient process fields. Closed by docs/backlog/scheduler-evolution.md Tasks 1-2: SchedulingPolicyCap schema + kernel cap, per-thread weight/latency_class fields, weighted vruntime, and caller-thread cap binding.
Preserve fairness across CPU migration. Implementation tracked in docs/backlog/scheduler-evolution.md Task 4 (vruntime travels with the thread, virtual_finish_ns recomputed at destination enqueue, bounded steal targets the queue whose head has the lowest virtual_finish_ns, matching the local pick rule of taking the front of the ascending per-CPU queue). Closed 2026-05-08 00:53 UTC: invariants made explicit on refresh_virtual_finish_ns_locked and at the steal-insert site; the cfg(feature = "measure")-gated ThreadCpuAccounting.migrations counter moved from the dispatch-time scheduled_measure path to enqueue-time record_placement_spread_migration_locked and record_steal_migration_locked arms; weight-change-while-enqueued contract proved by construction with a debug_assert! reinforcement in Process::refresh_thread_virtual_finish_ns.
Test CPU hogs, short sleepers, direct IPC server/client pairs, multi-process load, and same-process sibling load. Implementation tracked in docs/backlog/scheduler-evolution.md Task 5 (test matrix smokes) and Task 6 (the controlled make test-thread-scale evidence pair: harness-enforced 1-to-2 gates plus a manually accepted 1-to-4 diagnostic closeout row). Closed 2026-05-10 19:46 UTC: the benchmark-VM Task 6 run at commit 76025f0963a4 recorded capOS 1-to-4 work/total diagnostics 3.088x / 2.700x; the 1-to-2 gate stayed green at 1.809x / 1.774x. The matching Linux pthread baseline on the same physical-core logical CPUs 0,1,2,3 recorded 3.974x / 3.850x.
Define overload behavior when runnable entities exceed the selected CPU set or when migration cannot keep up. Resolved at the design level 2026-05-05 19:00 UTC: soft overload uses vruntime ordering (no entity is starved); hard overload defers to Phase F CpuIsolationLease and Phase G RealtimeIsland. See docs/proposals/scheduler-evolution-proposal.md “Phase D overload behavior”.
Phase D follow-on: EEVDF migration. Once the WFQ slice has accepted thread-scale evidence, evaluate replacing the bucketed per-CPU VecDeque with an EEVDF eligibility set (BTreeMap-by-virtual-deadline) plus per-thread request size and lag accounting. The accounting fields, capability surface, and migration contract carry directly; the change is localized to the dispatch ordering structure. Promote to its own design slice if and when selected; do not bundle it into the WFQ first-slice plan.
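The WFQ migration contract described in this phase (vruntime travels with the thread, `virtual_finish_ns` is recomputed at destination enqueue, and a steal targets the queue whose head has the lowest `virtual_finish_ns`) can be sketched as below. The weighting formula, `BASE_WEIGHT`, the `slice_ns` parameter, and all function names are assumptions for illustration; the proposal document remains the source for the actual arithmetic.

```rust
use std::collections::VecDeque;

/// Assumed weight of a default entity; not taken from the source.
pub const BASE_WEIGHT: u64 = 1024;

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Entity {
    pub id: u32,
    pub weight: u64, // must be nonzero
    pub virtual_runtime_ns: u64,
    pub virtual_finish_ns: u64,
}

/// Charge real runtime scaled inversely by weight (assumed formula).
pub fn charge(e: &mut Entity, ran_ns: u64) {
    e.virtual_runtime_ns += ran_ns * BASE_WEIGHT / e.weight;
}

/// Recompute the finish tag from the vruntime the thread carried with it.
pub fn refresh_virtual_finish(e: &mut Entity, slice_ns: u64) {
    e.virtual_finish_ns = e.virtual_runtime_ns + slice_ns * BASE_WEIGHT / e.weight;
}

/// Enqueue at the destination, keeping the queue ascending by finish tag.
pub fn enqueue(queue: &mut VecDeque<Entity>, mut e: Entity, slice_ns: u64) {
    refresh_virtual_finish(&mut e, slice_ns);
    let pos = queue.partition_point(|q| q.virtual_finish_ns <= e.virtual_finish_ns);
    queue.insert(pos, e);
}

/// Choose the steal victim: the other queue whose head finishes earliest.
pub fn steal_victim(queues: &[VecDeque<Entity>], thief: usize) -> Option<usize> {
    queues
        .iter()
        .enumerate()
        .filter(|(cpu, q)| *cpu != thief && !q.is_empty())
        .min_by_key(|(_, q)| q.front().map(|e| e.virtual_finish_ns))
        .map(|(cpu, _)| cpu)
}
```

Because each queue is kept ascending, the steal rule compares only queue heads, which mirrors the local pick rule of taking the front of the per-CPU queue.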

Phase E: SchedulingContext Capability

Phase E policy follow-ups are closed. Local owner-shell logout propagation is recorded in scheduler-phase-e-local-owner-shell-logout-propagation. Endpoint donation/return, timeout/depletion notifications, and the scheduler-observable session lifecycle hook are recorded on main: scheduler-phase-e-endpoint-donation, scheduler-phase-e-timeout-depletion-notifications, and scheduler-session-lifecycle-hook. The donated-context logout policy is also closed as a conservative counted/skipped return-path proof: scheduler-phase-e-session-logout-donated-context-policy. Timeout/depletion notifications now use fixed per-context notification cells allocated at context creation/bootstrap. The ordinary non-donated session-logout stale-context proof is complete through the UserSession.logout() hook. In-flight endpoint donation uses the conservative counted/skipped policy during logout and relies on endpoint RETURN/cancel to finish the in-flight transfer/clear without returning donor budget early. Local owner-shell exit now calls the same UserSession.logout() path on clean REPL exit or terminal-close completion; the shell proof observes the scheduler hook with no bound local shell SchedulingContext, while the focused session-context proof remains the ordinary bound-context stale evidence.

Phase E preflight: retire the transitional CAPOS_SCHED_DISABLE_WFQ=1 / WakePolicy::QueueAny single-global-queue fallback that Phase D kept for one bisect cycle. This is a scheduler-surface cleanup before SchedulingContext claims budget/period authority; do not treat it as an EEVDF blocker. Completed 2026-05-10 22:20 UTC: the source-level opt-out, queue-0 enqueue funnel, and QueueAny wake policy are gone.
Define the first SchedulingContext object shape. Phase E Task 1 adds the minimal schema/control-plane cap shape: SchedulingContextSpec carries budget, period, relative deadline, byte-oriented CPU mask, and overrun policy; SchedulingContextInfo is a read-only snapshot with remainingBudgetNs as derived info-only state; and the kernel/runtime expose an info-only SchedulingContext.info() cap stub for focused grant/discovery and client decode coverage. The cpuMask field is a canonical little-endian bitset: CPU n is bit n % 8 of byte n / 8, empty means no CPUs selected, producers omit trailing zero bytes, and non-empty canonical masks end in a nonzero byte. Dispatcher budget enforcement, replenishment, bind/revoke rules, donation/return, depletion notifications, realtime islands, SQPOLL, and nohz remain deferred.
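The canonical `cpuMask` encoding described above (CPU n is bit n % 8 of byte n / 8, no trailing zero bytes, empty means no CPUs) is small enough to show directly. The helper names below are hypothetical and are not part of the capOS schema or runtime.

```rust
/// Encode a set of CPU indices as a canonical little-endian bitset.
/// The result has no trailing zero bytes; an empty input yields an empty mask.
pub fn encode_cpu_mask(cpus: &[usize]) -> Vec<u8> {
    let mut mask: Vec<u8> = Vec::new();
    for &cpu in cpus {
        let byte = cpu / 8;
        if mask.len() <= byte {
            mask.resize(byte + 1, 0);
        }
        mask[byte] |= 1u8 << (cpu % 8);
    }
    mask
}

/// A mask is canonical if it is empty or ends in a nonzero byte.
pub fn is_canonical(mask: &[u8]) -> bool {
    mask.last().map_or(true, |b| *b != 0)
}

/// Test whether CPU `cpu` is selected; bytes past the end count as zero.
pub fn contains_cpu(mask: &[u8], cpu: usize) -> bool {
    mask.get(cpu / 8)
        .map_or(false, |b| (*b & (1u8 << (cpu % 8))) != 0)
}
```

For example, CPUs 0 through 3 encode as the single byte `0x0f`, while CPU 9 alone encodes as `[0x00, 0x02]`. A consumer can reject non-canonical input such as `[0x01, 0x00]` with `is_canonical` so that equal CPU sets always have equal byte representations.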
Add capability creation/bind/revoke rules and generation identity. The second Phase E control-plane slice keeps info() method id 0 stable, adds same-interface context creation as a bounded result-cap transfer, records at most one caller-thread binding per context generation, and revokes by advancing the context generation and clearing the matching thread metadata binding. Bootstrap grants and created contexts use the same non-wrapping context-id allocator so distinct caps cannot alias the (contextId, generation) binding key. The focused make run-scheduling-context QEMU smoke proves distinct bootstrap identities, create result-cap adoption, bind/revoke, stale-generation calls, release cleanup, and the explicit infoOnlyNoDispatchChange dispatch-effect marker. Stale caps report staleGeneration and cannot mutate scheduler metadata; revoked contexts report revoked. Dispatch selection, WFQ ordering, runtime charging, replenishment, donation/return, timeout/depletion notification, realtime islands, SQPOLL, auto-nohz, and CPU placement enforcement remain future work.
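The generation-identity rules above can be modeled with a `(contextId, generation)` key: a cap is valid only while its generation matches the context record, and revoke advances the generation so older caps fail closed. The types and error names in this sketch are hypothetical, and it deliberately omits the distinct `revoked` result and the scheduler-side thread metadata.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ContextKey {
    pub context_id: u64,
    pub generation: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BindError {
    StaleGeneration,
    AlreadyBound, // hypothetical name for the one-binding-per-generation rule
}

/// Non-wrapping id allocator: ids are never reused, so keys cannot alias.
pub struct ContextIdAllocator {
    next: u64,
}

impl ContextIdAllocator {
    pub fn new() -> Self {
        Self { next: 1 }
    }

    /// Returns None instead of wrapping when the id space is exhausted.
    pub fn alloc(&mut self) -> Option<u64> {
        let id = self.next;
        self.next = self.next.checked_add(1)?;
        Some(id)
    }
}

pub struct ContextRecord {
    key: ContextKey,
    bound_thread: Option<u32>,
}

impl ContextRecord {
    pub fn new(context_id: u64) -> Self {
        Self { key: ContextKey { context_id, generation: 0 }, bound_thread: None }
    }

    pub fn key(&self) -> ContextKey {
        self.key
    }

    pub fn bound_thread(&self) -> Option<u32> {
        self.bound_thread
    }

    fn check(&self, cap: ContextKey) -> Result<(), BindError> {
        if cap == self.key { Ok(()) } else { Err(BindError::StaleGeneration) }
    }

    /// Record at most one caller-thread binding per context generation.
    pub fn bind(&mut self, cap: ContextKey, thread: u32) -> Result<(), BindError> {
        self.check(cap)?;
        if self.bound_thread.is_some() {
            return Err(BindError::AlreadyBound);
        }
        self.bound_thread = Some(thread);
        Ok(())
    }

    /// Revoke by advancing the generation and clearing the binding.
    pub fn revoke(&mut self, cap: ContextKey) -> Result<(), BindError> {
        self.check(cap)?;
        self.key.generation += 1;
        self.bound_thread = None;
        Ok(())
    }
}
```

Every mutating call checks the key first, so a stale cap returns an error before any metadata changes.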
Enforce budget and replenishment in the kernel dispatcher. First Phase E budget enforcement landed 2026-05-11 08:38 UTC: bindCallerThread() now installs a fixed per-thread budget ledger under the scheduler/process locking model, runtime charge decrements the bound context budget at the existing dispatch charge points, runnable selection replenishes elapsed periods without allocation, and exhausted contexts stay queued but resolve to RetryLater until their next period. Deadline-driven accounting closed the previous periodic-tick granularity caveat on 2026-06-04: the ordinary dispatch path arms a sub-tick budget-exhaustion one-shot when the selected thread’s remaining budget would deplete before the next scheduler tick, kernel-mode one-shot fires restore a live periodic timer, nohz re-arm folds the leased thread’s budget deadline into its existing nearest deadline, and nohz budget depletion restores the periodic tick with reason=scheduling-context-budget-throttled. make run-scheduling-context proves visible charge, replenishment to full budget, stale/revoked fail-closed behavior, and a throttled wall-clock window with dispatch_effect=budgetEnforced; the representative 5 ms deadline marker recorded elapsed_since_arm_ns=5474819, overshoot_ns=474819, remaining_after_ns=0, and bounded_charge=true. At that slice’s landing, donation/return, depletion notifications, realtime islands, SQPOLL, auto-nohz, and CPU placement enforcement remained future work.
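The budget ledger behavior described above (charge at dispatch charge points, allocation-free replenishment of elapsed periods at selection, RetryLater while exhausted, and a sub-tick one-shot when the budget would deplete before the next tick) can be sketched as a fixed-size struct. All names here are hypothetical, and the sketch assumes replenishment restores the full budget at each period boundary, which matches the "replenishment to full budget" proof but may simplify the kernel's rules.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Dispatch {
    Run,
    RetryLater,
}

/// Fixed-size per-thread ledger; no operation allocates.
#[derive(Clone, Copy, Debug)]
pub struct BudgetLedger {
    pub budget_ns: u64,
    pub period_ns: u64,
    pub remaining_ns: u64,
    pub period_start_ns: u64,
}

impl BudgetLedger {
    pub fn new(budget_ns: u64, period_ns: u64, now_ns: u64) -> Self {
        Self { budget_ns, period_ns, remaining_ns: budget_ns, period_start_ns: now_ns }
    }

    /// Charge runtime at a dispatch charge point; never goes below zero.
    pub fn charge(&mut self, ran_ns: u64) {
        self.remaining_ns = self.remaining_ns.saturating_sub(ran_ns);
    }

    /// Skip forward over whole elapsed periods and refill the budget.
    pub fn replenish(&mut self, now_ns: u64) {
        let elapsed = now_ns.saturating_sub(self.period_start_ns);
        if self.period_ns > 0 && elapsed >= self.period_ns {
            self.period_start_ns += (elapsed / self.period_ns) * self.period_ns;
            self.remaining_ns = self.budget_ns;
        }
    }

    /// Runnable selection: exhausted contexts stay queued as RetryLater.
    pub fn select(&mut self, now_ns: u64) -> Dispatch {
        self.replenish(now_ns);
        if self.remaining_ns == 0 { Dispatch::RetryLater } else { Dispatch::Run }
    }

    /// Deadline for a sub-tick one-shot, if the budget would run out
    /// before the next periodic scheduler tick.
    pub fn one_shot_deadline(&self, now_ns: u64, next_tick_ns: u64) -> Option<u64> {
        let depletes_at = now_ns.saturating_add(self.remaining_ns);
        (depletes_at < next_tick_ns).then_some(depletes_at)
    }
}
```

Saturating the charge at zero corresponds to the bounded-charge behavior in the proof marker: an overshoot past the armed deadline does not create negative budget that would carry into the next period.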
Add endpoint donation/return semantics for synchronous calls and passive services. Completed 2026-05-11 10:51 UTC: endpoint in-flight call state now carries a bounded internal donation token when a caller with a bound SchedulingContext delivers a synchronous CALL to a receiver thread without its own context. The scheduler charges pre-donation caller runtime before moving the ledger, charges passive-server runtime before returning the ledger, and returns the remaining budget to the caller before waking it when RETURN commits, commits an application exception, or fails with an invalid caller result buffer. RETURN preflight failures keep the in-flight donation intact; delivery/return cancellation paths return or clear the donation without allocating. A donor with an in-flight token is blocked from returning to userspace until the endpoint call returns or is canceled. Nested donation of an already donated context is rejected until stacked return tokens have a dedicated design. The focused make run-scheduling-context smoke now includes a same-process endpoint round trip with endpoint_donation=ok, endpoint_return=ok, endpoint_exception_return=ok, endpoint_invalid_return=ok, and endpoint_nested_rejected=ok, plus an endpoint_donor_block=ok delayed-server cap_enter(0, 0) proof, an endpoint_donor_fast=ok fast-return race proof, and remaining-budget fields for successful RETURN, application-exception RETURN, invalid-result RETURN, nested-donation rejection, donor blocking, and fast donor return. This is synchronous endpoint donation/return only; depletion notifications, realtime islands, SQPOLL, auto-nohz, CPU placement enforcement, and session-logout stale-context coverage remain future work.
Add a scheduler-observable session lifecycle hook from UserSession.logout() into scheduler-owned SchedulingContext stale-marking. The hook covers explicit logout plus the remote DTO gateway logout/connection-teardown paths that already call UserSession.logout(): after the liveness cell flips to logged out, the scheduler scans process/thread metadata for the same session liveness cell, removes non-donated matching bindings from its ledger, and advances the bound context generation as revoked so ordinary old grants become stale. The hook preserves the scheduler as the binding authority and avoids scheduler-lock to context-record-lock inversion by taking one binding under the scheduler lock, dropping that lock, and then marking the context stale through its cleanup token. In-flight endpoint donation bindings are explicitly skipped because returning donor budget before endpoint cancellation would violate the donor-blocking invariant. This hook unblocks focused stale-context proofs: ordinary non-donated logout, donated-context policy, and local owner-shell propagation are now closed by their dedicated task records.
Add timeout/depletion notifications with preallocated emergency-path storage. Completed in the timeout/depletion notification slice: every SchedulingContext owns a fixed notification cell allocated at context creation/bootstrap, with coalescing slots for budget depletion and deadline/timeout, sequence counters, bounded coalesced-event counts, holder identity, donated-holder marking, remaining budget, and next timestamp snapshots. Scheduler charging, timeout/deadline observation, donation-return, and cancellation paths update only that fixed state; they do not allocate, publish result caps, append unbounded queues, or require hard-path logging. SchedulingContext.drainNotifications() exposes typed ok, revoked, and staleGeneration observer results, plus explicitRevoke lifecycle state. The focused make run-scheduling-context smoke proves repeated budget-depletion coalescing, deadline notification, explicit revoke, stale observer labels, and endpoint-donated notification accounting. A pre-armed observer waiter/wakeup path remains a separate follow-up.
Extend stale-context proofs beyond the first revoke/generation contract to process and thread exit. The focused SchedulingContext smoke now proves that a context bound by an exiting thread becomes unbound without minting fresh budget on rebind, while process-exit and explicit process-termination children bind contexts and run the process cleanup path before cap-table release.
Extend stale-context proofs to session logout. Completed for ordinary non-donated contexts at 2026-05-11 17:44 UTC. This remains separate from process/thread exit because logout propagation is owned by the session lifecycle surface, not the scheduler dispatch loop. The focused session-context smoke now binds a SchedulingContext in a session-owned child, calls UserSession.logout(), observes the scheduler hook line, and proves the old cap is stale before budget refresh, caller-thread rebind, result-cap publication, or metadata mutation. Process/thread exit cleanup remains covered by make run-scheduling-context.
Prove donated receiver logout policy. Completed at 2026-05-11 18:19 UTC. Logout keeps the existing conservative counted/skipped behavior for receiver threads holding endpoint-donated SchedulingContext bindings. The focused session-context smoke has a donor call a guest-session receiver, the receiver logs out while holding the donated binding, the scheduler hook reports stale_marked=0 donation_inflight_skipped=1, the donor remains blocked in cap_enter(0, 0) until endpoint RETURN, and the donor context returns bound with reduced remaining budget rather than a refreshed or minted budget. Local owner-shell lifecycle propagation was closed separately by scheduler-phase-e-local-owner-shell-logout-propagation.
Propagate local owner-shell exit to session logout. Completed at 2026-05-11 19:36 UTC. Clean local REPL exit and terminal-close completion now call the held UserSession.logout() before process exit, so the session liveness cell is marked logged out through the same kernel hook used by explicit logout and the remote DTO gateway. The shell smoke asserts the scheduler-observable hook line with stale_marked=0 donation_inflight_skipped=0; ordinary bound SchedulingContext stale behavior remains proven by the focused session-context smoke through the same hook. Process/thread-exit cleanup remains separate and unchanged.
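The fixed notification cell described in the timeout/depletion item above can be sketched as follows. This is a minimal illustration, not the kernel's implementation: the type names (NotificationCell, Slot, DrainResult), the field layout, and the order of the stale-generation and revoked checks are all assumptions made for the example. What it is meant to show is the coalescing property: recording an event overwrites a preallocated slot in place, so repeated depletion events never allocate or grow a queue.

```rust
/// The two coalescing slots named in the text.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum NotificationKind {
    BudgetDepletion = 0,
    DeadlineTimeout = 1,
}

/// One fixed-size slot; field names are hypothetical.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct Slot {
    pub pending: bool,
    pub sequence: u64,            // bumps on every recorded event
    pub coalesced: u32,           // events merged into an already-pending slot
    pub holder: u64,              // hypothetical holder thread id
    pub donated_holder: bool,
    pub remaining_budget_ns: u64,
    pub next_timestamp_ns: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DrainResult {
    Ok([Option<Slot>; 2]),
    Revoked,
    StaleGeneration,
}

/// Allocated once at context creation; never grows afterwards.
pub struct NotificationCell {
    generation: u64,
    revoked: bool,
    slots: [Slot; 2],
}

impl NotificationCell {
    pub fn new(generation: u64) -> Self {
        Self { generation, revoked: false, slots: [Slot::default(); 2] }
    }

    /// Hard-path update: overwrites the slot in place, no allocation.
    pub fn record(
        &mut self,
        kind: NotificationKind,
        holder: u64,
        donated_holder: bool,
        remaining_budget_ns: u64,
        next_timestamp_ns: u64,
    ) {
        let slot = &mut self.slots[kind as usize];
        slot.sequence = slot.sequence.wrapping_add(1);
        if slot.pending {
            // Bounded: the count saturates instead of overflowing.
            slot.coalesced = slot.coalesced.saturating_add(1);
        }
        slot.pending = true;
        slot.holder = holder;
        slot.donated_holder = donated_holder;
        slot.remaining_budget_ns = remaining_budget_ns;
        slot.next_timestamp_ns = next_timestamp_ns;
    }

    pub fn revoke(&mut self) {
        self.revoked = true;
    }

    /// Observer side. Check order (generation, then revoke) is an assumption.
    pub fn drain(&mut self, observer_generation: u64) -> DrainResult {
        if observer_generation != self.generation {
            return DrainResult::StaleGeneration;
        }
        if self.revoked {
            return DrainResult::Revoked;
        }
        let mut out = [None, None];
        for (i, slot) in self.slots.iter_mut().enumerate() {
            if slot.pending {
                out[i] = Some(*slot);
                slot.pending = false;
                slot.coalesced = 0;
            }
        }
        DrainResult::Ok(out)
    }
}
```

In this sketch, three back-to-back depletion events leave one pending slot with the latest snapshot, a sequence of 3, and a coalesced count of 2, which is the shape the repeated budget-depletion coalescing proof checks for.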

Phase F: CPU Isolation Lease and SQPOLL

The Phase E gates and the first Ring/SQPOLL ownership prerequisite are now closed. Dispatch through scheduler-phase-f-auto-nohz-sqpoll only through its own Phase F authority, telemetry, rollback, and nohz/SQPOLL tasks; this backlog entry does not implement Phase F behavior. The concrete ring prerequisite is scheduler-phase-f-one-sq-consumer-ring-ownership, closed on 2026-05-11: ring endpoints now have generation-checked syscall-mode SQ-consumer leases, duplicate future SQPOLL acquisition is rejected while that owner is live, stale owner generations cannot advance SQ head, teardown releases the owner without clearing accepted completions, and bounded SQPOLL admission metadata exists without starting a poller. The first executable Phase F child task, scheduler-phase-f-cpu-isolation-lease-scaffold, closed on 2026-05-12 12:02 UTC. It is limited to CpuIsolationLease authority, activation preflight telemetry, and rollback scaffolding. It does not enable SQPOLL, automatic nohz, tick suppression, automatic CPU isolation, or generic full-nohz behavior. The second executable child task, scheduler-phase-f-nohz-activation-telemetry, closed on 2026-05-12 14:18 UTC. It turns the disabled preflight into observable activation/deactivation and rollback decisions while still leaving tick suppression, SQPOLL, automatic CPU isolation, and generic full-nohz disabled. The housekeeping/deferred-work placement child closed on 2026-05-12 18:36 UTC by scheduler-phase-f-housekeeping-deferred-work-placement: the scheduler now records an explicit online housekeeping CPU placement input, selected housekeeping mask, deferred cleanup/timer/network/IRQ/accounting placement or rejection labels, and bounded revoke, process-exit, service-replacement, and session-logout cleanup placement while ticks remain periodic. 
The bounded SQPOLL ring-mode child closed on 2026-05-12 20:29 UTC by scheduler-phase-f-sqpoll-ring-mode-bounded-poller: ring endpoints now transition explicitly through syscall, SQPOLL starting, running, sleeping, stopping, and rollback modes; a kernelSqpoll CpuIsolationLease admits one bounded periodic-tick poller for the caller thread’s ring; producer wakeups use NEED_WAKEUP; stale SQ owners fail before SQ-head consumption; and poller stop/revoke preserves accepted CQEs while releasing SQ ownership. Actual tick suppression is blocked until the SQPOLL progress path no longer depends on periodic scheduler ticks. The clockevent/deadline substrate child closed on 2026-05-12 23:07 UTC by scheduler-phase-f-clockevent-deadline-substrate: normal QEMU/x86_64 monotonic_ns() is backed by the calibrated TSC rather than TICK_COUNT, the periodic LAPIC tick disciplines the TSC epoch while nohz is disabled, Timer.sleep, finite cap_enter, and park waiters store absolute monotonic deadlines, and the LAPIC clockevent backend can program a bounded one-shot deadline and restore periodic mode. The substrate’s firing precision is now proven, not only its programming: the scheduler-lapic-oneshot-subtick-firing-precision child (closed 2026-06-04 03:26 UTC, commit 49b36129) arms a TICK_NS/2 one-shot over the live periodic timer during boot and measures the actual countdown-to-fire instant, asserting via make run-scheduling-context that it fires sub-tick (~5 ms for a 5 ms request, well under the 10 ms tick) with the current-count correctly reset to the sub-tick value – ruling out the suspected “INITIAL_COUNT write does not reset the running countdown” root cause – and that the kernel-mode-fire periodic restore leaves a live timer (no lost-timer hang). Automatic nohz, tick suppression, SQPOLL nohz, generic full-nohz, and production realtime admission remain disabled. 
Known pre-existing gate flake (independent of the firing-precision proof, which passed in 100% of measured boots): the scheduling-context-smoke budget-timing proof exited early in ~20% of boots on both main and this branch under host load – its wall-clock budget-throttle assertions are sensitive to host scheduling jitter. Run make run-scheduling-context on an otherwise-idle host until the budget proof is stabilized (own follow-up); it is orthogonal to the clockevent firing assertions. A second substrate prerequisite surfaced 2026-06-04 from scheduler-deadline-driven-budget-accounting’s Attempt 2: even with the LAPIC one-shot firing precisely sub-tick, the monotonic clocksource discipline floored a sub-tick interval to a full tick. A boot probe measured a real 5.0 ms interval advancing monotonic_ns by 10.0 ms after one discipline_clocksource_tick step (monotonic_delta_ns=10000020 for real_ns=5000118, floored=true), because discipline_clocksource_tick took max(tsc_interpolated, epoch + TICK_NS) on every fire. That was the real cause of that task’s Attempt 1 “9.85 ms” – not the LAPIC firing (fixed) and not the ordinary-path timer-ISR rechecks (which provably no-op when no nohz/idle window is active). The prerequisite scheduler-monotonic-clocksource-subtick-discipline closed it (2026-06-04): discipline_clocksource_tick now trusts the TSC interpolation at sub-tick granularity, falling back to the TICK_NS floor only when the interpolated advance is below MIN_DISCIPLINED_ADVANCE_NS (TICK_NS / 8) so a degenerate (stalled/backward/mis-calibrated-slow) TSC still keeps a minimum forward rate; the tick-derived fallback is unchanged. A boot proof (context::qemu_clocksource_subtick_discipline_proof, emitted on make run-scheduling-context) runs one real TICK_NS / 2 discipline step and asserts monotonic_ns() tracked the sub-tick delta – measured monotonic_delta_ns=5055612 for real_ns=5000474 (floored=false, subtick_tracked=true). 
Deadline-driven budget accounting and generic full-nohz can now observe a sub-tick deadline through the accounting clock. The SQPOLL nohz-progress child closed on 2026-05-13 00:06 UTC by scheduler-phase-f-sqpoll-nohz-progress: cap_enter now has a bounded current-thread SQPOLL service entry for producer wakes and syscall kicks that borrows the SQPOLL owner lease, charges the admitted accounting target, and reports non-periodic progress evidence while ordinary periodic service remains active. Automatic policy-service nohz issuance and production realtime admission remain future work; generic SQPOLL nohz for explicitly leased caller-thread rings landed in the later Step 14 slice. The tickless-idle child closed on 2026-05-23 09:12 UTC by scheduler-tickless-idle-step6: the CPL0 idle loop now admits an idle-only tickless window when no non-idle work is runnable, no nohz lease is active, no local deferred cleanup is pending, no cap-enter polling dependency is present, and the LAPIC one-shot clockevent plus monotonic clocksource are available. The periodic tick is restored before non-idle dispatch and on rollback. Legacy cap-enter polling surfaces, including the terminal shell path, remain periodic until they gain explicit deadline or housekeeping placement.
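The clocksource discipline change described above can be sketched as two pure functions. This is an illustration under stated assumptions, not the kernel code: the function names discipline_floored and discipline_subtick are hypothetical, TICK_NS is taken as 10 ms from the "10 ms tick" wording, and the real discipline_clocksource_tick also maintains calibration state that is omitted here.

```rust
/// Periodic tick length; 10 ms per the text.
pub const TICK_NS: u64 = 10_000_000;
/// Below this interpolated advance the TSC is treated as degenerate.
pub const MIN_DISCIPLINED_ADVANCE_NS: u64 = TICK_NS / 8;

/// Earlier behavior: every fire advances the clock by at least one full
/// tick, so a real 5 ms interval is reported as 10 ms.
pub fn discipline_floored(epoch_ns: u64, tsc_interpolated_ns: u64) -> u64 {
    tsc_interpolated_ns.max(epoch_ns + TICK_NS)
}

/// Later behavior: trust the TSC interpolation at sub-tick granularity and
/// keep the TICK_NS floor only for a stalled, backward, or slow TSC.
pub fn discipline_subtick(epoch_ns: u64, tsc_interpolated_ns: u64) -> u64 {
    // saturating_sub maps a backward TSC reading to an advance of zero.
    let advance = tsc_interpolated_ns.saturating_sub(epoch_ns);
    if advance < MIN_DISCIPLINED_ADVANCE_NS {
        epoch_ns + TICK_NS
    } else {
        tsc_interpolated_ns
    }
}
```

With an epoch of 0 and an interpolated reading of 5000118 ns, the floored variant returns 10000000 while the sub-tick variant returns 5000118, matching the direction of the floored=true and floored=false boot-probe results quoted above.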

Define CpuIsolationLease authority separately from CPU-time budget. Completed 2026-05-12 12:02 UTC by scheduler-phase-f-cpu-isolation-lease-scaffold.
Add scheduler activation proof for housekeeping, deferred cleanup, timers, networking, IRQ affinity, live accounting target, one-SQ-consumer state, and revocation latency. The scaffold reports blocked eligibility and leaves tick suppression, nohz, and SQPOLL disabled.
Enforce one live SQ consumer per ring before SQPOLL. Completed 2026-05-11 by scheduler-phase-f-one-sq-consumer-ring-ownership.
Integrate SQPOLL ring mode only after this ownership prerequisite and scheduler-phase-f-housekeeping-deferred-work-placement have landed. Completed 2026-05-12 20:29 UTC by scheduler-phase-f-sqpoll-ring-mode-bounded-poller.
Add lease revocation on explicit revoke, process exit, service replacement, and session close. Completed by the focused make test-scheduler-cpu-isolation-lease proof.
Add nohz activation/deactivation telemetry. Completed 2026-05-12 14:18 UTC by scheduler-phase-f-nohz-activation-telemetry. The proof records active-candidate rejection, stale/revoked rollback, ready housekeeping CPUs under -smp 4, exactly-one-runnable target CPU evidence, deferred cleanup/timer/network/IRQ labels, valid accounting targets, explicit clocksource/accounting readiness or refusal, live syscall SQ-consumer state, revocation-latency policy, and guardrails confirming that tick suppression, SQPOLL, and full-nohz stay disabled.
Assign housekeeping and deferred-work placement before behavior. Completed 2026-05-12 18:36 UTC by scheduler-phase-f-housekeeping-deferred-work-placement. The proof keeps ticks periodic and leaves SQPOLL, automatic CPU isolation, and generic full-nohz disabled.
Add bounded SQPOLL ring mode only after housekeeping/deferred-work placement. Completed 2026-05-12 20:29 UTC by scheduler-phase-f-sqpoll-ring-mode-bounded-poller. The proof covers one poller owner, bounded polling, stale queue-owner rejection, wake/sleep ordering, and teardown without losing completions while periodic ticks remain active.
Add clockevent/deadline substrate before automatic nohz activation. Completed 2026-05-12 23:07 UTC by scheduler-phase-f-clockevent-deadline-substrate. It split clocksource reads from clockevent programming, added a one-shot/restore timer backend, and converted tick-count waiters to absolute monotonic deadlines while ordinary scheduling remains periodic.
Add SQPOLL nohz progress that does not depend on periodic scheduler ticks. Completed 2026-05-13 00:06 UTC by scheduler-phase-f-sqpoll-nohz-progress. The proof preserves the one-SQ-consumer, NEED_WAKEUP, bounded polling, stale-owner rollback, and teardown/completion invariants while keeping periodic fallback service active.
Add automatic nohz activation only after placement, bounded SQPOLL behavior, the deadline substrate, and non-periodic SQPOLL progress. Completed 2026-05-14 09:01 UTC by scheduler-phase-f-auto-nohz-activation. The CpuIsolationLease activation preflight now performs real per-CPU periodic-tick suppression for the narrow single-runnable-entity window (namedRing = none compute lease on the preflight CPU): it masks the periodic LAPIC tick and arms a bounded one-shot deadline at min(nearest pending timer wakeup, now + max revocation latency). Network polling and IRQ affinity stay read-only fail-closed admission gates – any ring-coupled or device-owning mode keeps the conservative refusal. Every disqualifying change (stale lease generation, a second runnable entity, stealable sibling work, a local deferred-cleanup dependency, a target-CPU mismatch, or a one-shot backend that can no longer arm a deadline) rolls the CPU back to the periodic tick first. The make test-scheduler-cpu-isolation-lease proof asserts the activation and rollback log lines. Generic full-nohz and the broader SQPOLL-driven nohz state machine landed in later slices.
Measured suppressed-tick proof on the lease path (harness-hardening). Completed 2026-06-02 19:53 UTC by scheduler-cpu-isolation-measured-suppressed-tick-proof. Closes the review-identified honesty gap that the lease path proved suppression only by the tick_suppression=active periodic_tick=masked marker plus a no-hang progress loop, never that periodic timer interrupts actually stopped arriving. The kernel now counts genuine periodic LAPIC fires per CPU (account_timer_fire in the timer ISR increments only when neither the lease-backed nor idle tick-suppression bit is set, so the one-shot replacement is never miscounted), snapshots the count at activation, and on rollback emits cpu-isolation: nohz suppressed-ticks cpu=<n> window_ns=<w> expected_periodic=<e> actual_periodic=<a> suppressed=<e-a>; a bounded post-rollback cpu-isolation: nohz restored-rate line proves the periodic rate returns. The demo holds a childless compute lease on CPU 0 across a ~150 ms masked window, then a busy restore window; the harness asserts a masked window with actual_periodic near zero (expected_periodic >= 10, suppressed >= 8) and a restored window with actual_periodic tracking expected_periodic (>= 8). No activation behavior changed; the mask/one-shot mechanism is untouched. A durable ticks_suppressed{cpu,mode} telemetry field on a monitoring/status surface remains future work.
Timeout-based auto-revoke primitive on CpuIsolationLease. Landed via scheduler-cpu-isolation-lease-timeout-auto-revoke. Adds leaseLifetimeNs @6 to CpuIsolationLeaseSpec (0 = no expiry, preserving every existing producer); read_spec clamps to a one-hour ceiling and rejects a non-zero lifetime below maxRevocationLatencyNs (invalidSpec). A lease records expires_at_ns at creation; the first observation past the deadline auto-revokes through the existing generation-advancing cleanup (reason=lease-expired, registry unregister, SQPOLL stop, rollback_nohz_for_lease) and every subsequent info/activationPreflight/revoke reports staleGeneration. The nohz activation record carries the lifetime deadline so a tickless CPU under a lease that crosses its lifetime rolls back at the next timer/IPI recheck (lease-lifetime-expired disqualifier), bounded by maxRevocationLatencyNs. make test-scheduler-cpu-isolation-lease asserts the expiry release line, the post-expiry staleGeneration, and the invalidSpec rejection.
Enable tickless idle only when there is no runnable non-idle work and no cap-enter polling dependency. Completed 2026-05-23 09:12 UTC by scheduler-tickless-idle-step6. The idle path masks the periodic LAPIC tick only for true idle, arms a bounded one-shot at the nearest Timer/ParkSpace deadline or 100 ms housekeeping floor, and restores periodic mode before ordinary work. Ready-but-budget-throttled SchedulingContext retry windows remain periodic so budget replenishment and deadline notification timing stay on the existing scheduler accounting path.
Keep automatic full-nohz behind the completed one-SQ-consumer ownership prerequisite and the narrower CpuIsolationLease telemetry/rollback proof. Generic full-nohz is not the first Phase F implementation task.
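The lease-lifetime rules in the timeout auto-revoke item above (0 means no expiry, a one-hour ceiling, rejection of a non-zero lifetime below the revocation latency, and stale results after the first observation past the deadline) can be sketched as below. The names normalize_lifetime, Lease, and observe are hypothetical, the order of the reject and clamp steps is an assumption, and the real cleanup (registry unregister, SQPOLL stop, nohz rollback) is reduced to a generation bump.

```rust
/// One-hour ceiling named in the text, in nanoseconds.
pub const LEASE_LIFETIME_CEILING_NS: u64 = 3_600 * 1_000_000_000;

#[derive(Debug, PartialEq, Eq)]
pub enum LeaseError {
    InvalidSpec,
    StaleGeneration,
}

/// Spec-read step: 0 passes through as "no expiry".
pub fn normalize_lifetime(
    lease_lifetime_ns: u64,
    max_revocation_latency_ns: u64,
) -> Result<u64, LeaseError> {
    if lease_lifetime_ns == 0 {
        return Ok(0);
    }
    if lease_lifetime_ns < max_revocation_latency_ns {
        // The lease could expire before a revocation is able to complete.
        return Err(LeaseError::InvalidSpec);
    }
    Ok(lease_lifetime_ns.min(LEASE_LIFETIME_CEILING_NS))
}

pub struct Lease {
    generation: u64,
    expires_at_ns: Option<u64>,
}

impl Lease {
    /// Records the absolute expiry deadline at creation time.
    pub fn create(
        now_ns: u64,
        lease_lifetime_ns: u64,
        max_revocation_latency_ns: u64,
    ) -> Result<Self, LeaseError> {
        let lifetime = normalize_lifetime(lease_lifetime_ns, max_revocation_latency_ns)?;
        let expires_at_ns = if lifetime == 0 {
            None
        } else {
            Some(now_ns.saturating_add(lifetime))
        };
        Ok(Self { generation: 0, expires_at_ns })
    }

    /// Any info/preflight/revoke style call. The first observation at or
    /// past the deadline advances the generation (the boundary choice is an
    /// assumption), so this and every later call with the old generation
    /// reports StaleGeneration.
    pub fn observe(&mut self, now_ns: u64, caller_generation: u64) -> Result<(), LeaseError> {
        if caller_generation != self.generation {
            return Err(LeaseError::StaleGeneration);
        }
        if let Some(deadline) = self.expires_at_ns {
            if now_ns >= deadline {
                self.generation += 1; // stands in for the real cleanup path
                return Err(LeaseError::StaleGeneration);
            }
        }
        Ok(())
    }
}
```

Expiry here is lazy: nothing fires at the deadline itself, and the lease is revoked by whichever observation first crosses it. That is why the text bounds tickless rollback by maxRevocationLatencyNs through the timer/IPI recheck rather than by the lifetime alone.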

Phase F.5: Full-SMP Hardware Scalability

This phase is the planning slot for the next visible SMP milestone when the project is ready to answer whether capOS uses 16/32-core machines well. It does not replace the current Installable System selected milestone and should not be dispatched as a QEMU-only benchmark cleanup. QEMU remains regression infrastructure; the primary performance record should come from direct capOS execution on a dedicated high-core perf runner or bare-metal/cloud-bare-metal machine.

Replace temporary four-owner scheduler assumptions with dynamic CPU topology: discovered scheduler CPU set, physical-core versus SMT sibling labeling, APIC id mapping, per-CPU allocation sizing, and boot/status output that makes the selected CPU set auditable.
Add or select the APIC backend needed for high-core machines. xAPIC MMIO can remain the current low-core path, but x2APIC selection is the likely larger-APIC-id follow-up from docs/research/x2apic-and-virtualization.md.
Shrink scheduler shared-state serialization. Local pick/requeue should avoid one global scheduler-lock critical section where possible, while shared process/thread metadata, blocking waiters, direct IPC handoff, timers/deadlines, and cleanup keep explicit ownership and rollback rules.
Add topology-aware placement and observable migration policy. The record should distinguish local enqueue, cross-core wake, steal, SMT sibling placement, failed placement, reschedule IPI, and TLB-shootdown costs.
Build the hardware benchmark profile from existing benchmark proposals: static map/reduce, uneven dynamic task pool, barrier phase loop, independent processes, same-process threads, and one capability-call/service-bound workload. Each workload reports work-window and total-time rows at 1/2/4/8/16/32 workers when hardware exists.
Record matching native Linux rows on the same machine, plus capOS raw artifacts with source commit, toolchain, topology, frequency/isolation policy, run count, warmup policy, verifier output, medians, variance, speedup, efficiency, and scheduler counters.
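The median, speedup, and efficiency columns requested for the benchmark rows above can be derived as in the following sketch. It assumes the conventional definitions (speedup is the 1-worker median divided by the N-worker median, efficiency is speedup divided by N); the ScalingRow type and the function names are hypothetical and do not describe an existing capOS tool.

```rust
/// One output row of the scaling table.
#[derive(Debug, Clone, PartialEq)]
pub struct ScalingRow {
    pub workers: u32,
    pub median_ns: f64,
    pub speedup: f64,
    pub efficiency: f64,
}

/// Median of the measured runs; None when there are no samples.
pub fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    })
}

/// Builds rows from (worker count, per-run times in ns) pairs.
/// Returns an empty table when no usable 1-worker baseline exists.
pub fn scaling_rows(runs: &[(u32, Vec<f64>)]) -> Vec<ScalingRow> {
    let baseline = match runs
        .iter()
        .find(|(workers, _)| *workers == 1)
        .and_then(|(_, samples)| median(samples))
    {
        Some(b) if b > 0.0 => b,
        _ => return Vec::new(),
    };
    runs.iter()
        .filter_map(|(workers, samples)| {
            let m = median(samples)?;
            if *workers == 0 || m <= 0.0 {
                return None; // skip rows that cannot yield a ratio
            }
            let speedup = baseline / m;
            Some(ScalingRow {
                workers: *workers,
                median_ns: m,
                speedup,
                efficiency: speedup / f64::from(*workers),
            })
        })
        .collect()
}
```

Running the same function over the work-window times and the total times gives the two row families the profile asks for, and the same computation applies unchanged to the native Linux rows so the two tables stay directly comparable.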

Phase G: Realtime Islands

Define RealtimeIsland admission inputs: scheduling contexts, memory reservations, device/IRQ reservations, communication paths, CPU leases, and overrun policy.
Add a small local-audio or synthetic periodic-control proof before robotics or provider workloads.
Prove no allocation, blocking endpoint call, paging, or logging on the admitted realtime path.
Record deadline misses and overrun handling as observable output.

Phase H: Policy Service

Define a privileged scheduler policy service interface for admission, budget/profile updates, CPU lease grant/revoke, and diagnostics.
Keep kernel fallback scheduling independent of policy-service liveness.
Add manifest/config hooks for default profiles without making policy changes require kernel rebuilds.
Add operator diagnostics that explain why a thread or island was denied, throttled, migrated, or revoked.
Define how stateful task/job graph assignment metadata maps into scheduler policy inputs: graph priority to weight/latency class, graph deadline to request freshness or admission input, graph budget to SchedulingContext reference, and graph queue to policy-service placement. The graph coordinator must not mint CPU authority by itself.
Design the user-space policy-service AutoNoHz placement heuristic for ordinary threads that appear capable of utilizing a full CPU core – landed at commit 14d852f3 (2026-07-14 13:23 UTC): the AutoNoHz Decomposition below carried it through the Step 17 policy-daemon capstone (scheduler-autonohz-production-policy-daemon, demos/autonohz-policy-daemon). The policy service synthesizes the “thread appears capable of utilizing a full CPU core” decision from a future monitoring/status surface and issues a bounded CpuIsolationLease against a pre-authorized account or session CPU pool. The lease is placement only; it does not mint CPU-time authority. Required bounds on every auto-issued lease: lifetime shorter than admin-issued leases by default and renewable only by re-observing the signal; max_revocation_latency_ns bounded by NoHzEligibility; accounting target a live SchedulingContext or coarse ResourceLedger; CPU set restricted to the operator-declared auto-claim pool; priority-aware fairness preemption that terminates the lease (not just rolls back tick suppression) on arrival of an equal-or-higher priority runnable entity. Prerequisites: (a) a timeout-based auto-revoke primitive on CpuIsolationLease – LANDED 2026-05-30 as leaseLifetimeNs @6 (0 = no expiry) with enforced first-observation auto-revoke and a lease-lifetime-expired nohz rollback; the auto-claim placement lease can now be granted with a bounded lifetime. The bounded renew half LANDED as CpuIsolationLease.renew @4, which pushes the deadline forward by at most the original lifetime while keeping the lease’s identity / accounting / nohz state; the renewal-by-re-observation heuristic (when to call renew) landed with the policy daemon’s per-target renewal budget; (b) the monitoring/status surface that exports per-thread saturation observation – LANDED 2026-05-30 as the non-measure per-thread saturation status surface. 
voluntary_blocks and preemptions were promoted out of cfg(feature = "measure"), an always-built runnable_accumulated_ns runnable-but-not-running accumulator was added (stamped at the run-queue enqueue chokepoint, accumulated at selection), and all three plus runtime_ns are exported through SchedulingPolicyCap.snapshot @2 (proof make test-thread-fairness: hog voluntary_blocks=0 with live preemptions/runnable_ns). migrations stays measure-gated. This read-side surface exports raw cumulative counters only; windowing and the saturation decision remain policy-service work; (c) the pool-grant authority shape that lets an operator pre-authorize an account’s auto-claim pool. Declared-pool descriptor LANDED 2026-05-30: the CpuIsolationLeaseSpec carries poolId @7 (0 = the implicit default pool over every scheduler CPU), the kernel seeds a fixed declared-pool registry (CpuIsolationPoolDescriptor: default pool 0 plus one declared non-default pool 1 over a single CPU), and read_spec admits a lease only when its poolId is declared and its allowedCpuMask is a subset of the pool’s CPU mask – echoing the admitting pool’s id/mask through CpuIsolationLeaseInfo (proof make test-scheduler-cpu-isolation-lease: nondefault_pool=invalidSpec (undeclared id), declared_pool=ok admitted_pool_id=1 admitted_pool_cpu_mask_subset=true, declared_pool_mask_violation=invalidSpec, default_pool_id=0). Manifest-sourced pool table LANDED 2026-05-30: the declared-pool registry is sourced from the boot manifest SystemConfig.cpuIsolationPools @14 (each entry a CpuIsolationPoolDescriptor), with the in-kernel constant as the fail-closed default when the manifest omits/empties the list; the kernel validates each entry fail-closed at boot (canonical CPU mask subset of the scheduler mask, default pool 0 synthesized if omitted, duplicate ids rejected) and emits cpu-isolation: declared-pools source=manifest count=3 ... 
(proof make test-scheduler-cpu-isolation-lease; kernel-default fallback proven by cargo test-config decode/empty assertions). Per-pool live-lease capacity bound LANDED 2026-05-31: CpuIsolationPoolDescriptor carries poolMaxLeases @2 (0 = unbounded); a non-zero value caps the number of simultaneously live (non-revoked, current-generation) leases the kernel admits against that pool at create-time, counted from the existing LEASE_REGISTRY after prune_dead, rejecting an over-capacity create fail-closed resourceExhausted. The manifest bounds pool 2 at poolMaxLeases: 2; the proof admits two live leases, refuses a third (cpu-isolation: pool-capacity-rejected admitted_pool_id=2 live_leases=2 pool_max_leases=2 result=resourceExhausted, pool_capacity_exceeded=resourceExhausted), and reclaims after a revoke (pool_capacity_reclaimed=ok) – live-count, not cumulative. This is the count+reject mechanism the per-account N policy keys onto. Account identity + per-account N LANDED 2026-05-31: CpuIsolationLeaseSpec carries accountId @8 :UInt64 (0 = unattributed, caller-asserted and inert until counted, echoed read-only through CpuIsolationLeaseInfo.accountId @6) and CpuIsolationPoolDescriptor carries poolMaxLeasesPerAccount @3 :UInt32 (0 = unbounded per account). After the pool-wide check, register counts the requesting account’s live entries (admitted_pool_id AND account_id both matching) against the per-account bound and rejects an over-bound create fail-closed resourceExhausted (0 account or 0 bound skips the gate). 
The manifest bounds pool 2 at poolMaxLeasesPerAccount: 1; the proof admits one account-7 lease, refuses a second account-7 create (cpu-isolation: account-capacity-rejected admitted_pool_id=2 account_id=7 account_live_leases=1 pool_max_leases_per_account=1 result=resourceExhausted, account_capacity_exceeded=resourceExhausted), admits a different account-9 lease on that CPU (account_capacity_other_account=ok – per-account, not pool-wide), and reclaims after revoking account-7 (account_capacity_reclaimed=ok). The account id is caller-asserted, not yet authenticated. Bootstrap pool-grant authentication LANDED 2026-05-31: CpuIsolationPoolGrant (schema/capos.capnp, source cpu_isolation_pool_grant, kernel kernel/src/cap/cpu_isolation_pool_grant.rs) introduced a bootstrap-staged grant binding one authenticated account to one declared pool. createLease stamps the bound account/pool onto the minted lease, overriding any caller-asserted accountId/poolId, and reuses the exact lease-create admission path (cpu_isolation::create_lease_for_caller), so the per-account bound is unforgeable: a holder can no longer assert another account to evade poolMaxLeasesPerAccount. The initial proof used one account-7/pool-2 grant; the current manifest-sourced proof below exercises multiple seeded grants. Manifest-declared multi-account grant table LANDED 2026-06-01: the grant binding is now operator-declared via SystemConfig.cpuIsolationPoolGrants (schema/capos.capnp, decoded in capos-config, seeded at boot by cpu_isolation_pool_grant::seed_pool_grants after seed_declared_pools), mirroring the manifest-sourced cpuIsolationPools table; the cpu_isolation_pool_grant / cpu_isolation_pool_grant_secondary sources stage seeded binding index 0 / 1, so a manifest can pre-authorize multiple distinct (account, pool) grants, each staged as its own bootstrap cap. 
An absent/empty list falls back to one in-kernel binding at index 0: account 7 bound to preferred pool 1 when active, otherwise account 7 bound to synthesized default pool 0, so manifest-sourced pool tables that omit pool 1 still stage a usable default grant. Proof make test-scheduler-cpu-isolation-pool-grant now boots a two-entry grant table (account 5/pool 1, account 8/pool 2), holds both grant caps, and proves each stamps its OWN bound account (pool-grant: create ok bound=A stamped_account_id=5 ... / bound=B stamped_account_id=8 ...) with the per-account bound still enforced fail-closed under the manifest-sourced path; boot evidence cpu-isolation: pool-grants source=manifest count=2. Fallback proof make test-scheduler-cpu-isolation-pool-grant-default boots a manifest-sourced pool table that declares pool 2 and omits pool 1 plus an empty grant list; the kernel stages one default grant as (account 7, pool 0) and the smoke proves it can mint a stamped lease. Runtime grant minting landed (CpuIsolationGrantMinter): one cap mints a fresh CpuIsolationPoolGrant for an operator-chosen (account, pool) at call time, bounded by the declared SystemConfig.cpuIsolationGrantMinterAllowlist (an out-of-allowlist mint is refused unauthorized, so it is never an ambient grant-any authority; the minted grant reuses the same unforgeable createLease admission path). The same test-scheduler-cpu-isolation-pool-grant smoke now also mints a grant for the allowed (account 6, pool 2), proves its createLease stamps account 6 and stays bounded by the per-account gate, and proves an out-of-allowlist (account 99, pool 2) mint is refused; boot evidence cpu-isolation: grant-minter-allowlist source=manifest count=1. 
Grant-revocation lifecycle landed (CpuIsolationGrantMinter.revokeGrant): a runtime-minted grant gets a revocable (grantId, generation) identity; revokeGrant(grantId) advances the grant generation so a stale grant handle’s createLease fails staleGeneration, and cascades to every live lease minted through it – reusing the landed fairness-termination cleanup (reason=grant-revoked, periodic-tick rollback, registry unregister) so the per-pool/per-account live-lease capacity frees immediately and a fresh grant is admitted into the reclaimed slot. Double-revoke is alreadyRevoked and an unknown grantId is unknownGrant, both fail-closed. The same test-scheduler-cpu-isolation-pool-grant smoke proves the full lifecycle. This closes Track C (prerequisite (c)) – operator grant authority is now mint + revoke complete. Detailed design in docs/proposals/tickless-realtime-scheduling-proposal.md “Policy-Service Userstories: AutoNoHz Placement for Compute-Capable Threads”.
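The declared-pool admission gates narrated above (declared pool id, CPU mask subset, pool-wide live-lease bound, per-account live-lease bound) can be summarized in one function. This is a simplified sketch with hypothetical Rust names; it ignores generations, pruning of dead registry entries, and grant stamping, and it assumes the gates run in the order the text lists them.

```rust
#[derive(Clone, Copy)]
pub struct PoolDescriptor {
    pub pool_id: u32,
    pub cpu_mask: u64,
    pub pool_max_leases: u32,             // 0 = unbounded
    pub pool_max_leases_per_account: u32, // 0 = unbounded per account
}

/// A live (non-revoked, current-generation) lease in the registry.
#[derive(Clone, Copy)]
pub struct LiveLease {
    pub pool_id: u32,
    pub account_id: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AdmitError {
    InvalidSpec,
    ResourceExhausted,
}

pub fn admit(
    pools: &[PoolDescriptor],
    live: &[LiveLease],
    pool_id: u32,
    account_id: u64, // 0 = unattributed
    allowed_cpu_mask: u64,
) -> Result<(), AdmitError> {
    // Gate 1: the pool must be declared.
    let pool = pools
        .iter()
        .find(|p| p.pool_id == pool_id)
        .ok_or(AdmitError::InvalidSpec)?;
    // Gate 2: the requested CPUs must be a subset of the pool's CPUs.
    if allowed_cpu_mask & !pool.cpu_mask != 0 {
        return Err(AdmitError::InvalidSpec);
    }
    // Gate 3: pool-wide bound on live leases (live count, not cumulative).
    let pool_live = live.iter().filter(|l| l.pool_id == pool_id).count();
    if pool.pool_max_leases != 0 && pool_live >= pool.pool_max_leases as usize {
        return Err(AdmitError::ResourceExhausted);
    }
    // Gate 4: per-account bound; skipped for account 0 or a 0 bound.
    if account_id != 0 && pool.pool_max_leases_per_account != 0 {
        let account_live = live
            .iter()
            .filter(|l| l.pool_id == pool_id && l.account_id == account_id)
            .count();
        if account_live >= pool.pool_max_leases_per_account as usize {
            return Err(AdmitError::ResourceExhausted);
        }
    }
    Ok(())
}
```

Because both bounds count live entries, removing a lease from the live set is enough to reopen the slot, which mirrors the pool_capacity_reclaimed and account_capacity_reclaimed proof lines. The sketch takes account_id as a plain argument; in the text the pool grant stamps it so a caller cannot assert another account.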

AutoNoHz Decomposition: Roadmap to Full Auto-NoHz

The status bullet above narrates what landed. This subsection is the discrete dispatchable decomposition from the current landed state to full operator-driven auto-nohz, so the path is written as concrete slices rather than “future work” prose. Grounding: the proposal’s “Policy-Service Userstories: AutoNoHz Placement”, “Bounds the policy service must enforce”, “Telemetry Requirements”, and Implementation Sequence steps 7/14/17.

Landed substrate (not repeated below): the narrow manual per-CPU LAPIC tick-mask for the single-runnable compute window and the SQPOLL-coupled window, tickless idle, prerequisite (a) leaseLifetimeNs @6 timeout auto-revoke, prerequisite (b) the SchedulingPolicyCap.snapshot @2 saturation observation surface, and prerequisite (c) pool-grant authority now mint + revoke complete (the manifest-declared multi-account cpuIsolationPoolGrants @15 table, runtime grant minting through CpuIsolationGrantMinter, and the grant-revocation lifecycle that cascades to minted leases). Fairness lease termination (Track D) and a measured suppressed-tick proof have also landed, as have network-poll and IRQ-affinity housekeeping routing, kernel-side generic full-nohz admission for ordinary budgeted compute threads, and generic SQPOLL nohz admission for explicitly leased caller-thread rings. The reusable multi-account policy daemon landed at commit 14d852f3 (2026-07-14 13:23 UTC; demos/autonohz-policy-daemon, Step 17 capstone below). What the name “auto nohz” still oversells today: the daemon manages threads it spawns (no cross-process target discovery – SchedulingPolicyCap.snapshot is per-calling-thread), and broader userspace-poller/device-queue issuance remains future work.

Conflict-domain note: every kernel slice here shares resource:scheduler-cpu-isolation and writes kernel/src/cap/cpu_isolation* or kernel/src/sched.rs, so they serialize against each other – dispatch the chain head first; the rest convert from this list into loopyard task records as their depends_on closes. Slices marked ready have a task record in loopyard; the rest stay here until their prerequisite lands.

Next increment (decomposed 2026-06-04 00:18 UTC; updated 2026-06-07 after generic SQPOLL nohz landed): Track C, Track D, and the measured suppressed-tick proof are all landed, and the ordinary-thread and SQPOLL-ring kernel admission leaves are now done. Records in loopyard capture: scheduler-cpu-isolation-lease-renewal-on-reobservation (renewal residual), scheduler-nohz-irq-affinity-housekeeping-routing, scheduler-nohz-network-poll-housekeeping-routing, scheduler-deadline-driven-budget-accounting, and scheduler-generic-full-nohz-arbitrary-threads as done. The operator-driven AutoNoHz capstone – the reusable policy daemon (scheduler-autonohz-production-policy-daemon) – landed at commit 14d852f3 (2026-07-14 13:23 UTC). These scheduler CPU-isolation slices serialize against each other on resource:scheduler-cpu-isolation but are parallel-safe against the in-flight Phase C network-stack lane, so the scheduler lane stays runnable whenever Phase C 7c holds the kernel cap/ surface.

Track C – complete operator grant authority (prerequisite (c) residual):

scheduler-cpu-isolation-runtime-grant-minting – behavior, normal, LANDED 2026-06-02 22:24 UTC. One cap (CpuIsolationGrantMinter) mints a fresh CpuIsolationPoolGrant for an operator-chosen (account, pool) at call time, bounded by the declared SystemConfig.cpuIsolationGrantMinterAllowlist (an out-of-allowlist pair is refused unauthorized), instead of only the boot-seeded table. The minted grant reuses the same unforgeable createLease admission path. Proof make test-scheduler-cpu-isolation-pool-grant. depends_on: manifest-multi-account grant table (landed).
scheduler-cpu-isolation-grant-revocation-lifecycle – behavior, normal, LANDED 2026-06-03 17:11 UTC. CpuIsolationGrantMinter.revokeGrant revokes a runtime-minted grant by advancing its (grantId, generation) so later createLease through the stale handle fails staleGeneration and mints nothing; revocation cascades to every live lease minted through that grant, driving the landed fairness-termination cleanup (reason=grant-revoked, periodic-tick rollback, registry unregister) once per tagged lease so per-pool/per-account capacity frees immediately (a fresh grant’s lease is admitted into the reclaimed slot in the proof). Double-revoke is alreadyRevoked, unknown grantId is unknownGrant, seeded grants stay un-revocable. Closes Track C. Proof make test-scheduler-cpu-isolation-pool-grant. depends_on: scheduler-cpu-isolation-runtime-grant-minting (landed), scheduler-cpu-isolation-priority-aware-lease-termination (landed).
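
The generation-based revocation described in the two Track C slices above can be modelled on the host as follows. This is a minimal sketch, not the kernel code: `GrantTable`, `GrantHandle`, the field names, and the `terminated` log are hypothetical stand-ins, the allowlist check and the un-revocable seeded grants are omitted, and only the error names (staleGeneration, alreadyRevoked, unknownGrant) come from the text.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum GrantError {
    StaleGeneration,
    AlreadyRevoked,
    UnknownGrant,
}

/// Caller-held handle: names a grant and the generation it was minted at.
#[derive(Clone, Copy, Debug)]
pub struct GrantHandle {
    pub grant_id: u32,
    pub generation: u64,
}

struct GrantState {
    generation: u64,
    revoked: bool,
    live_leases: Vec<u32>, // lease ids minted through this grant
}

#[derive(Default)]
pub struct GrantTable {
    grants: HashMap<u32, GrantState>,
    next_grant_id: u32,
    next_lease_id: u32,
    /// Lease ids cleaned up by revocation, in cleanup order (hypothetical log).
    pub terminated: Vec<u32>,
}

impl GrantTable {
    pub fn mint(&mut self) -> GrantHandle {
        let grant_id = self.next_grant_id;
        self.next_grant_id += 1;
        let state = GrantState { generation: 0, revoked: false, live_leases: Vec::new() };
        self.grants.insert(grant_id, state);
        GrantHandle { grant_id, generation: 0 }
    }

    /// Admit a lease only if the handle still carries the current generation.
    pub fn create_lease(&mut self, h: GrantHandle) -> Result<u32, GrantError> {
        let g = self.grants.get_mut(&h.grant_id).ok_or(GrantError::UnknownGrant)?;
        if g.generation != h.generation {
            return Err(GrantError::StaleGeneration);
        }
        let lease_id = self.next_lease_id;
        self.next_lease_id += 1;
        g.live_leases.push(lease_id);
        Ok(lease_id)
    }

    /// Advance the generation and cascade to every live lease, once per lease.
    /// Returns how many leases were cleaned up.
    pub fn revoke(&mut self, grant_id: u32) -> Result<usize, GrantError> {
        let g = self.grants.get_mut(&grant_id).ok_or(GrantError::UnknownGrant)?;
        if g.revoked {
            return Err(GrantError::AlreadyRevoked);
        }
        g.revoked = true;
        g.generation += 1;
        let cascaded: Vec<u32> = g.live_leases.drain(..).collect();
        let count = cascaded.len();
        self.terminated.extend(cascaded);
        Ok(count)
    }
}
```

Keeping the generation on the kernel-side record rather than in the handle is what makes the stale handle fail closed: the holder cannot refresh it, and a fresh grant gets a new identity.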

Track D – fairness preemption (proposal fairness_preemption):

scheduler-cpu-isolation-priority-aware-lease-termination – behavior, normal, LANDED 2026-06-02 21:17 UTC. On arrival of an equal-or-higher policy-priority runnable on the leased CPU when no other CPU authorized by both the admitted pool and the lease allowedCpuMask is eligible, the kernel now terminates (revokes) the lease itself at the existing nohz rollback site (fairness-preempted ... result=lease-terminated), not just restores the periodic tick, bounded by maxRevocationLatencyNs. The recheck compares the static WFQ policy priority (latency_class, weight) of the arriving entity against the captured leased thread; a strictly-lower arrival or an eligible sibling CPU inside both masks keeps the existing tick-restore-only behavior. The termination runs the same generation-advancing cleanup leaseLifetimeNs expiry uses (reason=fairness-preempted) immediately after the scheduler restores the periodic tick, so a subsequent info/revoke reports staleGeneration and placement/account capacity is freed without waiting for the holder’s next cap call. Proven in make test-scheduler-cpu-isolation-lease (default pool 0 with allowedCpuMask=0x01: an equal-priority sibling terminates and capacity is reclaimed, a strictly-lower sibling restores only). Out: no re-placement onto an eligible sibling CPU (the “no sibling eligible” condition is recorded; actual migration is generic-full-nohz work). depends_on: auto-nohz-activation (landed).
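
The arrival recheck above reduces to a small decision function. The sketch below is an assumption-laden model: `PolicyPriority`, `FairnessOutcome`, `on_arrival`, and the `eligible_mask` input are hypothetical names, and it assumes that a larger `(latency_class, weight)` tuple means higher policy priority, which the text does not state.

```rust
/// Static WFQ policy priority. Derived ordering compares `latency_class`
/// first, then `weight`; "larger is higher priority" is an assumption.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct PolicyPriority {
    pub latency_class: u8,
    pub weight: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum FairnessOutcome {
    /// Existing behavior: restore the periodic tick, keep the lease.
    TickRestoreOnly,
    /// New behavior: restore the tick, then terminate the lease.
    LeaseTerminated,
}

/// Decide what happens when a runnable arrives on the leased CPU.
/// `eligible_mask` stands in for whatever runtime eligibility the kernel
/// checks beyond mask membership. `leased_cpu` must be below 64.
pub fn on_arrival(
    arriving: PolicyPriority,
    leased: PolicyPriority,
    leased_cpu: u32,
    pool_mask: u64,
    allowed_cpu_mask: u64,
    eligible_mask: u64,
) -> FairnessOutcome {
    debug_assert!(leased_cpu < 64);
    // A sibling counts only if both the pool and the lease mask authorize it.
    let siblings = pool_mask & allowed_cpu_mask & eligible_mask & !(1u64 << leased_cpu);
    if arriving < leased || siblings != 0 {
        FairnessOutcome::TickRestoreOnly
    } else {
        FairnessOutcome::LeaseTerminated
    }
}
```

With `allowed_cpu_mask = 0x01` no sibling can qualify, so an equal-priority arrival terminates the lease and a strictly lower one only restores the tick, matching the two cases the proof exercises.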

Lease lifetime renewal (proposal lifetime_ns renewal residual):

scheduler-cpu-isolation-lease-renewal-on-reobservation – behavior, normal, landed. CpuIsolationLease.renew @4 pushes expires_at_ns forward to now + leaseLifetimeNs (clamped to the same one-hour ceiling read_spec enforces), keeping the same (leaseId, generation), accounting binding, and nohz activation state. Callable only before expiry: a revoked, auto-revoked, or past-deadline lease stays stale (staleGeneration) and is not resurrected, and an unbounded leaseLifetimeNs = 0 (or factory) lease reports notRenewable. The renewed deadline is propagated to a tickless CPU’s nohz activation record (renew_nohz_lifetime_deadline_for_lease) so the lease-lifetime-expired disqualifier no longer rolls it back at the old deadline. CpuIsolationLeaseInfo.expiresAtNs echoes the deadline read-only. The kernel primitive the policy service uses to renew an auto-issued lease by re-observing the saturation signal; the re-observation heuristic landed with the Step 17 policy daemon. Proof make test-scheduler-cpu-isolation-lease. depends_on: timeout-auto-revoke (landed).
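
The renewal rules above can be sketched as a host-side model. `Lease`, `RenewError`, and `MAX_LEASE_LIFETIME_NS` are hypothetical names, and two details are assumptions the text does not settle: a revoked lease reports staleGeneration before the unbounded check, and `now == expires_at_ns` counts as past the deadline.

```rust
/// One-hour ceiling, standing in for the clamp that read_spec enforces.
pub const MAX_LEASE_LIFETIME_NS: u64 = 3_600 * 1_000_000_000;

#[derive(Debug, PartialEq, Eq)]
pub enum RenewError {
    StaleGeneration,
    NotRenewable,
}

pub struct Lease {
    pub lease_lifetime_ns: u64, // 0 means unbounded
    pub expires_at_ns: u64,
    pub revoked: bool,
}

impl Lease {
    /// Push the deadline to `now + lifetime` (clamped), without touching the
    /// lease identity. Returns the new deadline.
    pub fn renew(&mut self, now_ns: u64) -> Result<u64, RenewError> {
        if self.revoked {
            return Err(RenewError::StaleGeneration);
        }
        if self.lease_lifetime_ns == 0 {
            return Err(RenewError::NotRenewable);
        }
        if now_ns >= self.expires_at_ns {
            // Past-deadline leases stay stale and are not resurrected.
            return Err(RenewError::StaleGeneration);
        }
        let lifetime = self.lease_lifetime_ns.min(MAX_LEASE_LIFETIME_NS);
        self.expires_at_ns = now_ns.saturating_add(lifetime);
        Ok(self.expires_at_ns)
    }
}
```

The deadline is measured from the renewal time rather than added to the old deadline, so repeated renewals cannot bank lifetime beyond one `leaseLifetimeNs` window.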

Honesty / telemetry (proposal Telemetry ticks_suppressed{cpu,mode}):

scheduler-cpu-isolation-measured-suppressed-tick-proof – harness-hardening, normal, LANDED 2026-06-02 19:53 UTC (scheduler-cpu-isolation-measured-suppressed-tick-proof). A kernel expected-vs-actual periodic-tick counter (account_timer_fire, counted only when no tick-suppression bit is set) over a bounded nohz window is asserted in make test-scheduler-cpu-isolation-lease (cpu-isolation: nohz suppressed-ticks ... plus a restored-rate line), so the proof shows the periodic tick actually stopped firing, not only that the mask write was issued and the CPU made progress. Closed the review-identified honesty gap. A durable ticks_suppressed{cpu,mode} telemetry field on a monitoring/status surface remains future work. depends_on: auto-nohz-activation (landed).
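
The expected-vs-actual comparison above amounts to simple counter arithmetic. The sketch below is an illustration only: `TickWindow`, `expected_ticks`, and `suppressed_ticks` are hypothetical names, and the periodic rate is passed in as a plain period rather than read from the timer configuration.

```rust
pub struct TickWindow {
    /// Periodic fires counted inside the window.
    pub fired: u64,
    /// Non-zero while any tick-suppression bit is set for this CPU.
    pub suppression_bits: u32,
}

impl TickWindow {
    /// Count a fire only while no suppression bit is set, so fires that
    /// arrive during suppression never inflate the periodic count.
    pub fn account_timer_fire(&mut self) {
        if self.suppression_bits == 0 {
            self.fired += 1;
        }
    }
}

/// Ticks a fully periodic CPU would have taken over the window.
pub fn expected_ticks(window_ns: u64, period_ns: u64) -> u64 {
    window_ns.checked_div(period_ns).unwrap_or(0)
}

/// Ticks that did not fire; the proof wants this to be large for the nohz
/// window and near zero once the periodic rate is restored.
pub fn suppressed_ticks(window_ns: u64, period_ns: u64, actual: u64) -> u64 {
    expected_ticks(window_ns, period_ns).saturating_sub(actual)
}
```

Comparing against an expected count is what distinguishes "the tick stopped" from "the mask write was issued": a mask write that had no effect would leave `actual` close to `expected`.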

Step 7 – network poll housekeeping/deadline routing:

scheduler-nohz-network-poll-housekeeping-routing – behavior, normal, landed 2026-06-04 04:48 UTC. The in-kernel virtio-net poll (virtio::poll_scheduler) now routes off a lease-isolated (tickless) CPU: it consults sched::current_cpu_lease_nohz_active() and skips, emitting a bounded cpu-isolation: network-poll routed ... result=skipped-on-isolated-cpu record, while the always-ticking housekeeping CPU the admission requires keeps the poll progressing. The network_polling admission gate flips from the hard rejected-periodic-network-polling-not-routed-to-housekeeping refusal to a housekeeping-conditioned routed-periodic-network-polling-to-housekeeping-cpu admit (eligibility accepts the routed- prefix), and fails closed (rejected-network-polling-no-housekeeping-cpu-to-relocate) when no housekeeping CPU exists. The admitted named_ring=None lease carries the routed label tick-suppressed; the CallerThread compute-with-ring lease’s network refusal is removed but it stays ForcedPeriodic because IRQ affinity routing is the separate slice below. Proof make test-scheduler-cpu-isolation-lease; regression make run-net. depends_on: housekeeping-deferred-work-placement (landed), auto-nohz-activation (landed).
scheduler-nohz-irq-affinity-housekeeping-routing – behavior, normal, landed. The activation path reroutes an opting-in leased CPU’s legacy IO-APIC redirection-entry destinations onto the selected housekeeping CPU (mask-before-reprogram + read-back, restored on rollback/revoke) before admitting tick suppression, and keeps the conservative rejected-irq-affinity-not-routed-to-housekeeping refusal for a ring-coupled IRQ dependency that cannot be safely rerouted. Proof make test-scheduler-cpu-isolation-lease (irq-affinity ok ... routed_admitted=true restored_on_revoke=true residual_forced_periodic=true); DDF test-interrupt-grant / test-devicemmio-grant stay green. Scoped to a quiescent housekeeping destination: under the in-kernel KVM irqchip, reprogramming an IO-APIC redirection-entry destination onto an actively-scheduling CPU stalls that CPU’s forward progress, so the live reroute is gated to a focused proof lease (reroute sentinel maxRevocationLatencyNs) whose destination is idle. A general busy-destination reroute remains future work behind a destination-quiescence gate or a non-KVM-irqchip delivery backend. depends_on: auto-nohz-activation (landed).
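
The network_polling gate change in the Step 7 network-poll slice can be sketched with the labels quoted there. The function names are hypothetical, and the boolean input stands in for `sched::current_cpu_lease_nohz_active()`; only the three label strings and the `routed-` prefix rule come from the text.

```rust
/// Admission label for a lease that depends on periodic network polling.
/// Routing is only possible when a housekeeping CPU exists.
pub fn network_polling_gate(housekeeping_cpu: Option<u32>) -> &'static str {
    match housekeeping_cpu {
        Some(_) => "routed-periodic-network-polling-to-housekeeping-cpu",
        // Fail closed: nothing to relocate the poll onto.
        None => "rejected-network-polling-no-housekeeping-cpu-to-relocate",
    }
}

/// Eligibility accepts any label carrying the `routed-` prefix.
pub fn gate_admits(label: &str) -> bool {
    label.starts_with("routed-")
}

/// The poll itself skips a lease-isolated (tickless) CPU and keeps running
/// on the always-ticking housekeeping CPU.
pub fn should_poll_on_this_cpu(cpu_lease_nohz_active: bool) -> bool {
    !cpu_lease_nohz_active
}
```

The pre-change label, rejected-periodic-network-polling-not-routed-to-housekeeping, fails the same prefix test, so the old refusal and the new no-housekeeping refusal are both rejected by one rule.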

Step 14 – generic SQPOLL nohz for arbitrary rings:

scheduler-generic-sqpoll-nohz-arbitrary-rings – behavior, normal, done 2026-06-07. The SQPOLL nohz state machine now admits explicitly leased caller-thread rings when the SQPOLL worker is live, the ring is running/sleeping with a non-stale owner, exactly one SQ consumer is present, and producer wake/deadline rollback are bounded. The focused make test-scheduler-generic-sqpoll-nohz proof drives eligible entry, producer wake, SQPOLL service, rollback, and stale-owner rejection. Broader AutoUserspacePoller userspace-poller/device-queue issuance remains future policy-service work. depends_on: auto-nohz-sqpoll (landed), scheduler-nohz-network-poll-housekeeping-routing.

Generic full-nohz for arbitrary threads (the kernel half of “auto”):

scheduler-generic-full-nohz-arbitrary-threads – behavior, normal, done 2026-06-06. Ordinary budgeted compute threads can now enter full-nohz through an explicit SchedulingContext-targeted CpuIsolationLease when the single-runnable, budget-deadline, housekeeping, network-poll, IRQ-affinity, timer, lifetime, and rollback gates all pass. Missing thread budget, multiple runnable work, revoked or expired leases, unrouted dependencies, and no-housekeeping cases still fail closed. Issuance is still policy-service future work; this is only the kernel admission half. depends_on: scheduler-cpu-isolation-priority-aware-lease-termination, scheduler-nohz-network-poll-housekeeping-routing, scheduler-nohz-irq-affinity-housekeeping-routing.
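
The "all gates pass, otherwise fail closed" shape of the admission above can be sketched as a first-failure check. The `Gate` enum and `admit` are hypothetical; the gate list follows the order in which the text names them, and that evaluation order is an assumption.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Gate {
    SingleRunnable,
    BudgetDeadline,
    Housekeeping,
    NetworkPoll,
    IrqAffinity,
    Timer,
    Lifetime,
    Rollback,
}

/// Gates in the order the slice description lists them (assumed order).
pub const GATES: [Gate; 8] = [
    Gate::SingleRunnable,
    Gate::BudgetDeadline,
    Gate::Housekeeping,
    Gate::NetworkPoll,
    Gate::IrqAffinity,
    Gate::Timer,
    Gate::Lifetime,
    Gate::Rollback,
];

/// Admit full-nohz only when every gate passes; report the first gate that
/// fails so the refusal is attributable.
pub fn admit(passes: impl Fn(Gate) -> bool) -> Result<(), Gate> {
    for gate in GATES {
        if !passes(gate) {
            return Err(gate);
        }
    }
    Ok(())
}
```

For example, a thread without a budget would be refused at `Gate::BudgetDeadline` in this model even if every later gate would pass.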

Step 17 – user-space AutoNoHz policy service (capstone):

scheduler-autonohz-policy-service-saturation-local-proof – behavior, normal, done 2026-06-07. A userspace AutoNoHz policy-service smoke now holds an operator-declared CpuIsolationPoolGrant, consumes SchedulingPolicyCap.snapshot @2 runtime / runnable / voluntary-block / preemption counters, denies a voluntarily blocking worker, issues a bounded full-nohz lease only after a local saturation window, renews only after re-observing saturation, and proves stopped-renewal expiry leaves fallback periodic scheduling intact. The proof records the grant-stamped account/pool and the single allowed CPU mask that the kernel admitted. depends_on: scheduler-cpu-isolation-runtime-grant-minting, scheduler-cpu-isolation-lease-renewal-on-reobservation, scheduler-cpu-isolation-priority-aware-lease-termination.
scheduler-autonohz-production-policy-daemon – behavior, normal, landed at commit 14d852f3 (2026-07-14 13:23 UTC). The reusable policy daemon (demos/autonohz-policy-daemon) replaced the fixed single-process proof: manifest-declared operator policy (initConfig.init.autonohzPolicy, the generic CueValue plumbing – no schema change) declares per-target grants, saturation window/smoothing profiles, thresholds, lease bounds, and renewal budgets for multiple (account, pool) pairs; the daemon rejects invalid profiles fail-closed, denies a voluntarily blocking target, issues bounded full-nohz leases through two operator-declared grants concurrently, renews only on re-observed saturation, stops renewing at the budget (kernel lifetime expiry fail-closed), revokes explicitly when the signal subsides, and emits bounded one-line decision records. Proof make test-scheduler-autonohz-policy-service (rewritten harness asserts the full decision lifecycle plus the kernel activation/renew/release markers). Cross-process target discovery stays future work: SchedulingPolicyCap.snapshot is per-calling-thread, so the daemon manages threads it spawns. Profile names and grant-cap names use record-safe grammars, invalid entries receive collision-free diagnostic identities, and the daemon’s finite lease-lifetime ceiling is 119 seconds. A stopped-renewal worker remains command-idle under its bounded lease while the daemon polls kernel state; a pre-deadline kernel termination is cleaned up immediately as an ended lease, while an at-or-after-deadline record does not guess the kernel’s release cause. Every terminal path releases its CPU from the userspace candidate ledger. A policy run may therefore finish cleanly with concurrent=false; the QEMU harness owns the capstone fixture’s two-live-lease assertion and pairs its deadline observation with the kernel’s authoritative lease-expired marker. 
The command-idle stopped-renewal target can remain active while a second lease converges; safety when suppression overlaps relies on the generic scheduler mid-block wake repair recorded immediately below, closed by scheduler-nohz-concurrent-suppression-userspace-corruption at commits 4d64b160 and 35d05df4. depends_on: scheduler-autonohz-policy-service-saturation-local-proof.
scheduler-nohz-concurrent-suppression-userspace-corruption – behavior, high, done 2026-07-14 07:57 UTC. Root cause of the intermittent userspace corruption first seen with two concurrently tick-suppressed CPUs during policy-daemon proof development: a mid-block wake race in the scheduler, generic SMP and not nohz-specific. A thread blocking in cap_enter publishes Blocked (wake-visible) in one scheduler-lock critical section but keeps executing kernel code as current[cpu] until capos_block_current_syscall re-acquires the lock and saves the fresh frame; a cross-CPU wake landing in that unlocked gap wrote the cap_enter return value through Thread.saved_context, which still pointed at the PREVIOUS suspension’s frame. When that previous frame was a timer-preemption frame, its rax slot aliases the in-flight syscall frame’s saved-RBX slot (both frames sit at fixed offsets from the per-thread kernel stack top), so the wake’s CQE count – typically 1 – became the resumed thread’s callee-saved RBX and produced the observed wild write (cr2=0x1 inside capnp arena Vec::push). Double-dispatch through the premature wake enqueue was already blocked by the dispatch-side is_thread_current_on_any_cpu RetryLater candidate gate, which is why the corruption presented as pure userspace state damage. Concurrent nohz suppression only amplified the window: the deferred remote rollback leaves the target CPU tickless with an armed one-shot that can land inside a sibling’s block gap, and its first restored tick runs the wake scan in a burst right after the revoke. Fix: wake return values for a thread still current on some CPU are parked in a per-CPU deferred_mid_block_wake_return record and applied by capos_block_current_syscall immediately after the fresh context save (dropped at schedule()’s save for the tick-path donation-block case, where the thread resumes mid-user-code and its rax is live user state), plus an always-on double-dispatch panic canary in set_current_thread_locked. 
The exact original fault shape is not reproducible from committed trees (the committed daemon proof’s stealable-sibling-runnable-work rollbacks end each suppression before the sibling activates; a 12-run busy-spin-park campaign at daemon commit 20bd86fb showed zero overlapping suppression windows), so the empirical evidence is a repro-only kernel detector marking every cross-CPU wake that catches its target mid-block – each such hit is a stale-frame write in the unfixed kernel – exercised by the policy-daemon smoke, alongside the analytic frame-offset proof above. Follow-up: if a multi-process shape later sustains two concurrent suppressions, add a standing two-concurrent-suppression regression smoke over it.
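
The mid-block wake fix described above can be modelled on the host. This is a simplified sketch, not the kernel code: `Frame`, `Thread`, `Sched`, and the method names are hypothetical, the scheduler lock is elided, and the rax/RBX slot aliasing is not modelled. Only the idea of parking the return value in a per-CPU `deferred_mid_block_wake_return` record and applying it after the fresh save comes from the text.

```rust
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct Frame {
    pub rax: u64,
    pub rbx: u64,
}

pub struct Thread {
    /// Frame from the most recent completed context save. While the thread
    /// is mid-block this still belongs to the previous suspension.
    pub saved: Frame,
    /// `Some(cpu)` while the thread is still executing kernel code there.
    pub current_on: Option<usize>,
}

pub struct Sched {
    /// One parked wake return value per CPU.
    pub deferred_mid_block_wake_return: Vec<Option<u64>>,
}

impl Sched {
    pub fn new(cpus: usize) -> Self {
        Self { deferred_mid_block_wake_return: vec![None; cpus] }
    }

    /// Deliver a wake return value. Writing through `saved` is only safe once
    /// the thread is no longer current anywhere; otherwise park the value.
    pub fn wake(&mut self, t: &mut Thread, ret: u64) {
        match t.current_on {
            Some(cpu) => self.deferred_mid_block_wake_return[cpu] = Some(ret),
            None => t.saved.rax = ret,
        }
    }

    /// The blocking path: save the fresh frame, then apply any parked value.
    pub fn block_current_save(&mut self, t: &mut Thread, fresh: Frame) {
        let cpu = t.current_on.take().expect("thread must be current on a CPU");
        t.saved = fresh;
        if let Some(ret) = self.deferred_mid_block_wake_return[cpu].take() {
            t.saved.rax = ret;
        }
    }
}
```

In the unfixed shape, `wake` would have written `ret` into the stale frame unconditionally; here the stale frame is untouched and the value lands only in the freshly saved one.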

Independent hardening (makes auto-nohz budget-safe):

scheduler-deadline-driven-budget-accounting – behavior, normal, done 2026-06-04. Charge SchedulingContext budget at monotonic-deadline granularity rather than per-periodic-tick so an auto-nohz thread cannot overshoot its budget by a full tick quantum while the tick is masked. Closes the “enforcement remains periodic-tick granularity” caveat that auto-nohz made load-bearing; the task record is scheduler-deadline-driven-budget-accounting. depends_on: Phase E budget enforcement (landed), scheduler-lapic-oneshot-subtick-firing-precision (done), scheduler-monotonic-clocksource-subtick-discipline (done).
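
The difference between tick-granularity and deadline-granularity charging can be shown with a little arithmetic. The sketch assumes periodic ticks land on multiples of the tick period; `Budget` and both function names are hypothetical, although `last_started_ns` mirrors a field the backlog mentions elsewhere.

```rust
pub struct Budget {
    pub remaining_ns: u64,
    pub last_started_ns: u64,
}

impl Budget {
    /// Monotonic time at which the budget runs out if the thread keeps
    /// running; a deadline-driven scheduler arms its one-shot here.
    pub fn exhaustion_deadline_ns(&self) -> u64 {
        self.last_started_ns.saturating_add(self.remaining_ns)
    }

    /// Charge the time actually run up to `now_ns`. Returns true once the
    /// budget is exhausted.
    pub fn charge(&mut self, now_ns: u64) -> bool {
        let ran = now_ns.saturating_sub(self.last_started_ns);
        self.remaining_ns = self.remaining_ns.saturating_sub(ran);
        self.last_started_ns = now_ns;
        self.remaining_ns == 0
    }
}

/// Overshoot when enforcement only happens on the next periodic tick after
/// the budget actually ran out. Ticks are assumed at multiples of the period.
pub fn tick_granularity_overshoot_ns(start_ns: u64, remaining_ns: u64, tick_period_ns: u64) -> u64 {
    assert!(tick_period_ns > 0, "tick period must be non-zero");
    let exhausted_at = start_ns + remaining_ns;
    let next_tick = (exhausted_at + tick_period_ns - 1) / tick_period_ns * tick_period_ns;
    next_tick - exhausted_at
}
```

With a 2.5 ms budget and a 10 ms tick, per-tick enforcement overshoots by 7.5 ms, while charging at the exhaustion deadline overshoots by nothing in this model. With the tick masked, the per-tick variant would have no enforcement point at all, which is why the caveat became load-bearing.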

Cleanup: Retire Benchmark-Driven Scaffolding Before Phase E

This section captures simplification work identified during the post-thread-scale SMP/threading architecture review on 2026-05-01 23:20 EEST. None of these items are regressions: the affected code is correct, gated behind the measure feature where it should be, and was added intentionally during attribution and placement slices that closed the In-Process Threading Scalability milestone. They are recorded here so the next selected scheduler milestone does not extend or formalize speculative SMP scaffolding that the current per-CPU WFQ scheduler does not need.

The cleanup is subordinate to the current selected milestone and to already-open review-finding task records. Pick it up as Phase E preflight work before SchedulingContext claims the scheduler surface. Each removal must preserve the documented runnable-ownership invariants from docs/architecture/scheduling.md (single dispatch owner per live ThreadRef across per-CPU current/handoff_current slots, the per-CPU WFQ run queues, and the direct IPC target; scheduler-lock-contained migration; allocation-free timer/unblock/direct-IPC-fallback/requeue/steal-requeue paths) and the recorded benchmark-only counter policy. The 2026-05-02 per-CPU run-queue collapse and the accepted 2026-05-10 Phase D WFQ reintroduction are now both historical evidence: the single-global-queue shape had accepted 1-to-2 evidence but a 1-to-4 diagnostic gap (capOS 1.566x/1.538x vs Linux 3.963x/3.858x), and Phase D manually accepted the 2026-05-10 per-CPU WFQ 1-to-4 diagnostic (capOS 3.088x/2.700x; matching Linux 3.974x/3.850x on the same pin set) after the harness-enforced 1-to-2 gates stayed green.
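
As a reading aid for the speedup figures above, the helpers below turn a speedup into parallel efficiency and into a fraction of the Linux baseline. Both function names are hypothetical, and the derived numbers are plain arithmetic on the quoted figures, not additional measurements.

```rust
/// Fraction of ideal linear scaling achieved: speedup divided by workers.
pub fn parallel_efficiency(speedup: f64, workers: u32) -> f64 {
    speedup / f64::from(workers)
}

/// How much of the baseline's speedup was reached on the same pin set.
pub fn relative_to_baseline(speedup: f64, baseline_speedup: f64) -> f64 {
    speedup / baseline_speedup
}
```

On the work figures, the single-global-queue 1-to-4 result (1.566x) is about 0.39 efficiency and about 0.40 of the Linux speedup, while the Phase D per-CPU WFQ result (3.088x) is about 0.77 efficiency and about 0.78 of the Linux speedup.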

Grounding read before any slice:

docs/architecture/scheduling.md
docs/proposals/scheduler-evolution-proposal.md
docs/proposals/smp-proposal.md
docs/backlog/smp-phase-c.md
kernel/src/sched.rs
kernel/src/process.rs
kernel/src/measure.rs
kernel/src/arch/x86_64/{smp.rs,lapic.rs,percpu.rs,tlb.rs}

Acceptance rule for every slice below: each removal must land with a host or QEMU test that fails without it, so a future reintroduction is explicit authority work rather than silent regression of an undocumented feature.

2026-05-02 08:07 UTC: Retired the timer continuation fast path, its per-CPU skip budget, and the slow-path-required mirror flags. Deleted try_continue_current_on_timer_tick, mark_timer_slow_path_required, reset_current_cpu_timer_fast_path_skip_count, note_timer_slow_path_completed_locked (both feature variants), scheduler_has_hard_timer_slow_path_work_locked_excluding_endpoint_queue, scheduler_timer_slow_path_reasons_locked, the TimerBlockedWaiterKind / blocked_thread_* helpers, and the four atomic mirrors TIMER_SLOW_PATH_REQUIRED, TIMER_FAST_PATH_SKIP_COUNTS, CURRENT_NON_IDLE_CPUS, and TIMER_FAST_PATH_MAX_CONSECUTIVE_SKIPS. set_current_thread_locked no longer publishes CURRENT_NON_IDLE_CPUS. The timer interrupt entry in kernel/src/arch/x86_64/context.rs now always calls crate::sched::schedule(context) instead of trying the lock-free fast path. Eight mark_timer_slow_path_required() call sites in kernel/src/sched.rs (run-queue publish, pending process drop, park-with-deadline, process termination queue, direct-IPC handoff, timer sleep enqueue, cap-enter-with-deadline, pending thread stack release, pending endpoint cancellation push) were also dropped; they are no-ops once the fast path no longer exists. Verified that make run-spawn exits cleanly ([init] Spawn cap-table exhaustion check ok., proc: process 2 exited with code 0, sched: last process exited, halting) and make run-smoke runs the scripted login flow to the operator session. cargo build --features qemu is warning-free (project rule). Reintroduce the fast path only if a future Phase D or Phase F slice ships an evidence pair where it measurably reduces scheduler-lock hold time on a contended SMP run.

Follow-up partial 2026-05-02 08:39 UTC: `kernel/src/measure.rs`
lost the eight public API entry points (`timer_fast_path_attempt`,
`timer_fast_path_continue`,
`timer_fast_path_slow_required_fallback`,
`timer_fast_path_skip_budget_fallback`,
`timer_fast_path_pending_reschedule_fallback`,
`timer_fast_path_no_current_non_idle_fallback`,
`timer_fast_path_inactive_invalid_cpu_fallback`, and
`timer_slow_summary`) plus the now-orphaned `TimerSlowSummaryReasons`
struct and its `requires_slow_path` impl. `cargo build --features
qemu,measure` is back to warning-free.

Follow-up complete 2026-05-02 21:00 UTC: the deeper deletion slice
removed the seven `TIMER_FAST_PATH_*` static counters, the
`TimerCounter::FastPath*` enum variants, the
`TimerSlowSummaryCounter` enum, the `TIMER_SLOW_SUMMARY_*` counter
arrays (`TIMER_SLOW_SUMMARY_COUNTER_VALUES`,
`CASE_START_TIMER_SLOW_SUMMARY_COUNTERS`,
`PREVIOUS_TIMER_SLOW_SUMMARY_COUNTERS`,
`PHASE_TIMER_SLOW_SUMMARY_COUNTERS`), the
`(TimerSlowSummaryCounter, &str)` reporting table, the
`Snapshot.timer_slow_summary_counters` field, and the matching
reset/diff/print helpers and accessors. `TIMER_COUNTER_COUNT`
shrank from 11 to 4 (interrupts, user_scheduler, kernel_only,
bsp_tick_advances). The `measure: timer ...` line is now compact
and the `measure: timer_slow_summary ...` line is no longer
emitted at all. `tools/qemu-thread-scale-harness.sh` dropped the
`fast_path_*` clauses and the `timer_slow_summary` aggregate /
per-phase grep checks in the same slice, satisfying the
"removal must land with a host or QEMU test that fails without it"
acceptance rule. Verified with `make fmt-check`,
`cargo build --features qemu` (warning-free),
`cargo build --features qemu,measure` (warning-free),
`cargo test-lib` (171 passed), `make run-spawn`, and `make
run-measure` (proof line emitted, exit 0). A local one-iteration
`CAPOS_THREAD_SCALE_RUNS=1 CAPOS_THREAD_SCALE_GUEST_MEASURE=1 make
test-thread-scale` was used solely as functional verification of
the harness parser against the new measure-output shape (no CPU
pinning, single iteration; the run reported `qemu taskset cpus:
none` and the resulting medians/speedups are diagnostic only).
This slice is a measure-output cleanup, not a scheduler-structure
change, so it does not require controlled benchmark-VM timing
evidence under the Phase A "before/after each scheduler structure
change" rule; the harness fail-without-the-kernel-change pairing
is the acceptance gate.

2026-05-01 22:01 UTC: Collapsed the asymmetric scheduler CPU sizing. MAX_SCHEDULER_CPUS = 64 was deleted, MAX_SCHEDULER_CLEANUP_CPUS = 4 was renamed to a single SCHEDULER_CPUS = 4, and SchedulerDispatch.current[] was resized from 64 to SCHEDULER_CPUS to match run_queues, handoff_current, idle_pids, idle_threads, pending_thread_stack_release, TIMER_FAST_PATH_SKIP_COUNTS, and SCHEDULER_CPU_MASK. The dual current_cpu_slot() / current_cleanup_slot() helpers collapsed into a single current_cpu_slot() that bounds-checks against SCHEDULER_CPUS and panics on overflow with "scheduler: CPU id {} exceeds scheduler-owned mask". scheduler_cpu_slot(cpu_id) -> Option<usize> is retained for the non-panicking lookup. The earlier “raw CPU id 0..63 vs scheduler slot 0..3” indexing distinction is gone. Reintroduce a wider id-to-slot mapping only when a Phase D/F slice grows the scheduler-owned mask beyond the current four. Verified with cargo build --features qemu and cargo build --features qemu,measure (both warning-free) plus make run-smoke and make run-spawn on 2026-05-01.
2026-05-02 09:26 UTC: Replaced the per-CPU run-queue array with a single global run_queue: VecDeque<ThreadRef>. SchedulerDispatch keeps run_queue_live_reservations as a single counter; the reserve_run_queue_capacity_for_thread_locked / release_run_queue_capacity_reservations_locked / push_reserved_run_queue_locked triple still bounds growth but operates on the single queue. enqueue_ready_thread_on_cpu_locked, run_queue_target_cpu_locked, the created_thread_target_cpu_locked placement chain (active_ready_scheduler_cpu_mask, non_idle_dispatch_load_locked, least_loaded_scheduler_cpu_*, caller_current_scheduler_cpu_slot_locked), the CreatedThreadPublishPolicy / CreatedThreadTarget types, the scheduler_cpu_scan_order helper, and the crate::measure::thread_placement_publish_caller_* reporting surface are all gone. WakePolicy::QueueCpu(usize) collapsed to WakePolicy::QueueAny. wake_idle_scheduler_cpus_locked walks eligible idle scheduler CPUs and stops only after the first one that accepts a fresh reschedule IPI; CPUs that already have a pending IPI (or that fail LAPIC delivery) are skipped without breaking, so a burst of ready work cross-wakes more than one neighbor for both queue and direct-target wakes. publish_created_thread no longer takes a caller_thread argument and no longer emits a per-CPU placement record: under the single global queue there is no per-CPU publish target, and hard-coding CPU0 misclassified normal worker publishes as single-owner-CPU0. Phase D later reintroduced the per-CPU split without restoring those publish counters; reintroduce them only through a separate operator-observability slice.
Verified with `cargo build --features qemu` and `cargo build
--features qemu,measure` (both warning-free) plus `make run-spawn`
and `make run-smoke`. A post-collapse 3-run diagnostic
`make test-thread-scale` on the benchmark VM (`taskset 0,1,2,3`,
enforcement disabled) on 2026-05-02 10:42 UTC measured
1-to-2 work/total `1.890x`/`1.792x` (slight improvement over the
pre-collapse 1-to-2) and 1-to-4 work/total `1.504x`/`1.436x`
(clear regression vs the pre-collapse 1-to-4): single-queue
scheduler-lock contention dominates at 4 workers. The numbers
live in `docs/benchmarks.md` as diagnostic. Phase D later
brought per-CPU queues back with a fair-share enqueue policy and
formal accepted evidence (capOS plus Linux baseline, full
enforcement, multiple runs, recorded host caveats).
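
The cross-wake walk that the single-queue entry above describes for wake_idle_scheduler_cpus_locked can be sketched as a loop that stops only on the first accepted IPI. `IpiResult`, `wake_idle_cpus`, and the closure parameter are hypothetical stand-ins for the kernel's pending-IPI check and LAPIC delivery.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum IpiResult {
    /// The CPU accepted a fresh reschedule IPI.
    Accepted,
    /// The CPU already has a reschedule IPI pending.
    AlreadyPending,
    /// LAPIC delivery failed.
    DeliveryFailed,
}

/// Walk eligible idle CPUs in order and return the first one that accepts a
/// fresh IPI. Pending or failed CPUs are skipped without ending the walk.
pub fn wake_idle_cpus(
    idle_cpus: &[usize],
    mut send_ipi: impl FnMut(usize) -> IpiResult,
) -> Option<usize> {
    for &cpu in idle_cpus {
        match send_ipi(cpu) {
            IpiResult::Accepted => return Some(cpu),
            IpiResult::AlreadyPending | IpiResult::DeliveryFailed => continue,
        }
    }
    None
}
```

Because a CPU with a pending IPI does not end the walk, each publish in a burst of ready work can reach a different idle neighbor instead of repeatedly hitting the first one.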

2026-05-02 07:00 UTC: Lifted endpoint-cancellation retry storage out of the scheduler lock. The pending_endpoint_cancellations: VecDeque field is gone from Scheduler; it now lives in a dedicated static PENDING_ENDPOINT_CANCELLATIONS: Lazy<Mutex<VecDeque<...>>> with bounded try_reserve_exact(MAX_PENDING_ENDPOINT_CANCELLATIONS) reservation, eagerly forced in init_idle via Lazy::force so the allocation never lands in a timer/exit cleanup path. The queue’s len() under its own mutex is the single source of truth for pending_endpoint_cancellations non-emptiness. Producers (queue_pending_endpoint_cancellation, remove_pending_endpoint_cancellations_for_pid, remove_pending_endpoint_cancellations_for_thread) and the drain (drain_pending_endpoint_cancellations) take only the queue mutex; the scheduler lock is acquired only briefly inside queue_pending_endpoint_cancellation to validate the target thread is live and has a ring scratch. defer_endpoint_cancellation previously re-acquired the scheduler lock just to push to the fallback queue; that re-acquisition is gone.

`note_timer_slow_path_completed_locked` (consumer) holds the queue
mutex across both the `!is_empty()` check and the
`TIMER_SLOW_PATH_REQUIRED.store`, and the producer
`queue_pending_endpoint_cancellation` stores
`TIMER_SLOW_PATH_REQUIRED = true` inside the queue lock alongside
its push, so a concurrent producer cannot push between the
consumer's read and store and have its slow-path mark be overwritten.

The functional contract is preserved: a cancellation that cannot
deliver immediately because the target ring scratch is contended
still falls back to the bounded retry queue, still raises
`TIMER_SLOW_PATH_REQUIRED`, and is still drained on the next
scheduler tick. Bound is unchanged
(`MAX_PENDING_ENDPOINT_CANCELLATIONS = MAX_CAP_SLOTS *
MAX_ENDPOINT_CANCELLATION_OBJECT_SWEEPS *
MAX_ENDPOINT_CANCEL_NOTIFICATIONS_PER_ENDPOINT * SCHEDULER_CPUS`);
the open size-tightening question (whether the `SCHEDULER_CPUS`
multiplier is still load-bearing now that producers no longer hold
the scheduler lock) is deferred to a future slice with bench evidence.

A possible follow-on slice would move retry storage to per-endpoint
bounded slots so each endpoint object owns its own queue, but that
requires reshaping the `(thread, user_data)` payload to be addressable
from an endpoint object and is non-trivial. The current move is
sufficient to get the storage out of the scheduler lock and unblock
future scheduler-lock-hold-time analysis.

Verified with `cargo build --features qemu` and
`cargo build --features qemu,measure` (both warning-free) plus
`make run-spawn` and `make run-smoke` on 2026-05-02. Review found and
fixed a `Lazy` initialization that could run in interrupt paths and a
slow-path-clearing race against producer publication.
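
The locking discipline of that endpoint-cancellation slice, as it stood before the later fast-path retirement removed the flag, can be sketched with standard-library types. Everything here is a host-side stand-in: `PendingCancellations`, `MAX_PENDING`, the `(u64, u64)` payload, and the method names are hypothetical, and `std::sync::Mutex` replaces the kernel's `Lazy<Mutex<...>>`.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

/// Stand-in for MAX_PENDING_ENDPOINT_CANCELLATIONS.
pub const MAX_PENDING: usize = 8;

pub struct PendingCancellations {
    queue: Mutex<VecDeque<(u64, u64)>>, // (thread, user_data)
    slow_path_required: AtomicBool,
}

impl PendingCancellations {
    /// Reserve the full bound up front so later pushes never allocate.
    pub fn new() -> Self {
        let mut q = VecDeque::new();
        q.try_reserve_exact(MAX_PENDING).expect("reserve retry storage at init");
        Self { queue: Mutex::new(q), slow_path_required: AtomicBool::new(false) }
    }

    /// Producer: push and raise the flag inside the same critical section.
    pub fn push(&self, thread: u64, user_data: u64) -> bool {
        let mut q = self.queue.lock().unwrap();
        if q.len() >= MAX_PENDING {
            return false; // bounded: refuse rather than grow
        }
        q.push_back((thread, user_data));
        self.slow_path_required.store(true, Ordering::Release);
        true
    }

    /// Drain under the queue mutex only; no scheduler lock is involved.
    pub fn drain(&self, mut deliver: impl FnMut((u64, u64))) {
        let mut q = self.queue.lock().unwrap();
        while let Some(entry) = q.pop_front() {
            deliver(entry);
        }
    }

    /// Consumer: hold the mutex across the emptiness check and the store so
    /// a concurrent push cannot have its mark overwritten.
    pub fn note_completed(&self) {
        let q = self.queue.lock().unwrap();
        self.slow_path_required.store(!q.is_empty(), Ordering::Release);
    }

    pub fn slow_path_required(&self) -> bool {
        self.slow_path_required.load(Ordering::Acquire)
    }
}
```

Storing the flag outside the lock in `note_completed` would reopen the race: a producer could push and set the flag between the consumer's emptiness read and its store, and the store would then clear a mark that still has work behind it.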

2026-05-01 21:38 UTC: Feature-gated the first ThreadCpuAccounting experiment end-to-end behind cfg(feature = "measure"). That slice temporarily compiled the whole accounting record, its accessors, and scheduler call sites only when the feature was enabled. Phase D later superseded this temporary shape: runtime_ns, virtual_runtime_ns, and last_started_ns are now unconditional normal-build fields because WFQ ordering, SchedulingPolicyCap.snapshot, and SchedulingContext budget charging depend on them. The remaining diagnostic counters (context_switches, preemptions, voluntary_blocks, migrations, last_cpu, blocked/exited stability observations, placement buckets, and per-phase attribution counters) stay behind cfg(feature = "measure"). The 2026-05-01 slice was verified with cargo build --features qemu and cargo build --features qemu,measure (both warning-free) plus make run-spawn (non-measure default) on 2026-05-01. make run-measure was broken on main at the time of this slice for unrelated reasons; that regression was repaired on 2026-05-02 20:23 UTC (see docs/backlog/scheduler-evolution.md and the docs/changelog.md Measure Mode Repair entry).
2026-05-01 21:02 UTC: Retired the RUNNABLE_PROCESS_EXIT_CLEANUP_PROOF_PRINTED, RUNNABLE_THREAD_EXIT_CLEANUP_PROOF_PRINTED, and CPU_ACCOUNTING_PROOF_PRINTED once-flag log lines along with their Atomic* gating booleans, the three print_*_once / maybe_print_*_for_thread_locked helpers in kernel/src/sched.rs, and their four call sites. The runnable-cleanup invariants remain enforced by the unconditional assert_no_runnable_pid_entry_locked and assert_no_runnable_thread_entry_locked panics already in kernel/src/sched.rs; a regression that leaves stale runnable owner state still panics the kernel and fails make run-spawn. The tools/qemu-spawn-smoke.sh harness lost its three matching grep -Fq lines for the same reason. The orphaned Process::account_thread_exited_stable_observed / ThreadCpuAccounting::observe_exited_stable helpers were deleted with the print; the remaining ThreadCpuAccounting writes stay untouched for the upcoming feature-gate slice. The pub fn thread_cpu_accounting accessor moved behind cfg(feature = "measure") because its only remaining caller is the measure-gated account_thread_selected_locked placement counter bridge.
Cache the active CPU id in the per-CPU GS-relative slot. arch::percpu::current_cpu_id reads the LAPIC ID MMIO register and then linearly scans CPU_LAPIC_IDS[0..64] on every call. The timer fast-path consumer was retired on 2026-05-02 (see the “Retired the timer continuation fast path” entry above), but the function still runs from the syscall path and from non-syscall kernel contexts: arch::context::advance_bsp_tick, the scheduler’s CPU-slot accounting and dispatch lookups in sched.rs, arch::tlb::flush_pending_for_current_cpu, and mem::paging invalidation paths. The hot caller is the syscall entry path; the non-syscall callers are why a drop-in GS-relative replacement is harder than the cleanup item first suggested. The single-mov lookup conceptually wants mov %gs:offset, %eax, but the slice is blocked on a kernel-mode GS-base invariant: today the kernel sets KernelGsBase via set_kernel_gs_base and only the syscall assembly does swapgs to make gs:0..16 resolve at PerCpu while handling a syscall. In normal kernel context (timer ISR, scheduler from non-syscall paths, paging init, AP bring-up), the active GS base is whatever Limine left, not the PerCpu address. A drop-in replacement of current_cpu_id with gs:[offset] therefore faults outside syscall context (verified 2026-05-02: reordering init_bsp to set KernelGsBase before set_kernel_entry_stack is necessary but not sufficient because the active GS base is still not the PerCpu address). The enabling work is either establishing a kernel-mode invariant that GS_BASE = PerCpu in CPL0 (typically by swapgs-ing on every kernel entry/exit, including interrupt handlers) or adopting a hybrid: a GS-relative read in the syscall path plus the existing LAPIC-based path everywhere else. Both options are larger than a single retirement slice and should land with their own gates. Until then this item stays open and current_cpu_id keeps the LAPIC MMIO + CPU_LAPIC_IDS scan.
2026-07-18 19:54 UTC: Reassessed the scheduler-lock-site instrumentation breadth after Phases A-F. Measure builds retain five policy-relevant classes: pre-ring scheduling separates SQPOLL/nohz ring work, selection covers the locked WFQ choice path, blocking and wake/unblock preserve latency/donation attribution, and metadata records control-plane policy pressure outside dispatch transitions. Generic acquisition, process exit, thread exit, and startup/idle selection now fold into the aggregate measure: scheduler_lock counters; process and thread exit retain their existing focused segment counters. Non-measure builds still take SCHEDULER.lock() directly. The measure smoke enforces the narrowed output contract, and the optional thread-scale parser uses the same retained-site list.
2026-07-18 19:54 UTC: Reassessed single_cpu_owner_pids, direct_ipc_target, and handoff_current after Phase E had already landed. All three remain load-bearing rather than obsolete scaffolding: single-owner pinning contains process-wide endpoint/launch authority that lacks explicit SMP ownership; the direct-IPC slot is the bounded, generation-checked receiver preference before WFQ fallback; and the per-CPU handoff slot preserves the one-runnable-owner invariant during a context switch and participates in nohz runnable-entity admission. Their field comments now state those roles; deletion remains coupled to the corresponding ownership replacement rather than scheduler cleanup.
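
The hybrid option in the `current_cpu_id` caching item above can be modeled in user-space Rust. This is a minimal sketch, not the kernel's code: `KernelContext`, `cpu_id_by_lapic_scan`, and `current_cpu_id_hybrid` are hypothetical names, the plain array stands in for `CPU_LAPIC_IDS`, and the cached value stands in for what a GS-relative read would return once the GS base is known to point at `PerCpu`.

```rust
/// Table size taken from the `CPU_LAPIC_IDS[0..64]` scan described above.
const MAX_CPUS: usize = 64;

/// Models the existing path: linearly scan the online part of the table
/// for the slot whose LAPIC ID matches the one read from the LAPIC.
fn cpu_id_by_lapic_scan(
    lapic_ids: &[u32; MAX_CPUS],
    online: usize,
    lapic_id: u32,
) -> Option<usize> {
    lapic_ids[..online.min(MAX_CPUS)]
        .iter()
        .position(|&id| id == lapic_id)
}

/// Hypothetical tag for the context asking for the CPU id.
#[derive(Clone, Copy)]
enum KernelContext {
    /// Syscall path: swapgs has run, so the per-CPU slot is reachable.
    /// `cached_cpu_id` models the value a GS-relative load would return.
    Syscall { cached_cpu_id: usize },
    /// Timer ISR, non-syscall scheduler paths, paging init, AP bring-up:
    /// GS base is not trusted, so only the LAPIC ID is available.
    NonSyscall { lapic_id: u32 },
}

/// Hybrid lookup: cached read on the syscall path, table scan elsewhere.
fn current_cpu_id_hybrid(
    ctx: KernelContext,
    lapic_ids: &[u32; MAX_CPUS],
    online: usize,
) -> Option<usize> {
    match ctx {
        KernelContext::Syscall { cached_cpu_id } => Some(cached_cpu_id),
        KernelContext::NonSyscall { lapic_id } => {
            cpu_id_by_lapic_scan(lapic_ids, online, lapic_id)
        }
    }
}
```

In this model the syscall variant never touches the table, which is the saving the item is after. The non-syscall variant keeps today's scan, so it stays correct without any kernel-mode GS-base invariant.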

Keep an honest scaling proof when scheduler work resumes. Completed 2026-05-02 21:38 UTC on the benchmark VM against main commit 374f8556. Five-run controlled paired evidence, with both the capOS and Linux pthread collections pinned to physical-core logical CPUs 0,1,2,3 on a 4-core/8-thread n2-highcpu-8 host with KVM:

| Comparison | capOS | Linux pthread | capOS gate | capOS verdict |
| --- | ---: | ---: | ---: | --- |
| 1→2 work  | `1.883x` | `1.988x` | ≥ `1.6x` | accepted |
| 1→2 total | `1.787x` | `1.987x` | ≥ `1.6x` | accepted |
| 1→4 work  | `1.566x` | `3.963x` | ≥ `1.6x` | diagnostic |
| 1→4 total | `1.538x` | `3.858x` | ≥ `1.6x` | diagnostic |

Linux scales near-linearly on the same physical CPU set (1-to-2
`1.99x`, 1-to-4 `3.96x`), so the workload shape is sound and the
capOS 1-to-4 gap is a scheduler bottleneck, not a benchmark
artifact. The 1-to-2 result was the formal accepted gate against
the single-global-queue scheduler. The 1-to-4 result became the
bottleneck-attribution diagnostic that justified Phase D's fair-share
enqueue policy; Phase D later manually accepted the `2026-05-10` WFQ
1-to-4 diagnostic pair recorded above while the harness-enforced gates
remained the 1-to-2 work/total speedups.
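
The verdict column in the table can be reproduced from the speedups and the gate. The following is a sketch under assumptions: the function names and the `Rejected` label are hypothetical (the table only shows `accepted` and `diagnostic`), speedup is assumed to be the 1-thread time divided by the N-thread time, and only harness-enforced comparisons are assumed able to fail.

```rust
/// Verdict labels; `Rejected` is an assumed name for an enforced gate that fails.
#[derive(Debug, PartialEq)]
enum Verdict {
    Accepted,
    Rejected,
    Diagnostic,
}

/// Speedup of an N-thread run over the 1-thread run.
/// Both times must be in the same unit.
fn speedup(single_thread_time: f64, multi_thread_time: f64) -> f64 {
    single_thread_time / multi_thread_time
}

/// Fraction of ideal linear scaling achieved at `threads` threads.
fn parallel_efficiency(observed: f64, threads: u32) -> f64 {
    observed / f64::from(threads)
}

/// Enforced comparisons pass or fail against the gate;
/// non-enforced comparisons are reported as diagnostic only.
fn verdict(observed: f64, gate: f64, enforced: bool) -> Verdict {
    if !enforced {
        Verdict::Diagnostic
    } else if observed >= gate {
        Verdict::Accepted
    } else {
        Verdict::Rejected
    }
}
```

With the table's work figures, `parallel_efficiency(1.566, 4)` is about 0.39 for capOS and `parallel_efficiency(3.963, 4)` is about 0.99 for Linux, which is the gap the paragraph above attributes to the scheduler.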

Benchmark shape: blocking parent join, 262,144 blocks (16 MiB),
`work_rounds=64`, 5 runs per case (the capOS harness default is 3
runs; this collection explicitly set `CAPOS_THREAD_SCALE_RUNS=5`
for parity with the Linux baseline default). Host caveats:
internal benchmark VM in a single GCP zone, status `RUNNING`
during collection, machine `n2-highcpu-8` with nested
virtualization enabled, `/dev/kvm` readable+writable without
sudo, SSH operator account, kernel `Linux 6.17.0-1012-gcp
x86_64`, CPU `Intel(R) Xeon(R) CPU @ 2.80GHz`, distinct
physical-core layout (logical CPUs 0-3 are core IDs 0-3 thread
0; logical CPUs 4-7 are the SMT siblings), `qemu-system-x86_64
8.2.2`, `rustc 1.97.0-nightly (c935696dd 2026-04-29)`.
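
The one-logical-CPU-per-core pinning set can be derived from SMT sibling lists instead of being assumed. This is a minimal sketch, assuming the lists come from Linux's `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list` files in the `0,4` or `0-1` list format. The helper names are hypothetical, and the sketch takes the strings as input instead of reading sysfs.

```rust
/// Parse one CPU list such as "0,4" or "0-1" into sorted, unique logical CPU ids.
/// Returns None if any element is not a number or a `lo-hi` range.
fn parse_cpu_list(list: &str) -> Option<Vec<usize>> {
    let mut cpus = Vec::new();
    for part in list.trim().split(',') {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let lo: usize = lo.trim().parse().ok()?;
                let hi: usize = hi.trim().parse().ok()?;
                cpus.extend(lo..=hi);
            }
            None => cpus.push(part.trim().parse().ok()?),
        }
    }
    cpus.sort_unstable();
    cpus.dedup();
    Some(cpus)
}

/// Keep the lowest-numbered logical CPU of every sibling group,
/// which gives one logical CPU per physical core.
fn one_cpu_per_core(sibling_lists: &[&str]) -> Option<Vec<usize>> {
    let mut picked = Vec::new();
    for list in sibling_lists {
        let first = *parse_cpu_list(list)?.first()?;
        if !picked.contains(&first) {
            picked.push(first);
        }
    }
    picked.sort_unstable();
    Some(picked)
}
```

For the layout described above (CPUs 4-7 are the SMT siblings of 0-3), the sibling lists `0,4`, `1,5`, `2,6`, `3,7` yield `[0, 1, 2, 3]`, matching the `0,1,2,3` taskset values used in the commands below.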

Exact commands:

```sh
# capOS
PATH="$HOME/.cargo/bin:$PATH" \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  CAPOS_THREAD_SCALE_RUNS=5 \
  CAPOS_THREAD_SCALE_REQUIRE_SPEEDUP=1 \
  CAPOS_THREAD_SCALE_REQUIRE_TOTAL_SPEEDUP=1 \
  CAPOS_THREAD_SCALE_TIMESTAMP=20260502T213544Z \
  make test-thread-scale

# Linux pthread baseline
PATH="$HOME/.cargo/bin:$PATH" \
  LINUX_THREAD_SCALE_TASKSET_CPUS=0,1,2,3 \
  LINUX_THREAD_SCALE_RUNS=5 \
  LINUX_THREAD_SCALE_TIMESTAMP=20260502T213445Z \
  make test-linux-thread-scale-baseline
```

Raw artifacts are on the benchmark VM at
`target/thread-scale/20260502T213544Z/` and
`target/linux-thread-scale/20260502T213445Z/`. The instance was
stopped after collection.
