Scheduler Evolution Backlog
This backlog decomposes future scheduler architecture from
Scheduler Evolution. It also
retains the completed attribution and placement history that closed the
In-Process Threading Scalability milestone; new selected-milestone work now
continues from docs/tasks/README.md.
Design Grounding Checklist
Before implementation slices, read:
- docs/architecture/scheduling.md
- docs/backlog/smp-phase-c.md
- docs/proposals/smp-proposal.md
- docs/proposals/ring-v2-smp-proposal.md
- docs/proposals/tickless-realtime-scheduling-proposal.md
- docs/proposals/stateful-task-job-graphs-proposal.md
- docs/proposals/scheduler-evolution-proposal.md
- docs/proposals/system-performance-benchmarks-proposal.md
- docs/proposals/hpc-parallel-patterns-proposal.md
- docs/research/future-scheduler-architecture.md
- docs/research/nohz-sqpoll-realtime.md
- docs/research/out-of-kernel-scheduling.md
- docs/research/completion-ring-threading.md
- docs/research/hpc-parallel-patterns.md
For realtime or isolation slices, also read:
- docs/research/multimedia-pipeline-latency.md
- docs/research/robotics-realtime-control.md
- docs/research/x2apic-and-virtualization.md
Phase A: Attribution and Guardrails
- Finish first-pass thread-scale attribution guardrails. Scheduler candidate/outcome, reschedule-IPI, serial-byte, scheduler-lock, timer interrupt, CR3/TLB, raw guest-PC sample, logging-suppression A/B, exact Linux pthread baseline, compact-versus-padded result-slot diagnostic, and larger-workload/Amdahl evidence now exist. The evidence does not identify the primary remaining non-scaling cause; it keeps per-CPU runnable ownership, accepted threshold-passing work/total evidence, and optional symbolic attribution as follow-on work.
- Add bounded scheduler-lock site attribution before a structural lock
split. As of 2026-05-01 09:52 UTC, measure builds keep the compatible
aggregate
`scheduler_lock` line and also emit aggregate plus per-phase `scheduler_lock_site` counters for generic, timer pre-ring, timer select, blocking, process exit, thread exit, start/idle selection, wake/unblock, and metadata classes. This is split-prep attribution only; it does not accept the in-process thread-scale milestone.
- Add timer-fast-path attribution for the bounded continuation path. As of
2026-05-01 10:58 UTC, measure builds extend the aggregate and per-phase
`timer` counter lines with fast-path attempts, continues, and fallback reasons for slow-required/dirty summaries, skip-budget exhaustion, pending reschedule IPIs, no-current/non-idle CPUs, and inactive/invalid scheduler CPUs. The thread-scale harness requires those fields only for `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`. This is attribution only; it does not change scheduler behavior and does not close the current `accepted=false` work or total gates. Local one-run evidence in `target/thread-scale/20260501T110157Z/` passed with the new fields present in every 1/2/4-thread `measure.log`; the timed work phase recorded `fast_path_continues=0` for all three rows.
- Add timer slow-summary reason attribution for dirty fast-path summaries.
As of 2026-05-01 11:28 UTC, measure builds emit aggregate and per-phase
`timer_slow_summary` lines with required/clean counts plus reason fields for nonempty run queues, direct IPC targets, handoff-current state, pending process termination/drop/stack release, timer sleeps, and timed cap-enter versus park waiters. The harness requires those lines only for `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`. Local one-run evidence in `target/thread-scale/20260501T112359Z/` passed with the new lines present in every 1/2/4-thread `measure.log`; the timed work phase reported dirty summaries attributable to `run_queue_nonempty` and `handoff_current` only, with `required=2/4/8`, `clean=0`, and timer sleeps/timed waiters at zero for the 1/2/4-thread rows. The subsequent fairness-only behavior slice keeps the same fields, but `required` now means direct IPC, deferred cleanup, timer sleeps, or timed waiter work still force the next locked timer pass.
- Complete thread-scale shared-kernel-state contention attribution
guardrails beyond the first measure-only lock-counter slice. As of
2026-05-01 08:07 UTC,
`CAPOS_THREAD_SCALE_GUEST_MEASURE=1` emits aggregate and per-phase `shared_kernel_lock` counters for frame allocator alloc/free locks, ring-dispatch cap-table and ring-scratch locks before `cap::ring::process_ring`, endpoint inner/cancellation scratch locks, direct per-process address-space locks, and heap allocator locks. As of 2026-05-01 08:29 UTC, fresh thread-scale rows also carry explicit benchmark-class fields and the harness requires, validates, and exports those fields to `results.csv`; local one-run evidence is retained in `target/thread-scale/20260501T083254Z/`. As of 2026-05-01 08:49 UTC, guest-measure runs also emit and require aggregate and per-phase `network_poll` counters for initialized virtio-net scheduler/runtime/interface polling, the built-in TCP HTTP proof poll, virtqueue poll spins and completions, and pending network waiter scans. Local one-run evidence in `target/thread-scale/20260501T093505Z/` passed and retained zero aggregate and per-phase network/poll counters for the 1/2/4-thread rows. The default thread-scale manifest has no virtio-net device, and the scheduler poll entry returns before the driver mutex in that no-device case. Those counters are expected zero-evidence for the CPU-bound thread-scale benchmark. They do not prove service throughput; future service/network benchmarks still need their own hot-section attribution and acceptance evidence.
- Add a benchmark-kernel mode that suppresses per-context-switch logging
during measured cases so serial MMIO cannot masquerade as scheduler cost.
Completed with
`CAPOS_THREAD_SCALE_SUPPRESS_SWITCH_LOGS=1`; benchmark proof/error output and measure lines remain enabled.
- Decide which counters are permanent observability and which stay behind
`measure`. Completed 2026-05-01 04:55 UTC in `docs/architecture/scheduling.md`: all existing `kernel/src/measure.rs` counters remain benchmark-only behind the `measure` feature. Permanent scheduler observability should be added later through a separate low-overhead operator snapshot surface after the Phase C runtime accounting ledger exists, starting with runtime, context-switch, preemption, voluntary-block, migration, queue-depth, reschedule-IPI, TLB-shootdown, and policy admission/denial counts. Phase/cycle attribution, scheduler-lock wait/hold cycles, serial byte attribution, timer/TLB benchmark totals, raw user-PC samples, and thread-scale phase checkpoints stay behind `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`. Grounding read: `docs/architecture/scheduling.md`, `docs/proposals/scheduler-evolution-proposal.md`, `docs/research/future-scheduler-architecture.md`, `docs/research/out-of-kernel-scheduling.md`, `docs/research/nohz-sqpoll-realtime.md`, and `docs/research/completion-ring-threading.md`.
- Record controlled benchmark-VM evidence before and after each scheduler
structure change.
Latest follow-up after the first Phase C runtime-accounting slice reran
the in-process thread-scale diagnostic at main commit
`a88e7906` with QEMU pinned to physical-core logical CPUs `0-3` and SMT logical CPUs `0-7`. All rows remained `accepted=false`: physical 1/2/4 work speedups were `1.000x` and `0.999x`, and SMT 1/2/4/8 work speedups were `1.000x`, `1.001x`, and `0.333x`. Follow-up after the total-speedup host-summary gate landed reran current `main` commit `f198b099` on the benchmark VM with QEMU pinned to `0-3` and `0-7`. The harness now reports total-speedup diagnostics explicitly: physical 1/2/4 work speedups were `1.002x` and `1.002x`, total speedups were `0.911x` and `0.601x`; SMT diagnostic 1/2/4/8 work speedups were `1.001x`, `0.998x`, and `0.333x`, total speedups were `0.913x`, `0.621x`, and `0.200x`. Both host-summary gates remain unsatisfied.
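The speedup figures quoted in this backlog appear to be the one-worker median divided by the N-worker median. The sketch below shows that arithmetic and checks it against the physical `0-3` medians recorded under Phase B. The function names are hypothetical and are not taken from the thread-scale harness.

```rust
/// Ratio of the one-worker median to the N-worker median.
/// Values above 1.0 mean the N-worker row finished faster than the baseline.
fn speedup(baseline_median: u64, row_median: u64) -> f64 {
    baseline_median as f64 / row_median as f64
}

/// Format a speedup with three decimals and an `x` suffix, e.g. `0.919x`.
fn format_speedup(baseline_median: u64, row_median: u64) -> String {
    format!("{:.3}x", speedup(baseline_median, row_median))
}
```

With the recorded physical `0-3` total medians, `format_speedup(140953762, 153327094)` reproduces the `0.919x` two-worker row, which suggests the reported values follow this ratio.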
Phase B: Per-CPU Runnable Ownership
-
Land the first bounded per-CPU runnable queue slice. Commit
`1a8bf909` replaces the single global scheduler `VecDeque` with four per-scheduler-CPU FIFO queues under the existing global scheduler lock, centralizes enqueue/requeue/removal helpers, keeps single-owner capability processes on CPU0, prefers local work before bounded stealing, preserves direct IPC preference, and removes stale runnable entries for process/thread exit. Review fixes track live run-queue reservations, reserve all per-CPU queues to that count before publishing a new runnable thread, and release reservations on process/thread exit or pre-publication rollback, keeping timer and unblock requeue paths allocation-free after cross-CPU steals. Verification covered `run-spawn`, `run-smp2-smokes`, and controlled benchmark-VM 1/2/4/8-thread diagnostics. The default workload and total-case 64 MiB rows remain `accepted=false`, so this is structure evidence, not milestone closeout.
-
Finish
`PerCpuRunQueue` ownership invariants as a documented contract. Completed 2026-05-01 02:13 UTC in `docs/architecture/scheduling.md`: a live generation-checked `ThreadRef` has at most one runnable dispatch owner across current slots, per-CPU run queues, and the direct IPC target; migration is a scheduler-lock-contained remove-before-publish transfer; local-first stealing is bounded by the scheduler CPU slots; and live run-queue reservations keep timer, unblock, direct-IPC fallback, steal retry, and steal requeue paths allocation-free.
-
Split current-thread and runnable ownership from shared process/thread metadata without widening emergency-path allocation. Completed 2026-05-01 04:22 UTC in commit
`d7221648`: `Scheduler::processes` remains the shared process/thread metadata table, while `SchedulerDispatch` now owns per-CPU run queues, current and handoff slots, idle slots, the direct IPC target, run-queue reservation count, pending process drops, and pending thread stack releases. The existing global scheduler lock and generation checks are unchanged, and the dispatch split keeps the pre-reserved run-queue capacity model for timer, unblock, direct-IPC fallback, steal retry, and steal requeue paths. Verification passed `make fmt-check`, `cargo build --features qemu`, a cached `make run-spawn` rerun, and `make run-smp2-smokes` in `target/smp2-smokes/20260501T042343Z/`. Controlled benchmark-VM timing after merge `56458b12` stayed `accepted=false`:

| Pinning | Workers | Work Median | Total Median | Work Speedup | Total Speedup |
| --- | ---: | ---: | ---: | ---: | ---: |
| physical `0-3` | 1 | `56275842` | `140953762` | `1.000x` | `1.000x` |
| physical `0-3` | 2 | `56290542` | `153327094` | `1.000x` | `0.919x` |
| physical `0-3` | 4 | `56315094` | `237018874` | `0.999x` | `0.595x` |
| SMT `0-7` | 1 | `56258010` | `140620194` | `1.000x` | `1.000x` |
| SMT `0-7` | 2 | `56313324` | `153367860` | `0.999x` | `0.917x` |
| SMT `0-7` | 4 | `56352472` | `237971426` | `0.998x` | `0.591x` |
| SMT `0-7` | 8 | `169006414` | `727393630` | `0.333x` | `0.193x` |

-
Add a bounded timer continuation fast path before a broader scheduler lock split. Completed 2026-05-01 10:29 UTC: user-mode LAPIC timer ticks can continue the current non-idle thread without calling
`sched::schedule()` only when a previous locked timer slow path published a clean hard-work summary, the current CPU is a valid active scheduler slot, no reschedule IPI is pending for that CPU, and the per-CPU one-skip budget is not exhausted. Dirty producers still force at least one locked pass before bypass, but the 2026-05-01 11:40 UTC follow-up lets that pass classify remaining nonempty run queues and handoff-current markers as fairness/protection-only state. Direct IPC targets, deferred termination/drop/stack cleanup, Timer sleeps, and timed cap-enter/Park waiters still keep the hard slow-path bit set; ordinary ring SQEs and indefinite cap wait scans are still serviced by forced slow-path ticks. This is a correctness-first split-prep slice, not a replacement for narrower scheduler metadata locks or accepted thread-scale evidence. Controlled benchmark-VM physical-core `0-3` before/after runs for the initial strict-clean version retained `accepted=false`: baseline `target/thread-scale/timer-fastpath-baseline-main-physical-20260501T102938/` recorded work speedups `0.998x` and `0.998x` plus total speedups `0.907x` and `0.620x`; after-change `target/thread-scale/timer-fastpath-after-physical-20260501T104700/` recorded work speedups `1.001x` and `0.999x` plus total speedups `0.909x` and `0.602x`. Controlled benchmark-VM physical-core `0-3` before/after runs for the fairness-only follow-up stayed `accepted=false`: baseline `target/thread-scale/20260501T120224Z/` recorded work speedups `1.001x` and `0.999x` plus total speedups `0.913x` and `0.587x`; after-change `target/thread-scale/20260501T120709Z/` recorded work speedups `1.001x` and `1.000x` plus total speedups `1.125x` and `0.828x`.
-
Add cross-CPU wake policy for endpoint, timer, park, process wait, and thread join completions. Completed 2026-05-01 03:06 UTC: queued wakeups now target the selected per-scheduler-CPU FIFO owner instead of scanning all idle scheduler CPUs.
-
Add explicit placement evidence and placement policy for newly runnable same-process worker threads. Completed 2026-05-01 12:37 UTC, refined 2026-05-01 13:20 UTC, and repaired 2026-05-01 14:58 UTC after the blocking-parent benchmark exposed a placement regression. Measure builds emit aggregate and per-phase
`thread_placement` lines with single-owner publish buckets, normal publish buckets, caller-current publish buckets, caller-aware avoid, fallback, and strict-load fallback counts, selected CPU buckets, first-selected CPU buckets, and migration totals/targets for CPU slots 0-3. `publish_created_thread()` receives the caller thread from `ThreadSpawner.create`, keeps single-owner processes on CPU0, and avoids the caller’s current CPU only when another active ready scheduler CPU has a strictly lower non-idle dispatch load. On equal load, an active-ready caller CPU wins the tie instead of falling through to CPU0-biased least-loaded scanning; if the caller slot is unknown or ineligible, publication falls back to the least-loaded active scheduler CPU behavior. Timer, unblock, direct-IPC fallback, steal retry, and steal requeue paths keep their existing allocation-free targeting behavior.

The earlier avoid-caller policy passed the old spinning-parent 1-to-2 gate but failed the repaired blocking-parent shape: before the strict-load fix, controlled capOS evidence regressed to 1-to-2 work/total speedups `0.886x`/`0.928x` because children were biased onto the non-caller queue even when the caller CPU had equal load. The repaired benchmark shape uses blocking parent join, 262,144 blocks (16 MiB), and `work_rounds=64`. The matching Linux baseline scales on the selected physical CPU set with 1-to-4 work/total speedups `3.958x`/`3.834x`. Controlled capOS evidence on the same CPU set passed the enforced 1-to-2 work/total gates with `1.828x`/`1.687x`; the unsuppressed 1-to-4 diagnostic recorded `3.029x`/`2.386x`, and scheduler-switch-log-suppressed diagnostics recorded `3.272x`/`2.303x`. Remaining four-worker limits are now scheduler implementation issues, not benchmark-shape excuses: serial switch logging, global `Scheduler` lock contention, total-time exit/join/block/schedule overhead, and the temporary four-owner CPU mask.
-
Add bounded reschedule IPI behavior for idle-to-runnable transitions. Completed 2026-05-01 03:06 UTC: queued wakeups target at most one queue-owner CPU, direct IPC targets at most one eligible idle scheduler CPU, and measure builds emit wake scan, eligible idle CPU, target, sent, pending-skip, not-ready-skip, missing-target, and failure counters.
-
Preserve direct IPC handoff as a scheduling preference without bypassing per-CPU ownership or generation checks. Completed 2026-05-01 03:06 UTC: direct IPC still uses the single preference slot when available and falls back to the normal queued owner path when the target cannot run directly.
-
Prove process/thread exit cleanup cannot leave a stale runnable entry on any CPU queue. Completed 2026-05-01 03:14 UTC: process termination, current-process exit, and
`ThreadControl.exitThread` cleanup now assert under the scheduler lock that the exiting process or thread no longer appears in any per-scheduler-CPU FIFO or in the direct IPC target slot. The focused spawn smoke asserts the serial proof markers emitted by the exercised process/thread exit paths.
-
Rerun
`make run-thread-scale`, `make run-smp2-smokes`, ordinary smoke, spawn/thread, park, ring, and process-exit focused proofs. Completed 2026-05-01 04:18 UTC: local serial reruns passed normal `make run-thread-scale` in `target/thread-scale/scheduler-phaseb-rerun-local-normal-20260501T034800Z/` and `make run-smp2-smokes` in `target/smp2-smokes/20260501T034414Z/`. Controlled benchmark-VM reruns at main commit `87be6e25` pinned QEMU to physical-core logical CPUs `0-3` and SMT logical CPUs `0-7`; all rows remained `accepted=false`, so this closes the Phase B rerun-evidence gate but not the selected in-process speedup milestone.
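The repaired placement rule for newly created same-process threads can be sketched as a small pure function. This is a minimal illustration of the rule as described above, not the kernel's `publish_created_thread()` code: `CpuSlot`, its fields, and `choose_cpu` are hypothetical names, and returning CPU0 when no slot is active is an assumption.

```rust
/// Hypothetical per-CPU snapshot; field names are illustrative only.
#[derive(Clone, Copy)]
struct CpuSlot {
    /// Whether this scheduler CPU is active and ready to accept work.
    active_ready: bool,
    /// Non-idle dispatch load used for the comparison.
    non_idle_load: u32,
}

/// Pick the scheduler CPU slot for a newly created thread.
fn choose_cpu(slots: &[CpuSlot], caller_cpu: Option<usize>, single_owner: bool) -> usize {
    if single_owner {
        return 0; // single-owner processes stay on CPU0
    }
    // Least-loaded active CPU; `min_by_key` returns the first minimum,
    // so ties resolve toward the lowest CPU index.
    let least_loaded = slots
        .iter()
        .enumerate()
        .filter(|(_, s)| s.active_ready)
        .min_by_key(|(_, s)| s.non_idle_load)
        .map(|(i, _)| i);
    // The caller slot only counts when it is known and eligible.
    let caller = caller_cpu.filter(|&c| c < slots.len() && slots[c].active_ready);
    match (caller, least_loaded) {
        // Leave the caller CPU only for a strictly lower load elsewhere.
        (Some(c), Some(best)) if slots[best].non_idle_load < slots[c].non_idle_load => best,
        // On equal load the caller CPU wins the tie.
        (Some(c), _) => c,
        // Unknown or ineligible caller: least-loaded active CPU.
        (None, Some(best)) => best,
        // Assumption: with no active CPU at all, fall back to CPU0.
        (None, None) => 0,
    }
}
```

The strict comparison is the part the blocking-parent benchmark exposed: with `<=` instead of `<`, a child would leave the caller CPU on equal load, which matches the regression described for the earlier avoid-caller policy.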
Phase C: CPU Accounting
- Add monotonic runtime charge points when a running thread leaves the CPU
at context switch, preemption, blocking syscall, direct IPC handoff, and
thread exit. Completed 2026-05-01 05:08 UTC: running intervals are
charged with
`crate::arch::context::monotonic_ns()` when a current thread stops running through timer preemption, blocking `cap_enter`/ParkSpace, thread/process exit, and direct switch or handoff paths that select the next current thread.
- Observe blocked runtime stability at unblock without charging non-running time. Completed 2026-05-01 05:08 UTC: unblock paths check the blocked runtime snapshot before making the thread ready.
- Track per-thread runtime, virtual runtime seed, context switches,
preemptions, voluntary blocks, and migrations. Completed 2026-05-01
05:08 UTC:
`ThreadCpuAccounting` is stored on each `Thread` record and updated under the scheduler/process lock. Context switch counters increment when a thread is selected, preemptions increment only for timer-driven running-to-ready requeue, voluntary blocks increment for blocking `cap_enter` and ParkSpace waits, and migrations increment when a thread runs on a different scheduler CPU than its previous run.
- Add process/session/service aggregation only after the per-thread record
has a single ledger of record. Completed 2026-05-22 13:50 UTC: a
per-`Process` `ProcessCpuAccounting` ledger sums `runtime_ns` and a process-level `context_switches` dispatch count incrementally at the same scheduler/process-lock charge points that update `ThreadCpuAccounting`, so it captures exited threads’ contributions. Only the always-present (non-measure) per-thread quantities are rolled up; the measure-gated `preemptions`/`voluntary_blocks`/`migrations` counters are intentionally not aggregated so the default-build proof stays meaningful. The kernel emits a `sched: process_cpu_accounting pid=... runtime_ns=... context_switches=...` line at per-process exit and `make run-spawn` asserts a nonzero aggregate. Session/service aggregation remains a stretch follow-on.
- Add tests or QEMU diagnostics proving runtime increases while running and
stops while blocked. Completed 2026-05-01 05:08 UTC:
`make run-spawn` now asserts a compact scheduler proof line that requires nonzero runtime, context switches, preemptions, and voluntary blocks, plus stable blocked and exited runtime observations.
- Keep runtime accounting independent of tickless idle by using the
monotonic clocksource layer. Completed 2026-05-01 05:08 UTC: normal
accounting uses
`monotonic_ns()` and does not read `kernel/src/measure.rs` cycle counters.
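The charge-point model above can be illustrated with a small ledger that charges only closed running intervals. This is a sketch under stated assumptions: `ThreadCpuAccountingSketch` and its methods are hypothetical and are not the kernel's `ThreadCpuAccounting` layout, and the caller passes in the monotonic timestamp instead of reading a clocksource.

```rust
/// Hypothetical per-thread ledger; not the kernel's actual record layout.
#[derive(Default, Debug)]
struct ThreadCpuAccountingSketch {
    runtime_ns: u64,
    context_switches: u64,
    /// Start of the current running interval, if the thread is on a CPU.
    running_since_ns: Option<u64>,
}

impl ThreadCpuAccountingSketch {
    /// Called when the thread is selected to run.
    fn on_dispatch(&mut self, now_ns: u64) {
        self.context_switches += 1;
        self.running_since_ns = Some(now_ns);
    }

    /// Called wherever the thread leaves the CPU (preemption, blocking,
    /// handoff, exit). Charges the closed interval exactly once; a call
    /// while the thread is not running charges nothing.
    fn on_stop_running(&mut self, now_ns: u64) {
        if let Some(start) = self.running_since_ns.take() {
            self.runtime_ns += now_ns.saturating_sub(start);
        }
    }
}
```

Because the interval start is taken out of the `Option` when charging, time spent blocked between a stop and the next dispatch never reaches `runtime_ns`, which is the property the blocked-runtime stability check observes.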
Phase D: Best-Effort Fair Scheduling
Phase D accepted its Task 6 diagnostic closeout at commit 77caafc0
(2026-05-10 19:39 UTC, docs(scheduler): record phase d thread-scale gate)
and closed in docs commit 1a08ec23 (2026-05-10 21:47 UTC,
docs(scheduler): close phase d). The first
Phase D policy is weighted fair queueing on top of the existing
per-thread runtime_ns / virtual_runtime_ns accounting, with a
capability-authorized SchedulingPolicyCap for weight and latency-class
mutation. The controlled Task 6 benchmark pair passed the harness-enforced
1-to-2 work/total gates; capOS recorded 1-to-4 work/total diagnostics
3.088x / 2.700x at 4 workers versus the prior single-global-queue baseline
1.566x / 1.538x, and that 1-to-4 row was manually accepted for Phase D
closeout. The matching Linux pthread baseline on the same host and
physical-core logical CPUs 0,1,2,3 recorded 3.974x / 3.850x. EEVDF is
now a follow-on policy evaluation, not a Phase D blocker. The design content is
in
docs/proposals/scheduler-evolution-proposal.md “Phase D
first-policy decision”, “Phase D capability surface”, “Phase D
migration fairness sketch”, “Phase D test matrix”, and “Phase D
overload behavior” sections. The completed implementation plan is
archived at docs/backlog/scheduler-evolution.md.
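As a rough illustration of how a weight can feed `virtual_runtime_ns`, the sketch below scales charged runtime inversely by weight, as weighted fair queueing schemes commonly do. The formula, the `BASE_WEIGHT` constant, and the function name are assumptions made for illustration; the proposal's actual weighting rule may differ.

```rust
/// Hypothetical reference weight: a thread at this weight accrues virtual
/// runtime at the same rate as real runtime.
const BASE_WEIGHT: u64 = 1024;

/// Virtual-runtime advance for `ran_ns` of real runtime at `weight`.
/// Heavier threads accrue virtual time more slowly and so sort earlier.
fn vruntime_delta_ns(ran_ns: u64, weight: u64) -> u64 {
    // Treat a zero weight as 1 to avoid division by zero (assumption).
    let weight = weight.max(1) as u128;
    let scaled = (ran_ns as u128 * BASE_WEIGHT as u128) / weight;
    // Saturate rather than wrap if the scaled value exceeds u64.
    scaled.min(u64::MAX as u128) as u64
}
```

Under this assumed rule a thread with twice the reference weight advances its virtual runtime half as fast, so over time it receives roughly twice the CPU share of a reference-weight thread competing on the same queue.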
The bullets below retain the closed acceptance gates and the
Phase D follow-ons that should be selected explicitly. Phase E
SchedulingContext is the next scheduler authority phase, followed
by Phase F auto-nohz / SQPOLL / tickless idle; generic full-nohz
remains deferred behind those prerequisites.
- Choose initial weighted-fair or EEVDF-like policy based on accounting and
queue data. Resolved
2026-05-05 19:00 UTC: WFQ first; EEVDF deferred. See `docs/proposals/scheduler-evolution-proposal.md` “Phase D first-policy decision”.
- Add scheduler entity weights and latency class metadata through a
capability-authorized policy path, not ambient process fields.
Closed by
`docs/backlog/scheduler-evolution.md` Tasks 1-2: `SchedulingPolicyCap` schema + kernel cap, per-thread `weight`/`latency_class` fields, weighted vruntime, and caller-thread cap binding.
- Preserve fairness across CPU migration. Implementation tracked in
`docs/backlog/scheduler-evolution.md` Task 4 (vruntime travels with the thread, `virtual_finish_ns` recomputed at destination enqueue, bounded steal targets the queue whose head has the lowest `virtual_finish_ns`, matching the local pick rule of taking the front of the ascending per-CPU queue). Closed 2026-05-08 00:53 UTC: invariants made explicit on `refresh_virtual_finish_ns_locked` and at the steal-insert site; the `cfg(feature = "measure")`-gated `ThreadCpuAccounting.migrations` counter moved from the dispatch-time `scheduled_measure` path to enqueue-time `record_placement_spread_migration_locked` and `record_steal_migration_locked` arms; weight-change-while-enqueued contract proved by construction with a `debug_assert!` reinforcement in `Process::refresh_thread_virtual_finish_ns`.
- Test CPU hogs, short sleepers, direct IPC server/client pairs,
multi-process load, and same-process sibling load. Implementation
tracked in
`docs/backlog/scheduler-evolution.md` Task 5 (test matrix smokes) and Task 6 (the controlled `make run-thread-scale` evidence pair: harness-enforced 1-to-2 gates plus a manually accepted 1-to-4 diagnostic closeout row). Closed 2026-05-10 19:46 UTC: the benchmark-VM Task 6 run at commit `76025f0963a4` recorded capOS 1-to-4 work/total diagnostics `3.088x`/`2.700x`; the 1-to-2 gate stayed green at `1.809x`/`1.774x`. The matching Linux pthread baseline on the same physical-core logical CPUs `0,1,2,3` recorded `3.974x`/`3.850x`.
- Define overload behavior when runnable entities exceed the selected CPU
set or when migration cannot keep up. Resolved at the design
level
2026-05-05 19:00 UTC: soft overload uses vruntime ordering (no entity is starved); hard overload defers to Phase F `CpuIsolationLease` and Phase G `RealtimeIsland`. See `docs/proposals/scheduler-evolution-proposal.md` “Phase D overload behavior”.
- Phase D follow-on: EEVDF migration. Once the WFQ slice has
accepted thread-scale evidence, evaluate replacing the bucketed
per-CPU
`VecDeque` with an EEVDF eligibility set (`BTreeMap`-by-virtual-deadline) plus per-thread request size and lag accounting. The accounting fields, capability surface, and migration contract carry directly; the change is localized to the dispatch ordering structure. Promote to its own design slice if and when selected; do not bundle it into the WFQ first-slice plan.
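The migration-fairness rule above (steal from the queue whose head has the lowest `virtual_finish_ns`) can be sketched over plain `VecDeque`s. The `Queued` type and `pick_steal_victim` are hypothetical names, and the sketch assumes each per-CPU queue is already kept in ascending `virtual_finish_ns` order, so only queue heads need to be compared.

```rust
use std::collections::VecDeque;

/// Hypothetical queue entry; the real run-queue entry type is not shown here.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Queued {
    thread_id: u32,
    virtual_finish_ns: u64,
}

/// Return the index of the remote queue to steal from, or `None` when every
/// remote queue is empty. The scan is bounded by the number of queues.
fn pick_steal_victim(queues: &[VecDeque<Queued>], local_cpu: usize) -> Option<usize> {
    queues
        .iter()
        .enumerate()
        // Never steal from the local queue.
        .filter(|(cpu, _)| *cpu != local_cpu)
        // Only the head matters because each queue is kept ascending.
        .filter_map(|(cpu, q)| q.front().map(|head| (cpu, head.virtual_finish_ns)))
        .min_by_key(|&(_, finish)| finish)
        .map(|(cpu, _)| cpu)
}
```

Comparing heads mirrors the local pick rule of taking the front of the ascending per-CPU queue, so a stolen thread is the one that would have run next on its original CPU.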
Phase E: SchedulingContext Capability
Phase E policy follow-ups are closed. Local owner-shell logout propagation is
recorded in
scheduler-phase-e-local-owner-shell-logout-propagation.
Endpoint donation/return, timeout/depletion notifications, and the
scheduler-observable session lifecycle hook are recorded on main:
scheduler-phase-e-endpoint-donation,
scheduler-phase-e-timeout-depletion-notifications, and
scheduler-session-lifecycle-hook.
The donated-context logout policy is also closed as a conservative
counted/skipped return-path proof:
scheduler-phase-e-session-logout-donated-context-policy.
Timeout/depletion notifications now use fixed per-context notification cells
allocated at context creation/bootstrap. The ordinary non-donated
session-logout stale-context proof is complete through the
UserSession.logout() hook. In-flight endpoint donation uses the conservative
counted/skipped policy during logout and relies on endpoint RETURN/cancel to
finish the in-flight transfer/clear without returning donor budget early. Local
owner-shell exit now calls the same UserSession.logout() path on clean REPL
exit or terminal-close completion; the shell proof observes the scheduler hook
with no bound local shell SchedulingContext, while the focused
session-context proof remains the ordinary bound-context stale evidence.
- Phase E preflight: retire the transitional
`CAPOS_SCHED_DISABLE_WFQ=1` / `WakePolicy::QueueAny` single-global-queue fallback that Phase D kept for one bisect cycle. This is a scheduler-surface cleanup before `SchedulingContext` claims budget/period authority; do not treat it as an EEVDF blocker. Completed 2026-05-10 22:20 UTC: the source-level opt-out, queue-0 enqueue funnel, and `QueueAny` wake policy are gone.
- Define the first
`SchedulingContext` object shape. Phase E Task 1 adds the minimal schema/control-plane cap shape: `SchedulingContextSpec` carries budget, period, relative deadline, byte-oriented CPU mask, and overrun policy; `SchedulingContextInfo` is a read-only snapshot with `remainingBudgetNs` as derived info-only state; and the kernel/runtime expose an info-only `SchedulingContext.info()` cap stub for focused grant/discovery and client decode coverage. The `cpuMask` field is a canonical little-endian bitset: CPU `n` is bit `n % 8` of byte `n / 8`, empty means no CPUs selected, producers omit trailing zero bytes, and non-empty canonical masks end in a nonzero byte. Dispatcher budget enforcement, replenishment, bind/revoke rules, donation/return, depletion notifications, realtime islands, SQPOLL, and nohz remain deferred.
- Add capability creation/bind/revoke rules and generation identity. The
second Phase E control-plane slice keeps
`info()` method id 0 stable, adds same-interface context creation as a bounded result-cap transfer, records at most one caller-thread binding per context generation, and revokes by advancing the context generation and clearing the matching thread metadata binding. Bootstrap grants and created contexts use the same non-wrapping context-id allocator so distinct caps cannot alias the `(contextId, generation)` binding key. The focused `make run-scheduling-context` QEMU smoke proves distinct bootstrap identities, create result-cap adoption, bind/revoke, stale-generation calls, release cleanup, and the explicit `infoOnlyNoDispatchChange` dispatch-effect marker. Stale caps report `staleGeneration` and cannot mutate scheduler metadata; revoked contexts report `revoked`. Dispatch selection, WFQ ordering, runtime charging, replenishment, donation/return, timeout/depletion notification, realtime islands, SQPOLL, auto-nohz, and CPU placement enforcement remain future work.
- Enforce budget and replenishment in the kernel dispatcher. First Phase E
budget enforcement landed 2026-05-11 08:38 UTC:
`bindCallerThread()` now installs a fixed per-thread budget ledger under the scheduler/process locking model, runtime charge decrements the bound context budget at the existing dispatch charge points, runnable selection replenishes elapsed periods without allocation, and exhausted contexts stay queued but `RetryLater` until their next period. Deadline-driven accounting closed the previous periodic-tick granularity caveat on 2026-06-04: the ordinary dispatch path arms a sub-tick budget-exhaustion one-shot when the selected thread’s remaining budget would deplete before the next scheduler tick, kernel-mode one-shot fires restore a live periodic timer, nohz re-arm folds the leased thread’s budget deadline into its existing nearest deadline, and nohz budget depletion restores the periodic tick with `reason=scheduling-context-budget-throttled`. `make run-scheduling-context` proves visible charge, replenishment to full budget, stale/revoked fail-closed behavior, and a throttled wall-clock window with `dispatch_effect=budgetEnforced`; the representative 5 ms deadline marker recorded `elapsed_since_arm_ns=5474819`, `overshoot_ns=474819`, `remaining_after_ns=0`, and `bounded_charge=true`. At that slice’s landing, donation/return, depletion notifications, realtime islands, SQPOLL, auto-nohz, and CPU placement enforcement remained future work.
- Add endpoint donation/return semantics for synchronous calls and passive
services. Completed 2026-05-11 10:51 UTC: endpoint in-flight call state
now carries a bounded internal donation token when a caller with a bound
`SchedulingContext` delivers a synchronous CALL to a receiver thread without its own context. The scheduler charges pre-donation caller runtime before moving the ledger, charges passive-server runtime before returning the ledger, and returns the remaining budget to the caller before waking it when RETURN commits, commits an application exception, or fails with an invalid caller result buffer. RETURN preflight failures keep the in-flight donation intact; delivery/return cancellation paths return or clear the donation without allocating. A donor with an in-flight token is blocked from returning to userspace until the endpoint call returns or is canceled. Nested donation of an already donated context is rejected until stacked return tokens have a dedicated design. The focused `make run-scheduling-context` smoke now includes a same-process endpoint round trip with `endpoint_donation=ok`, `endpoint_return=ok`, `endpoint_exception_return=ok`, `endpoint_invalid_return=ok`, and `endpoint_nested_rejected=ok`, plus an `endpoint_donor_block=ok` delayed-server `cap_enter(0, 0)` proof, an `endpoint_donor_fast=ok` fast-return race proof, and remaining-budget fields for successful RETURN, application-exception RETURN, invalid-result RETURN, nested-donation rejection, donor blocking, and fast donor return. This is synchronous endpoint donation/return only; depletion notifications, realtime islands, SQPOLL, auto-nohz, CPU placement enforcement, and session-logout stale-context coverage remain future work.
- Add a scheduler-observable session lifecycle hook from
`UserSession.logout()` into scheduler-owned `SchedulingContext` stale-marking. The hook covers explicit logout plus the remote DTO gateway logout/connection-teardown paths that already call `UserSession.logout()`: after the liveness cell flips to logged out, the scheduler scans process/thread metadata for the same session liveness cell, removes non-donated matching bindings from its ledger, and advances the bound context generation as revoked so ordinary old grants become stale. The hook preserves the scheduler as the binding authority and avoids scheduler-lock to context-record-lock inversion by taking one binding under the scheduler lock, dropping that lock, and then marking the context stale through its cleanup token. In-flight endpoint donation bindings are explicitly skipped because returning donor budget before endpoint cancellation would violate the donor-blocking invariant. This hook unblocks focused stale-context proofs: ordinary non-donated logout, donated-context policy, and local owner-shell propagation are now closed by their dedicated task records.
- Add timeout/depletion notifications with preallocated emergency-path
storage. Completed in the timeout/depletion notification slice: every
`SchedulingContext` owns a fixed notification cell allocated at context creation/bootstrap, with coalescing slots for budget depletion and deadline/timeout, sequence counters, bounded coalesced-event counts, holder identity, donated-holder marking, remaining budget, and next timestamp snapshots. Scheduler charging, timeout/deadline observation, donation-return, and cancellation paths update only that fixed state; they do not allocate, publish result caps, append unbounded queues, or require hard-path logging. `SchedulingContext.drainNotifications()` exposes typed `ok`, `revoked`, and `staleGeneration` observer results, plus `explicitRevoke` lifecycle state. The focused `make run-scheduling-context` smoke proves repeated budget-depletion coalescing, deadline notification, explicit revoke, stale observer labels, and endpoint-donated notification accounting. A pre-armed observer waiter/wakeup path remains a separate follow-up.
- Extend stale-context proofs beyond the first revoke/generation contract to process and thread exit. The focused SchedulingContext smoke now proves that a context bound by an exiting thread becomes unbound without minting fresh budget on rebind, while process-exit and explicit process-termination children bind contexts and run the process cleanup path before cap-table release.
- Extend stale-context proofs to session logout. Completed for ordinary non-donated contexts at 2026-05-11 17:44 UTC. This remains separate from process/thread exit because logout propagation is owned by the session lifecycle surface, not the scheduler dispatch loop. The focused session-context smoke now binds a SchedulingContext in a session-owned child, calls UserSession.logout(), observes the scheduler hook line, and proves the old cap is stale before budget refresh, caller-thread rebind, result-cap publication, or metadata mutation. Process/thread exit cleanup remains covered by make run-scheduling-context.
- Prove donated receiver logout policy. Completed at 2026-05-11 18:19 UTC. Logout keeps the existing conservative counted/skipped behavior for receiver threads holding endpoint-donated SchedulingContext bindings. The focused session-context smoke has a donor call a guest-session receiver, the receiver logs out while holding the donated binding, the scheduler hook reports stale_marked=0 donation_inflight_skipped=1, the donor remains blocked in cap_enter(0, 0) until endpoint RETURN, and the donor context returns bound with reduced remaining budget rather than a refreshed or minted budget. Local owner-shell lifecycle propagation was closed separately by scheduler-phase-e-local-owner-shell-logout-propagation.
- Propagate local owner-shell exit to session logout. Completed at 2026-05-11 19:36 UTC. Clean local REPL exit and terminal-close completion now call the held UserSession.logout() before process exit, so the session liveness cell is marked logged out through the same kernel hook used by explicit logout and the remote DTO gateway. The shell smoke asserts the scheduler-observable hook line with stale_marked=0 donation_inflight_skipped=0; ordinary bound SchedulingContext stale behavior remains proven by the focused session-context smoke through the same hook. Process/thread-exit cleanup remains separate and unchanged.
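The fixed notification cell from the timeout/depletion item above can be pictured with a small Rust sketch. This is an illustration under assumptions, not the kernel's type: `NotificationCell`, `Slot`, `Event`, `record`, `drain`, and the `MAX_COALESCED` bound are hypothetical names, and holder identity, donated-holder marking, and timestamp snapshots are left out. It only shows the property the text describes: updates touch preallocated state, repeated events coalesce, and nothing allocates or queues.

```rust
/// Hypothetical saturation bound for the coalesced-event count.
const MAX_COALESCED: u32 = 255;

/// One coalescing slot; a plain fixed-size value with no heap storage.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Slot {
    pending: bool,
    sequence: u64,            // bumped on every recorded event
    coalesced: u32,           // events folded into an already-pending slot
    remaining_budget_ns: u64, // snapshot taken at the latest event
}

/// Fixed cell with one slot per notification kind named in the text.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct NotificationCell {
    depletion: Slot,
    deadline: Slot,
}

#[derive(Clone, Copy, Debug)]
enum Event {
    BudgetDepleted,
    DeadlineOrTimeout,
}

impl NotificationCell {
    /// Record an event by updating fixed state only (no allocation, no queue).
    fn record(&mut self, event: Event, remaining_budget_ns: u64) {
        let slot = match event {
            Event::BudgetDepleted => &mut self.depletion,
            Event::DeadlineOrTimeout => &mut self.deadline,
        };
        slot.sequence += 1;
        if slot.pending {
            // A second event before a drain is coalesced, with a bounded count.
            slot.coalesced = slot.coalesced.saturating_add(1).min(MAX_COALESCED);
        }
        slot.pending = true;
        slot.remaining_budget_ns = remaining_budget_ns;
    }

    /// Return a snapshot of both slots and clear the pending state.
    /// Sequence counters are kept so an observer can detect missed drains.
    fn drain(&mut self) -> (Slot, Slot) {
        let snapshot = (self.depletion, self.deadline);
        self.depletion.pending = false;
        self.depletion.coalesced = 0;
        self.deadline.pending = false;
        self.deadline.coalesced = 0;
        snapshot
    }
}
```

Recording three depletion events before a drain leaves one pending slot with `sequence == 3` and `coalesced == 2`, which mirrors the repeated budget-depletion coalescing the smoke is said to prove.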
Phase F: CPU Isolation Lease and SQPOLL
The Phase E gates and the first Ring/SQPOLL ownership prerequisite are now
closed. Dispatch through
scheduler-phase-f-auto-nohz-sqpoll
only through its own Phase F authority, telemetry, rollback, and nohz/SQPOLL
tasks; this backlog entry does not implement Phase F behavior. The concrete
ring prerequisite is
scheduler-phase-f-one-sq-consumer-ring-ownership,
closed on 2026-05-11: ring endpoints now have generation-checked syscall-mode
SQ-consumer leases, duplicate future SQPOLL acquisition is rejected while that
owner is live, stale owner generations cannot advance SQ head, teardown
releases the owner without clearing accepted completions, and bounded SQPOLL
admission metadata exists without starting a poller.
The first executable Phase F child task,
scheduler-phase-f-cpu-isolation-lease-scaffold,
closed on 2026-05-12 12:02 UTC. It is limited to CpuIsolationLease authority,
activation preflight telemetry, and rollback scaffolding. It does not enable
SQPOLL, automatic nohz, tick suppression, automatic CPU isolation, or generic
full-nohz behavior. The second executable child task,
scheduler-phase-f-nohz-activation-telemetry,
closed on 2026-05-12 14:18 UTC. It turns the disabled preflight into observable
activation/deactivation and rollback decisions while still leaving tick
suppression, SQPOLL, automatic CPU isolation, and generic full-nohz disabled.
The housekeeping/deferred-work placement child closed on 2026-05-12 18:36 UTC
by
scheduler-phase-f-housekeeping-deferred-work-placement:
the scheduler now records an explicit online housekeeping CPU placement input,
selected housekeeping mask, deferred cleanup/timer/network/IRQ/accounting
placement or rejection labels, and bounded revoke, process-exit,
service-replacement, and session-logout cleanup placement while ticks remain
periodic.
The bounded SQPOLL ring-mode child closed on 2026-05-12 20:29 UTC by
scheduler-phase-f-sqpoll-ring-mode-bounded-poller:
ring endpoints now transition explicitly through syscall, SQPOLL starting,
running, sleeping, stopping, and rollback modes; a kernelSqpoll
CpuIsolationLease admits one bounded periodic-tick poller for the caller
thread’s ring; producer wakeups use NEED_WAKEUP; stale SQ owners fail before
SQ-head consumption; and poller stop/revoke preserves accepted CQEs while
releasing SQ ownership. Actual tick suppression is blocked until the
SQPOLL progress path no longer depends on periodic scheduler ticks. The
clockevent/deadline substrate child closed on 2026-05-12 23:07 UTC by
scheduler-phase-f-clockevent-deadline-substrate:
normal QEMU/x86_64 monotonic_ns() is backed by the calibrated TSC rather
than TICK_COUNT, the periodic LAPIC tick disciplines the TSC epoch while nohz
is disabled, Timer.sleep, finite cap_enter, and park waiters store
absolute monotonic deadlines, and the LAPIC clockevent backend can program a
bounded one-shot deadline and restore periodic mode. The substrate’s firing
precision is now proven, not only its programming: the
scheduler-lapic-oneshot-subtick-firing-precision child (closed
2026-06-04 03:26 UTC, commit 49b36129) arms a TICK_NS/2 one-shot over the live
periodic timer during boot and
measures the actual countdown-to-fire instant, asserting via
make run-scheduling-context that it fires sub-tick (~5 ms for a 5 ms request,
well under the 10 ms tick) with the current-count correctly reset to the
sub-tick value – ruling out the suspected “INITIAL_COUNT write does not reset
the running countdown” root cause – and that the kernel-mode-fire periodic
restore leaves a live timer (no lost-timer hang). Automatic nohz, tick
suppression, SQPOLL nohz, generic full-nohz, and production realtime admission
remain disabled. A known, pre-existing gate flake is independent of the
firing-precision proof (which passed in 100% of measured boots): the
scheduling-context-smoke budget-timing proof exited early in ~20% of boots on
both main and this branch under host load, because its wall-clock
budget-throttle assertions are sensitive to host scheduling jitter. Until the
budget proof is stabilized (tracked as its own follow-up), run
make run-scheduling-context on an otherwise-idle host; the flake is orthogonal
to the clockevent firing assertions.
A second substrate prerequisite surfaced 2026-06-04 from
scheduler-deadline-driven-budget-accounting’s Attempt 2: even with the LAPIC
one-shot firing precisely sub-tick, the monotonic clocksource discipline floored
a sub-tick interval to a full tick. A boot probe measured a real 5.0 ms interval
advancing monotonic_ns by 10.0 ms after one discipline_clocksource_tick step
(monotonic_delta_ns=10000020 for real_ns=5000118, floored=true), because
discipline_clocksource_tick took max(tsc_interpolated, epoch + TICK_NS) on
every fire. That was the real cause of that task’s Attempt 1 “9.85 ms” – not the
LAPIC firing (fixed) and not the ordinary-path timer-ISR rechecks (which provably
no-op when no nohz/idle window is active). The prerequisite
scheduler-monotonic-clocksource-subtick-discipline
closed it (2026-06-04): discipline_clocksource_tick now trusts the TSC
interpolation at sub-tick granularity, falling back to the TICK_NS floor only
when the interpolated advance is below MIN_DISCIPLINED_ADVANCE_NS (TICK_NS / 8)
so a degenerate (stalled/backward/mis-calibrated-slow) TSC still keeps a minimum
forward rate; the tick-derived fallback is unchanged. A boot proof
(context::qemu_clocksource_subtick_discipline_proof, emitted on
make run-scheduling-context) runs one real TICK_NS / 2 discipline step and
asserts monotonic_ns() tracked the sub-tick delta – measured
monotonic_delta_ns=5055612 for real_ns=5000474 (floored=false,
subtick_tracked=true). Deadline-driven budget accounting and generic full-nohz
can now observe a sub-tick deadline through the accounting clock.
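The discipline change described above can be sketched as two pure functions. This is a simplified illustration, not the kernel's `discipline_clocksource_tick`: the function names `discipline_floored` and `discipline_subtick` are hypothetical, the real step carries more state (calibration, tick-derived fallback), and `TICK_NS` is taken as 10 ms from the tick length quoted earlier.

```rust
/// Periodic tick length; the text above gives a 10 ms tick.
const TICK_NS: u64 = 10_000_000;
/// Threshold named in the text: TICK_NS / 8.
const MIN_DISCIPLINED_ADVANCE_NS: u64 = TICK_NS / 8;

/// Earlier behavior as described: every step took
/// max(tsc_interpolated, epoch + TICK_NS), so a sub-tick interval
/// was floored to a full tick.
fn discipline_floored(epoch_ns: u64, tsc_interpolated_ns: u64) -> u64 {
    tsc_interpolated_ns.max(epoch_ns + TICK_NS)
}

/// Later behavior as described: trust the TSC interpolation at sub-tick
/// granularity and fall back to the TICK_NS floor only when the
/// interpolated advance is implausibly small (stalled, backward, or slow TSC).
fn discipline_subtick(epoch_ns: u64, tsc_interpolated_ns: u64) -> u64 {
    let advance = tsc_interpolated_ns.saturating_sub(epoch_ns);
    if advance < MIN_DISCIPLINED_ADVANCE_NS {
        epoch_ns + TICK_NS
    } else {
        tsc_interpolated_ns
    }
}
```

With an epoch of 0 and an interpolated reading of 5,000,118 ns, the floored variant reports a full 10,000,000 ns advance while the sub-tick variant reports 5,000,118 ns. The measured values quoted above include small extra offsets that this sketch does not model.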
The SQPOLL nohz-progress child closed on 2026-05-13 00:06 UTC by
scheduler-phase-f-sqpoll-nohz-progress:
cap_enter now has a bounded current-thread SQPOLL service entry for
producer wakes and syscall kicks that borrows the SQPOLL owner lease, charges
the admitted accounting target, and reports non-periodic progress evidence
while ordinary periodic service remains active. Automatic policy-service nohz
issuance and production realtime admission remain future work; generic SQPOLL
nohz for explicitly leased caller-thread rings landed in the later Step 14
slice.
The tickless-idle child closed on 2026-05-23 09:12 UTC by
scheduler-tickless-idle-step6:
the CPL0 idle loop now admits an idle-only tickless window when no non-idle
work is runnable, no nohz lease is active, no local deferred cleanup is
pending, no cap-enter polling dependency is present, and the LAPIC one-shot
clockevent plus monotonic clocksource are available. The periodic tick is
restored before non-idle dispatch and on rollback. Legacy cap-enter polling
surfaces, including the terminal shell path, remain periodic until they gain
explicit deadline or housekeeping placement.
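The tickless-idle admission conditions listed above read naturally as a fail-closed predicate. The sketch below is an assumption-laden illustration: `IdleCpuState`, its field names, `tickless_idle_refusal`, and the refusal labels are hypothetical and do not come from the kernel source. It only encodes the conditions stated in the paragraph.

```rust
/// Hypothetical per-CPU snapshot taken by the idle loop before admission.
#[derive(Clone, Copy, Debug)]
struct IdleCpuState {
    runnable_non_idle: usize,
    nohz_lease_active: bool,
    deferred_cleanup_pending: bool,
    cap_enter_polling_dependency: bool,
    oneshot_clockevent_available: bool,
    monotonic_clocksource_available: bool,
}

/// Returns `None` when an idle-only tickless window may be admitted,
/// or the first refusal reason otherwise (the tick then stays periodic).
fn tickless_idle_refusal(s: &IdleCpuState) -> Option<&'static str> {
    if s.runnable_non_idle > 0 {
        return Some("runnable-non-idle-work");
    }
    if s.nohz_lease_active {
        return Some("nohz-lease-active");
    }
    if s.deferred_cleanup_pending {
        return Some("deferred-cleanup-pending");
    }
    if s.cap_enter_polling_dependency {
        return Some("cap-enter-polling-dependency");
    }
    if !s.oneshot_clockevent_available {
        return Some("no-oneshot-clockevent");
    }
    if !s.monotonic_clocksource_available {
        return Some("no-monotonic-clocksource");
    }
    None
}
```

Under this reading, a legacy cap-enter polling surface such as the terminal shell path sets `cap_enter_polling_dependency` and therefore keeps its CPU periodic.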
- Define CpuIsolationLease authority separately from CPU-time budget. Completed 2026-05-12 12:02 UTC by docs/tasks/done/2026/scheduler-phase-f-cpu-isolation-lease-scaffold.md.
- Add scheduler activation proof for housekeeping, deferred cleanup, timers, networking, IRQ affinity, live accounting target, one-SQ-consumer state, and revocation latency. The scaffold reports blocked eligibility and leaves ticks/nohz/SQPOLL disabled.
- Enforce one live SQ consumer per ring before SQPOLL. Completed 2026-05-11 by docs/tasks/done/2026/scheduler-phase-f-one-sq-consumer-ring-ownership.md.
- Integrate SQPOLL ring mode only after this ownership prerequisite and docs/tasks/done/2026/scheduler-phase-f-housekeeping-deferred-work-placement.md have landed. Completed 2026-05-12 20:29 UTC by docs/tasks/done/2026/scheduler-phase-f-sqpoll-ring-mode-bounded-poller.md.
- Add lease revocation on explicit revoke, process exit, service replacement, and session close. Completed by the focused make run-scheduler-cpu-isolation-lease proof.
- Add nohz activation/deactivation telemetry. Completed 2026-05-12 14:18 UTC by docs/tasks/done/2026/scheduler-phase-f-nohz-activation-telemetry.md. The proof records active-candidate rejection, stale/revoked rollback, ready housekeeping CPUs under -smp 4, exactly-one-runnable target CPU evidence, deferred cleanup/timer/network/IRQ labels, valid accounting targets, explicit clocksource/accounting readiness or refusal, live syscall SQ-consumer state, revocation-latency policy, and disabled tick/SQPOLL/full-nohz guardrails.
- Assign housekeeping and deferred-work placement before behavior.
Completed 2026-05-12 18:36 UTC by docs/tasks/done/2026/scheduler-phase-f-housekeeping-deferred-work-placement.md. The proof keeps ticks periodic and leaves SQPOLL, automatic CPU isolation, and generic full-nohz disabled.
- Add bounded SQPOLL ring mode only after housekeeping/deferred-work placement. Completed 2026-05-12 20:29 UTC by docs/tasks/done/2026/scheduler-phase-f-sqpoll-ring-mode-bounded-poller.md. The proof covers one poller owner, bounded polling, stale queue-owner rejection, wake/sleep ordering, and teardown without losing completions while periodic ticks remain active.
- Add clockevent/deadline substrate before automatic nohz activation. Completed 2026-05-12 23:07 UTC by docs/tasks/done/2026/scheduler-phase-f-clockevent-deadline-substrate.md. It split clocksource reads from clockevent programming, added a one-shot/restore timer backend, and converted tick-count waiters to absolute monotonic deadlines while ordinary scheduling remains periodic.
- Add SQPOLL nohz progress that does not depend on periodic scheduler ticks. Completed 2026-05-13 00:06 UTC by docs/tasks/done/2026/scheduler-phase-f-sqpoll-nohz-progress.md. The proof preserves the one-SQ-consumer, NEED_WAKEUP, bounded polling, stale-owner rollback, and teardown/completion invariants while keeping periodic fallback service active.
- Add automatic nohz activation only after placement, bounded SQPOLL
behavior, the deadline substrate, and non-periodic SQPOLL progress. Completed 2026-05-14 09:01 UTC by docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md. The CpuIsolationLease activation preflight now performs real per-CPU periodic-tick suppression for the narrow single-runnable-entity window (namedRing = none compute lease on the preflight CPU): it masks the periodic LAPIC tick and arms a bounded one-shot deadline at min(nearest pending timer wakeup, now + max revocation latency). Network polling and IRQ affinity stay read-only fail-closed admission gates – any ring-coupled or device-owning mode keeps the conservative refusal. Every disqualifying change (stale lease generation, a second runnable entity, stealable sibling work, a local deferred-cleanup dependency, a target-CPU mismatch, or a one-shot backend that can no longer arm a deadline) rolls the CPU back to the periodic tick first. The make run-scheduler-cpu-isolation-lease proof asserts the activation and rollback log lines. Generic full-nohz and the broader SQPOLL-driven nohz state machine landed in later slices.
- Measured suppressed-tick proof on the lease path (harness-hardening).
Completed 2026-06-02 19:53 UTC by docs/tasks/done/2026-06-02/scheduler-cpu-isolation-measured-suppressed-tick-proof.md. Closes the review-identified honesty gap that the lease path proved suppression only by the tick_suppression=active periodic_tick=masked marker plus a no-hang progress loop, never that periodic timer interrupts actually stopped arriving. The kernel now counts genuine periodic LAPIC fires per CPU (account_timer_fire in the timer ISR increments only when neither the lease-backed nor idle tick-suppression bit is set, so the one-shot replacement is never miscounted), snapshots the count at activation, and on rollback emits cpu-isolation: nohz suppressed-ticks cpu=<n> window_ns=<w> expected_periodic=<e> actual_periodic=<a> suppressed=<e-a>; a bounded post-rollback cpu-isolation: nohz restored-rate line proves the periodic rate returns. The demo holds a childless compute lease on CPU 0 across a ~150 ms masked window, then a busy restore window; the harness asserts a masked window with actual_periodic near zero (expected_periodic >= 10, suppressed >= 8) and a restored window with actual_periodic tracking expected_periodic (>= 8). No activation behavior changed; the mask/one-shot mechanism is untouched. A durable ticks_suppressed{cpu,mode} telemetry field on a monitoring/status surface remains future work.
- Timeout-based auto-revoke primitive on
CpuIsolationLease. Landed via docs/tasks/done/2026-05-30/scheduler-cpu-isolation-lease-timeout-auto-revoke.md. Adds leaseLifetimeNs @6 to CpuIsolationLeaseSpec (0 = no expiry, preserving every existing producer); read_spec clamps to a one-hour ceiling and rejects a non-zero lifetime below maxRevocationLatencyNs (invalidSpec). A lease records expires_at_ns at creation; the first observation past the deadline auto-revokes through the existing generation-advancing cleanup (reason=lease-expired, registry unregister, SQPOLL stop, rollback_nohz_for_lease) and every subsequent info/activationPreflight/revoke reports staleGeneration. The nohz activation record carries the lifetime deadline so a tickless CPU under a lease that crosses its lifetime rolls back at the next timer/IPI recheck (lease-lifetime-expired disqualifier), bounded by maxRevocationLatencyNs. make run-scheduler-cpu-isolation-lease asserts the expiry release line, the post-expiry staleGeneration, and the invalidSpec rejection.
- Enable tickless idle only when there is no runnable non-idle work and no
cap-enter polling dependency. Completed 2026-05-23 09:12 UTC by docs/tasks/done/2026/scheduler-tickless-idle-step6.md. The idle path masks the periodic LAPIC tick only for true idle, arms a bounded one-shot at the nearest Timer/ParkSpace deadline or 100 ms housekeeping floor, and restores periodic mode before ordinary work. Ready-but-budget-throttled SchedulingContext retry windows remain periodic so budget replenishment and deadline notification timing stay on the existing scheduler accounting path.
- Keep automatic full-nohz behind the completed one-SQ-consumer ownership prerequisite and the narrower CpuIsolationLease telemetry/rollback proof. Generic full-nohz is not the first Phase F implementation task.
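The arithmetic behind the measured suppressed-tick line in the list above can be checked with a short sketch. Everything here is an assumption for illustration: `SuppressedTickReport`, `suppressed_tick_report`, and `masked_window_ok` are hypothetical names, and deriving `expected_periodic` as `window_ns / TICK_NS` with a 10 ms tick is a guess at how the expected count is computed, not something the backlog states. Only the masked-window thresholds (`expected_periodic >= 10`, `suppressed >= 8`) come from the text.

```rust
/// Assumed periodic tick length (10 ms, as quoted elsewhere in this backlog).
const TICK_NS: u64 = 10_000_000;

/// Fields mirror the `nohz suppressed-ticks` log line.
#[derive(Debug, PartialEq)]
struct SuppressedTickReport {
    window_ns: u64,
    expected_periodic: u64,
    actual_periodic: u64,
    suppressed: u64,
}

/// Build the report from the periodic-fire counter snapshotted at
/// activation and read again at rollback.
fn suppressed_tick_report(
    window_ns: u64,
    fires_at_activation: u64,
    fires_at_rollback: u64,
) -> SuppressedTickReport {
    // Assumption: expected count is the window length in whole ticks.
    let expected_periodic = window_ns / TICK_NS;
    let actual_periodic = fires_at_rollback.saturating_sub(fires_at_activation);
    SuppressedTickReport {
        window_ns,
        expected_periodic,
        actual_periodic,
        suppressed: expected_periodic.saturating_sub(actual_periodic),
    }
}

/// Masked-window thresholds quoted in the text.
fn masked_window_ok(r: &SuppressedTickReport) -> bool {
    r.expected_periodic >= 10 && r.suppressed >= 8
}
```

Under these assumptions, a 150 ms masked window in which one genuine periodic fire slipped through gives `expected_periodic = 15`, `actual_periodic = 1`, and `suppressed = 14`, which passes the masked-window check. A window in which all 15 ticks still arrived gives `suppressed = 0` and fails it.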
Phase F.5: Full-SMP Hardware Scalability
This phase is the planning slot for the next visible SMP milestone when the project is ready to answer whether capOS uses 16/32-core machines well. It does not replace the current Installable System selected milestone and should not be dispatched as a QEMU-only benchmark cleanup. QEMU remains regression infrastructure; the primary performance record should come from direct capOS execution on a dedicated high-core perf runner or bare-metal/cloud-bare-metal machine.
- Replace temporary four-owner scheduler assumptions with dynamic CPU topology: discovered scheduler CPU set, physical-core versus SMT sibling labeling, APIC id mapping, per-CPU allocation sizing, and boot/status output that makes the selected CPU set auditable.
- Add or select the APIC backend needed for high-core machines. xAPIC MMIO can remain the current low-core path, but x2APIC selection is the likely larger-APIC-id follow-up from docs/research/x2apic-and-virtualization.md.
- Shrink scheduler shared-state serialization. Local pick/requeue should avoid one global scheduler-lock critical section where possible, while shared process/thread metadata, blocking waiters, direct IPC handoff, timers/deadlines, and cleanup keep explicit ownership and rollback rules.
- Add topology-aware placement and observable migration policy. The record should distinguish local enqueue, cross-core wake, steal, SMT sibling placement, failed placement, reschedule IPI, and TLB-shootdown costs.
- Build the hardware benchmark profile from existing benchmark proposals: static map/reduce, uneven dynamic task pool, barrier phase loop, independent processes, same-process threads, and one capability-call/service-bound workload. Each workload reports work-window and total-time rows at 1/2/4/8/16/32 workers when hardware exists.
- Record matching native Linux rows on the same machine, plus capOS raw artifacts with source commit, toolchain, topology, frequency/isolation policy, run count, warmup policy, verifier output, medians, variance, speedup, efficiency, and scheduler counters.
Phase G: Realtime Islands
- Define RealtimeIsland admission inputs: scheduling contexts, memory reservations, device/IRQ reservations, communication paths, CPU leases, and overrun policy.
- Add a small local-audio or synthetic periodic-control proof before robotics or provider workloads.
- Prove no allocation, blocking endpoint call, paging, or logging on the admitted realtime path.
- Record deadline misses and overrun handling as observable output.
Phase H: Policy Service
- Define a privileged scheduler policy service interface for admission, budget/profile updates, CPU lease grant/revoke, and diagnostics.
- Keep kernel fallback scheduling independent of policy-service liveness.
- Add manifest/config hooks for default profiles without making policy changes require kernel rebuilds.
- Add operator diagnostics that explain why a thread or island was denied, throttled, migrated, or revoked.
- Define how stateful task/job graph assignment metadata maps into scheduler policy inputs: graph priority to weight/latency class, graph deadline to request freshness or admission input, graph budget to SchedulingContext reference, and graph queue to policy-service placement. The graph coordinator must not mint CPU authority by itself.
- Design the user-space policy-service AutoNoHz placement heuristic for ordinary threads that appear capable of utilizing a full CPU core. The policy service synthesizes the “thread appears capable of utilizing a full CPU core” decision from a future monitoring/status surface and issues a bounded
CpuIsolationLease against a pre-authorized account or session CPU pool. The lease is placement only; it does not mint CPU-time authority. Required bounds on every auto-issued lease: lifetime shorter than admin-issued leases by default and renewable only by re-observing the signal; max_revocation_latency_ns bounded by NoHzEligibility; accounting target a live SchedulingContext or coarse ResourceLedger; CPU set restricted to the operator-declared auto-claim pool; priority-aware fairness preemption that terminates the lease (not just rolls back tick suppression) on arrival of an equal-or-higher priority runnable entity. Prerequisites:
(a) a timeout-based auto-revoke primitive on CpuIsolationLease – LANDED 2026-05-30 as leaseLifetimeNs @6 (0 = no expiry) with enforced first-observation auto-revoke and a lease-lifetime-expired nohz rollback; the auto-claim placement lease can now be granted with a bounded lifetime. The bounded renew half LANDED as CpuIsolationLease.renew @4, which pushes the deadline forward by at most the original lifetime while keeping the lease’s identity / accounting / nohz state, leaving only the renewal-by-re-observation heuristic (when to call renew) to Phase H;
(b) the monitoring/status surface that exports per-thread saturation observation – LANDED 2026-05-30 as the non-measure per-thread saturation status surface. voluntary_blocks and preemptions were promoted out of cfg(feature = "measure"), an always-built runnable_accumulated_ns runnable-but-not-running accumulator was added (stamped at the run-queue enqueue chokepoint, accumulated at selection), and all three plus runtime_ns are exported through SchedulingPolicyCap.snapshot @2 (proof make run-thread-fairness: hog voluntary_blocks=0 with live preemptions/runnable_ns). migrations stays measure-gated. This read-side surface exports raw cumulative counters only; windowing and the saturation decision remain policy-service work;
(c) the pool-grant authority shape that lets an operator pre-authorize an account’s auto-claim pool.
Declared-pool descriptor LANDED 2026-05-30: the CpuIsolationLeaseSpec carries poolId @7 (0 = the implicit default pool over every scheduler CPU), the kernel seeds a fixed declared-pool registry (CpuIsolationPoolDescriptor: default pool 0 plus one declared non-default pool 1 over a single CPU), and read_spec admits a lease only when its poolId is declared and its allowedCpuMask is a subset of the pool’s CPU mask – echoing the admitting pool’s id/mask through CpuIsolationLeaseInfo (proof make run-scheduler-cpu-isolation-lease: nondefault_pool=invalidSpec (undeclared id), declared_pool=ok admitted_pool_id=1 admitted_pool_cpu_mask_subset=true, declared_pool_mask_violation=invalidSpec, default_pool_id=0).
Manifest-sourced pool table LANDED 2026-05-30: the declared-pool registry is sourced from the boot manifest SystemConfig.cpuIsolationPools @14 (each entry a CpuIsolationPoolDescriptor), with the in-kernel constant as the fail-closed default when the manifest omits/empties the list; the kernel validates each entry fail-closed at boot (canonical CPU mask subset of the scheduler mask, default pool 0 synthesized if omitted, duplicate ids rejected) and emits cpu-isolation: declared-pools source=manifest count=3 ... (proof make run-scheduler-cpu-isolation-lease; kernel-default fallback proven by cargo test-config decode/empty assertions).
Per-pool live-lease capacity bound LANDED 2026-05-31: CpuIsolationPoolDescriptor carries poolMaxLeases @2 (0 = unbounded); a non-zero value caps the number of simultaneously live (non-revoked, current-generation) leases the kernel admits against that pool at create-time, counted from the existing LEASE_REGISTRY after prune_dead, rejecting an over-capacity create fail-closed resourceExhausted.
The manifest bounds pool 2 at poolMaxLeases: 2; the proof admits two live leases, refuses a third (cpu-isolation: pool-capacity-rejected admitted_pool_id=2 live_leases=2 pool_max_leases=2 result=resourceExhausted, pool_capacity_exceeded=resourceExhausted), and reclaims after a revoke (pool_capacity_reclaimed=ok) – live-count, not cumulative. This is the count+reject mechanism the per-account N policy keys onto.
Account identity + per-account N LANDED 2026-05-31: CpuIsolationLeaseSpec carries accountId @8 :UInt64 (0 = unattributed, caller-asserted and inert until counted, echoed read-only through CpuIsolationLeaseInfo.accountId @6) and CpuIsolationPoolDescriptor carries poolMaxLeasesPerAccount @3 :UInt32 (0 = unbounded per account). After the pool-wide check, register counts the requesting account’s live entries (admitted_pool_id AND account_id both matching) against the per-account bound and rejects an over-bound create fail-closed resourceExhausted (0 account or 0 bound skips the gate). The manifest bounds pool 2 at poolMaxLeasesPerAccount: 1; the proof admits one account-7 lease, refuses a second account-7 create (cpu-isolation: account-capacity-rejected admitted_pool_id=2 account_id=7 account_live_leases=1 pool_max_leases_per_account=1 result=resourceExhausted, account_capacity_exceeded=resourceExhausted), admits a different account-9 lease on that CPU (account_capacity_other_account=ok – per-account, not pool-wide), and reclaims after revoking account-7 (account_capacity_reclaimed=ok). The account id is caller-asserted, not yet authenticated.
Bootstrap pool-grant authentication LANDED 2026-05-31: CpuIsolationPoolGrant (schema/capos.capnp, source cpu_isolation_pool_grant, kernel kernel/src/cap/cpu_isolation_pool_grant.rs) introduced a bootstrap-staged grant binding one authenticated account to one declared pool. createLease stamps the bound account/pool onto the minted lease, overriding any caller-asserted accountId/poolId, and reuses the exact lease-create admission path (cpu_isolation::create_lease_for_caller), so the per-account bound is unforgeable: a holder can no longer assert another account to evade poolMaxLeasesPerAccount. The initial proof used one account-7/pool-2 grant; the current manifest-sourced proof below exercises multiple seeded grants.
Manifest-declared multi-account grant table LANDED 2026-06-01: the grant binding is now operator-declared via SystemConfig.cpuIsolationPoolGrants (schema/capos.capnp, decoded in capos-config, seeded at boot by cpu_isolation_pool_grant::seed_pool_grants after seed_declared_pools), mirroring the manifest-sourced cpuIsolationPools table; the cpu_isolation_pool_grant/cpu_isolation_pool_grant_secondary sources stage seeded binding index 0/1, so a manifest can pre-authorize multiple distinct (account, pool) grants, each staged as its own bootstrap cap. An absent/empty list falls back to one in-kernel binding at index 0: account 7 bound to preferred pool 1 when active, otherwise account 7 bound to synthesized default pool 0, so manifest-sourced pool tables that omit pool 1 still stage a usable default grant. Proof make run-scheduler-cpu-isolation-pool-grant now boots a two-entry grant table (account 5/pool 1, account 8/pool 2), holds both grant caps, and proves each stamps its OWN bound account (pool-grant: create ok bound=A stamped_account_id=5 .../bound=B stamped_account_id=8 ...) with the per-account bound still enforced fail-closed under the manifest-sourced path; boot evidence cpu-isolation: pool-grants source=manifest count=2.
Fallback proof make run-scheduler-cpu-isolation-pool-grant-default boots a manifest-sourced pool table that declares pool 2 and omits pool 1 plus an empty grant list; the kernel stages one default grant as (account 7, pool 0) and the smoke proves it can mint a stamped lease.
Runtime grant minting landed (CpuIsolationGrantMinter): one cap mints a fresh CpuIsolationPoolGrant for an operator-chosen (account, pool) at call time, bounded by the declared SystemConfig.cpuIsolationGrantMinterAllowlist (an out-of-allowlist mint is refused unauthorized, so it is never an ambient grant-any authority; the minted grant reuses the same unforgeable createLease admission path). The same run-scheduler-cpu-isolation-pool-grant smoke now also mints a grant for the allowed (account 6, pool 2), proves its createLease stamps account 6 and stays bounded by the per-account gate, and proves an out-of-allowlist (account 99, pool 2) mint is refused; boot evidence cpu-isolation: grant-minter-allowlist source=manifest count=1.
Grant-revocation lifecycle landed (CpuIsolationGrantMinter.revokeGrant): a runtime-minted grant gets a revocable (grantId, generation) identity; revokeGrant(grantId) advances the grant generation so a stale grant handle’s createLease fails staleGeneration, and cascades to every live lease minted through it – reusing the landed fairness-termination cleanup (reason=grant-revoked, periodic-tick rollback, registry unregister) so the per-pool/per-account live-lease capacity frees immediately and a fresh grant is admitted into the reclaimed slot. Double-revoke is alreadyRevoked and an unknown grantId is unknownGrant, both fail-closed. The same run-scheduler-cpu-isolation-pool-grant smoke proves the full lifecycle. This closes Track C (prerequisite (c)) – operator grant authority is now mint + revoke complete. Detailed design in docs/proposals/tickless-realtime-scheduling-proposal.md “Policy-Service Userstories: AutoNoHz Placement for Compute-Capable Threads”.
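The declared-pool admission order described in the status bullet above (pool declared, CPU mask subset, pool-wide live count, then per-account live count) can be sketched as follows. This is a minimal model under assumptions, not the kernel's `register` or `LEASE_REGISTRY`: `Registry`, `Pool`, `LiveLease`, and `AdmitError` are hypothetical types, and generations, pruning, and grant stamping are omitted.

```rust
#[derive(Debug, PartialEq)]
enum AdmitError {
    InvalidSpec,       // undeclared pool or mask outside the pool
    ResourceExhausted, // pool-wide or per-account live bound reached
}

struct Pool {
    id: u32,
    cpu_mask: u64,
    max_leases: u32,             // 0 = unbounded
    max_leases_per_account: u32, // 0 = unbounded per account
}

struct LiveLease {
    pool_id: u32,
    account_id: u64,
}

struct Registry {
    pools: Vec<Pool>,
    live: Vec<LiveLease>,
}

impl Registry {
    /// Admit a lease fail-closed; bounds count live entries, not cumulative ones.
    fn register(
        &mut self,
        pool_id: u32,
        account_id: u64,
        allowed_cpu_mask: u64,
    ) -> Result<(), AdmitError> {
        let pool = self
            .pools
            .iter()
            .find(|p| p.id == pool_id)
            .ok_or(AdmitError::InvalidSpec)?;
        // The lease mask must be a subset of the pool's CPU mask.
        if (allowed_cpu_mask & !pool.cpu_mask) != 0 {
            return Err(AdmitError::InvalidSpec);
        }
        let pool_live = self.live.iter().filter(|l| l.pool_id == pool_id).count() as u32;
        if pool.max_leases != 0 && pool_live >= pool.max_leases {
            return Err(AdmitError::ResourceExhausted);
        }
        // Account 0 (unattributed) or a 0 bound skips the per-account gate.
        if account_id != 0 && pool.max_leases_per_account != 0 {
            let account_live = self
                .live
                .iter()
                .filter(|l| l.pool_id == pool_id && l.account_id == account_id)
                .count() as u32;
            if account_live >= pool.max_leases_per_account {
                return Err(AdmitError::ResourceExhausted);
            }
        }
        self.live.push(LiveLease { pool_id, account_id });
        Ok(())
    }

    /// Drop one live lease for (pool, account); capacity is reclaimed at once.
    fn revoke(&mut self, pool_id: u32, account_id: u64) -> bool {
        let found = self
            .live
            .iter()
            .position(|l| l.pool_id == pool_id && l.account_id == account_id);
        match found {
            Some(index) => {
                self.live.remove(index);
                true
            }
            None => false,
        }
    }
}
```

With a pool bounded at two live leases and one per account, an account-7 create succeeds, a second account-7 create is refused, an account-9 create succeeds, and account 7 is admitted again after its first lease is revoked. That matches the live-count (not cumulative) behavior the proofs describe.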
AutoNoHz Decomposition: Roadmap to Full Auto-NoHz
The status bullet above narrates what landed. This subsection is the discrete dispatchable decomposition from the current landed state to full operator-driven auto-nohz, so the path is written as concrete slices rather than “future work” prose. Grounding: the proposal’s “Policy-Service Userstories: AutoNoHz Placement”, “Bounds the policy service must enforce”, “Telemetry Requirements”, and Implementation Sequence steps 7/14/17.
Landed substrate (not repeated below): the narrow manual per-CPU LAPIC
tick-mask for the single-runnable compute window and the SQPOLL-coupled
window, tickless idle, prerequisite (a) leaseLifetimeNs @6 timeout
auto-revoke, prerequisite (b) the SchedulingPolicyCap.snapshot @2
saturation observation surface, and prerequisite (c) pool-grant authority now
mint + revoke complete (the manifest-declared multi-account
cpuIsolationPoolGrants @15 table, runtime grant minting through
CpuIsolationGrantMinter, and the grant-revocation lifecycle that cascades to
minted leases). Fairness lease termination (Track D) and a measured
suppressed-tick proof have also landed, as have network-poll and IRQ-affinity
housekeeping routing, kernel-side generic full-nohz admission for ordinary
budgeted compute threads, and generic SQPOLL nohz admission for explicitly
leased caller-thread rings. What the name “auto nohz” still oversells today:
there is no production policy service, and broader userspace-poller/device-queue
issuance remains future work. Each remaining slice below closes one of those.
Conflict-domain note: every kernel slice here shares
resource:scheduler-cpu-isolation and writes kernel/src/cap/cpu_isolation*
or kernel/src/sched.rs, so they serialize against each other – dispatch
the chain head first; the rest convert from this list into
docs/tasks/ records as their depends_on closes. Slices marked
ready have a task record under docs/tasks/; the rest stay here
until their prerequisite lands.
Next increment (decomposed 2026-06-04 00:18 UTC; updated 2026-06-07 after
generic SQPOLL nohz landed): Track C, Track D, and the measured suppressed-tick
proof are all landed, and the ordinary-thread and SQPOLL-ring kernel admission
leaves are now done.
Records under docs/tasks/ capture:
scheduler-cpu-isolation-lease-renewal-on-reobservation (renewal residual),
scheduler-nohz-irq-affinity-housekeeping-routing,
scheduler-nohz-network-poll-housekeeping-routing,
scheduler-deadline-driven-budget-accounting, and
scheduler-generic-full-nohz-arbitrary-threads as done. The remaining
operator-driven AutoNoHz capstone is the policy service.
These scheduler CPU-isolation slices serialize against each other on
resource:scheduler-cpu-isolation but are parallel-safe against the in-flight
Phase C network-stack lane, so the scheduler lane stays runnable whenever Phase
C 7c holds the kernel cap/ surface.
Track C – complete operator grant authority (prerequisite (c) residual):
- `scheduler-cpu-isolation-runtime-grant-minting` – behavior, normal, LANDED 2026-06-02 22:24 UTC. One cap (`CpuIsolationGrantMinter`) mints a fresh `CpuIsolationPoolGrant` for an operator-chosen `(account, pool)` at call time, bounded by the declared `SystemConfig.cpuIsolationGrantMinterAllowlist` (an out-of-allowlist pair is refused `unauthorized`), instead of only the boot-seeded table. The minted grant reuses the same unforgeable `createLease` admission path. Proof `make run-scheduler-cpu-isolation-pool-grant`. depends_on: manifest-multi-account grant table (landed).
- `scheduler-cpu-isolation-grant-revocation-lifecycle` – behavior, normal, LANDED 2026-06-03 17:11 UTC. `CpuIsolationGrantMinter.revokeGrant` revokes a runtime-minted grant by advancing its `(grantId, generation)` so later `createLease` through the stale handle fails `staleGeneration` and mints nothing; revocation cascades to every live lease minted through that grant, driving the landed fairness-termination cleanup (`reason=grant-revoked`, periodic-tick rollback, registry unregister) once per tagged lease so per-pool/per-account capacity frees immediately (a fresh grant’s lease is admitted into the reclaimed slot in the proof). Double-revoke is `alreadyRevoked`, unknown `grantId` is `unknownGrant`, seeded grants stay un-revocable. Closes Track C. Proof `make run-scheduler-cpu-isolation-pool-grant`. depends_on: `scheduler-cpu-isolation-runtime-grant-minting` (landed), `scheduler-cpu-isolation-priority-aware-lease-termination` (landed).
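
The generation-advancing revocation described for Track C can be illustrated with a small host-side model. This is a minimal sketch, not the kernel's code: `GrantTable`, `GrantHandle`, `GrantEntry`, and the `NotRevocable` error name are hypothetical, and only the `staleGeneration`, `alreadyRevoked`, and `unknownGrant` outcomes come from the text above.

```rust
use std::collections::HashMap;

/// Outcome names mirror the text; `NotRevocable` is a hypothetical name
/// for "seeded grants stay un-revocable".
#[derive(Debug, PartialEq, Eq)]
pub enum GrantError {
    StaleGeneration,
    AlreadyRevoked,
    UnknownGrant,
    NotRevocable,
}

/// A handle remembers the generation that was current when it was minted.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GrantHandle {
    pub grant_id: u32,
    pub generation: u64,
}

struct GrantEntry {
    generation: u64,
    revoked: bool,
    seeded: bool,
    live_leases: Vec<u64>, // lease ids minted through this grant
}

#[derive(Default)]
pub struct GrantTable {
    grants: HashMap<u32, GrantEntry>,
    next_lease_id: u64,
}

impl GrantTable {
    /// Registers a grant and returns a handle stamped with generation 0.
    pub fn mint(&mut self, grant_id: u32, seeded: bool) -> GrantHandle {
        let entry = GrantEntry { generation: 0, revoked: false, seeded, live_leases: Vec::new() };
        self.grants.insert(grant_id, entry);
        GrantHandle { grant_id, generation: 0 }
    }

    /// Admits a lease only when the handle's generation is still current.
    pub fn create_lease(&mut self, handle: GrantHandle) -> Result<u64, GrantError> {
        let entry = self.grants.get_mut(&handle.grant_id).ok_or(GrantError::UnknownGrant)?;
        if entry.revoked || entry.generation != handle.generation {
            return Err(GrantError::StaleGeneration);
        }
        let lease_id = self.next_lease_id;
        self.next_lease_id += 1;
        entry.live_leases.push(lease_id);
        Ok(lease_id)
    }

    /// Advances the generation and returns the leases the cascade terminates.
    pub fn revoke_grant(&mut self, grant_id: u32) -> Result<Vec<u64>, GrantError> {
        let entry = self.grants.get_mut(&grant_id).ok_or(GrantError::UnknownGrant)?;
        if entry.seeded {
            return Err(GrantError::NotRevocable);
        }
        if entry.revoked {
            return Err(GrantError::AlreadyRevoked);
        }
        entry.revoked = true;
        entry.generation += 1;
        Ok(std::mem::take(&mut entry.live_leases))
    }
}
```

In this model the caller of `revoke_grant` would run the per-lease cleanup once for each returned lease id, which is where the capacity described above would be freed.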
Track D – fairness preemption (proposal fairness_preemption):
- `scheduler-cpu-isolation-priority-aware-lease-termination` – behavior, normal, LANDED 2026-06-02 21:17 UTC. On arrival of an equal-or-higher policy-priority runnable on the leased CPU when no other CPU authorized by both the admitted pool and the lease `allowedCpuMask` is eligible, the kernel now terminates (revokes) the lease itself at the existing nohz rollback site (`fairness-preempted ... result=lease-terminated`), not just restores the periodic tick, bounded by `maxRevocationLatencyNs`. The recheck compares the static WFQ policy priority (`latency_class`, `weight`) of the arriving entity against the captured leased thread; a strictly-lower arrival or an eligible sibling CPU inside both masks keeps the existing tick-restore-only behavior. The termination runs the same generation-advancing cleanup `leaseLifetimeNs` expiry uses (`reason=fairness-preempted`) immediately after the scheduler restores the periodic tick, so a subsequent `info`/`revoke` reports `staleGeneration` and placement/account capacity is freed without waiting for the holder’s next cap call. Proven in `make run-scheduler-cpu-isolation-lease` (default pool `0` with `allowedCpuMask=0x01`: an equal-priority sibling terminates and capacity is reclaimed, a strictly-lower sibling restores only). Out: no re-placement onto an eligible sibling CPU (the “no sibling eligible” condition is recorded; actual migration is generic-full-nohz work). depends_on: auto-nohz-activation (landed).
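
The Track D recheck reduces to a two-input decision: how the arriving priority compares with the leased thread's, and whether any sibling CPU inside both masks is eligible. The sketch below is illustrative only. `PolicyPriority`, `FairnessOutcome`, and `on_runnable_arrival` are hypothetical names, and the lexicographic ordering of (`latency_class`, `weight`) with larger values meaning higher priority is an assumption, not something the text above specifies.

```rust
/// Assumption: priorities compare lexicographically, larger = higher.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct PolicyPriority {
    pub latency_class: u8,
    pub weight: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum FairnessOutcome {
    /// Existing behavior: restore the periodic tick, keep the lease.
    RestoreTickOnly,
    /// Restore the tick and then terminate (revoke) the lease.
    TerminateLease,
}

/// Decides what happens when a runnable arrives on the leased CPU.
/// `eligible_mask` marks CPUs that could take the arriving work right now.
pub fn on_runnable_arrival(
    arriving: PolicyPriority,
    leased: PolicyPriority,
    leased_cpu: u32,
    pool_mask: u64,
    allowed_cpu_mask: u64,
    eligible_mask: u64,
) -> FairnessOutcome {
    // Sibling CPUs must be authorized by BOTH masks and differ from the leased CPU.
    let leased_bit = 1u64.checked_shl(leased_cpu).unwrap_or(0);
    let siblings = pool_mask & allowed_cpu_mask & !leased_bit;
    let sibling_eligible = (siblings & eligible_mask) != 0;

    if arriving >= leased && !sibling_eligible {
        FairnessOutcome::TerminateLease
    } else {
        FairnessOutcome::RestoreTickOnly
    }
}
```

With `allowed_cpu_mask = 0x01` and the lease on CPU 0 (the shape of the proof configuration above), `siblings` is always empty, so only the priority comparison decides the outcome.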
Lease lifetime renewal (proposal lifetime_ns renewal residual):
- `scheduler-cpu-isolation-lease-renewal-on-reobservation` – behavior, normal, landed. `CpuIsolationLease.renew @4` pushes `expires_at_ns` forward to `now + leaseLifetimeNs` (clamped to the same one-hour ceiling `read_spec` enforces), keeping the same `(leaseId, generation)`, accounting binding, and nohz activation state. Callable only before expiry: a revoked, auto-revoked, or past-deadline lease stays stale (`staleGeneration`) and is not resurrected, and an unbounded `leaseLifetimeNs = 0` (or factory) lease reports `notRenewable`. The renewed deadline is propagated to a tickless CPU’s nohz activation record (`renew_nohz_lifetime_deadline_for_lease`) so the `lease-lifetime-expired` disqualifier no longer rolls it back at the old deadline. `CpuIsolationLeaseInfo.expiresAtNs` echoes the deadline read-only. This is the kernel primitive the policy service uses to renew an auto-issued lease by re-observing the saturation signal; the re-observation heuristic itself stays Phase H policy-service work. Proof `make run-scheduler-cpu-isolation-lease`. depends_on: timeout-auto-revoke (landed).
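
The renewal rules above can be sketched as a single function over a simplified lease record. This is a sketch under stated assumptions: the `Lease` struct, `RenewError`, and `MAX_LEASE_LIFETIME_NS` are hypothetical names, the order of the checks is a guess, treating `now == expires_at_ns` as already expired is an assumption, and factory leases are not modeled.

```rust
/// One-hour ceiling from the text, expressed in nanoseconds (hypothetical constant name).
pub const MAX_LEASE_LIFETIME_NS: u64 = 3_600_000_000_000;

#[derive(Debug, PartialEq, Eq)]
pub enum RenewError {
    StaleGeneration,
    NotRenewable,
}

pub struct Lease {
    /// 0 means unbounded, which the text says is not renewable.
    pub lease_lifetime_ns: u64,
    pub expires_at_ns: u64,
    pub revoked: bool,
}

impl Lease {
    /// Pushes the deadline to `now + lifetime`, clamped to the ceiling.
    /// Returns the new deadline; never resurrects a stale lease.
    pub fn renew(&mut self, now_ns: u64) -> Result<u64, RenewError> {
        if self.revoked {
            return Err(RenewError::StaleGeneration);
        }
        if self.lease_lifetime_ns == 0 {
            return Err(RenewError::NotRenewable);
        }
        // Assumption: reaching the deadline exactly already counts as expired.
        if now_ns >= self.expires_at_ns {
            return Err(RenewError::StaleGeneration);
        }
        let lifetime = self.lease_lifetime_ns.min(MAX_LEASE_LIFETIME_NS);
        self.expires_at_ns = now_ns.saturating_add(lifetime);
        Ok(self.expires_at_ns)
    }
}
```

A policy loop in this model would call `renew` only after re-observing saturation and would simply stop calling it to let the lease expire at its last deadline.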
Honesty / telemetry (proposal Telemetry ticks_suppressed{cpu,mode}):
- `scheduler-cpu-isolation-measured-suppressed-tick-proof` – harness-hardening, normal, LANDED 2026-06-02 19:53 UTC (docs/tasks/done/2026-06-02/scheduler-cpu-isolation-measured-suppressed-tick-proof.md). A kernel expected-vs-actual periodic-tick counter (`account_timer_fire`, counted only when no tick-suppression bit is set) over a bounded nohz window is asserted in `make run-scheduler-cpu-isolation-lease` (`cpu-isolation: nohz suppressed-ticks ...` plus a `restored-rate` line), so the proof shows the periodic tick actually stopped firing, not only that the mask write was issued and the CPU made progress. Closed the review-identified honesty gap. A durable `ticks_suppressed{cpu,mode}` telemetry field on a monitoring/status surface remains future work. depends_on: auto-nohz-activation (landed).
Step 7 – network poll housekeeping/deadline routing:
- `scheduler-nohz-network-poll-housekeeping-routing` – behavior, normal, landed 2026-06-04 04:48 UTC. The in-kernel virtio-net poll (`virtio::poll_scheduler`) now routes off a lease-isolated (tickless) CPU: it consults `sched::current_cpu_lease_nohz_active()` and skips, emitting a bounded `cpu-isolation: network-poll routed ... result=skipped-on-isolated-cpu` record, while the always-ticking housekeeping CPU the admission requires keeps the poll progressing. The `network_polling` admission gate flips from the hard `rejected-periodic-network-polling-not-routed-to-housekeeping` refusal to a housekeeping-conditioned `routed-periodic-network-polling-to-housekeeping-cpu` admit (eligibility accepts the `routed-` prefix), and fails closed (`rejected-network-polling-no-housekeeping-cpu-to-relocate`) when no housekeeping CPU exists. The admitted `named_ring=None` lease carries the routed label tick-suppressed; the `CallerThread` compute-with-ring lease’s network refusal is removed but it stays `ForcedPeriodic` because IRQ affinity routing is the separate slice below. Proof `make run-scheduler-cpu-isolation-lease`; regression `make run-net`. depends_on: housekeeping-deferred-work-placement (landed), auto-nohz-activation (landed).
- `scheduler-nohz-irq-affinity-housekeeping-routing` – behavior, normal, landed (docs/tasks/done/2026-06-04/). The activation path reroutes an opting-in leased CPU’s legacy IO-APIC redirection-entry destinations onto the selected housekeeping CPU (mask-before-reprogram + read-back, restored on rollback/revoke) before admitting tick suppression, and keeps the conservative `rejected-irq-affinity-not-routed-to-housekeeping` refusal for a ring-coupled IRQ dependency that cannot be safely rerouted. Proof `make run-scheduler-cpu-isolation-lease` (`irq-affinity ok ... routed_admitted=true restored_on_revoke=true residual_forced_periodic=true`); DDF `run-interrupt-grant`/`run-devicemmio-grant` stay green. Scoped to a quiescent housekeeping destination: under the in-kernel KVM irqchip, reprogramming an IO-APIC redirection-entry destination onto an actively-scheduling CPU stalls that CPU’s forward progress, so the live reroute is gated to a focused proof lease (reroute sentinel `maxRevocationLatencyNs`) whose destination is idle. A general busy-destination reroute remains future work behind a destination-quiescence gate or a non-KVM-irqchip delivery backend. depends_on: auto-nohz-activation (landed).
Step 14 – generic SQPOLL nohz for arbitrary rings:
- `scheduler-generic-sqpoll-nohz-arbitrary-rings` – behavior, normal, done 2026-06-07. The SQPOLL nohz state machine now admits explicitly leased caller-thread rings when the SQPOLL worker is live, the ring is running/sleeping with a non-stale owner, exactly one SQ consumer is present, and producer wake/deadline rollback are bounded. The focused `make run-scheduler-generic-sqpoll-nohz` proof drives eligible entry, producer wake, SQPOLL service, rollback, and stale-owner rejection. Broader `AutoUserspacePoller` userspace-poller/device-queue issuance remains future policy-service work. depends_on: auto-nohz-sqpoll (landed), `scheduler-nohz-network-poll-housekeeping-routing`.
Generic full-nohz for arbitrary threads (the kernel half of “auto”):
- `scheduler-generic-full-nohz-arbitrary-threads` – behavior, normal, done 2026-06-06. Ordinary budgeted compute threads can now enter full-nohz through an explicit `SchedulingContext`-targeted `CpuIsolationLease` when the single-runnable, budget-deadline, housekeeping, network-poll, IRQ-affinity, timer, lifetime, and rollback gates all pass. Missing thread budget, multiple runnable work, revoked or expired leases, unrouted dependencies, and no-housekeeping cases still fail closed. Issuance is still policy-service future work; this is only the kernel admission half. depends_on: `scheduler-cpu-isolation-priority-aware-lease-termination`, `scheduler-nohz-network-poll-housekeeping-routing`, `scheduler-nohz-irq-affinity-housekeeping-routing`.
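
The fail-closed admission described above amounts to "every gate must pass, and the first failure is the refusal reason". The following is a minimal sketch of that shape only. The `Gate` enum, `GATE_ORDER`, and `admit` are hypothetical, and the evaluation order simply follows the order in which the text lists the gates; the kernel's actual order and result labels are not stated here.

```rust
/// Gate names taken from the list in the text; the enum is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Gate {
    SingleRunnable,
    BudgetDeadline,
    Housekeeping,
    NetworkPoll,
    IrqAffinity,
    Timer,
    Lifetime,
    Rollback,
}

/// Assumed evaluation order (the order used in the prose above).
pub const GATE_ORDER: [Gate; 8] = [
    Gate::SingleRunnable,
    Gate::BudgetDeadline,
    Gate::Housekeeping,
    Gate::NetworkPoll,
    Gate::IrqAffinity,
    Gate::Timer,
    Gate::Lifetime,
    Gate::Rollback,
];

/// Admits full-nohz only if every gate passes.
/// On refusal, reports the first gate that failed.
pub fn admit(passes: impl Fn(Gate) -> bool) -> Result<(), Gate> {
    for gate in GATE_ORDER {
        if !passes(gate) {
            return Err(gate);
        }
    }
    Ok(())
}
```

Because there is no "partially admitted" result, a new gate added to `GATE_ORDER` in this model defaults to refusing until its predicate is wired up, which matches the fail-closed intent.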
Step 17 – user-space AutoNoHz policy service (capstone):
- `scheduler-autonohz-policy-service-saturation-local-proof` – behavior, normal, done 2026-06-07. A userspace AutoNoHz policy-service smoke now holds an operator-declared `CpuIsolationPoolGrant`, consumes `SchedulingPolicyCap.snapshot @2` runtime / runnable / voluntary-block / preemption counters, denies a voluntarily blocking worker, issues a bounded full-nohz lease only after a local saturation window, renews only after re-observing saturation, and proves stopped-renewal expiry leaves fallback periodic scheduling intact. The proof records the grant-stamped account/pool and the single allowed CPU mask that the kernel admitted. depends_on: `scheduler-cpu-isolation-runtime-grant-minting`, `scheduler-cpu-isolation-lease-renewal-on-reobservation`, `scheduler-cpu-isolation-priority-aware-lease-termination`.
- `scheduler-autonohz-production-policy-daemon` – behavior, normal, blocked. Replace the local smoke’s fixed single-process proof with a privileged reusable policy daemon: profile-driven smoothing/window selection, cross-process target discovery, operator policy plumbing, structured observability, and revocation/non-renewal decisions for multiple accounts and pools. The landed local proof keeps this future work replaceable without ABI churn. depends_on: `scheduler-autonohz-policy-service-saturation-local-proof`.
Independent hardening (makes auto-nohz budget-safe):
- `scheduler-deadline-driven-budget-accounting` – behavior, normal, done 2026-06-04. Charge `SchedulingContext` budget at monotonic-deadline granularity rather than per-periodic-tick so an auto-nohz thread cannot overshoot its budget by a full tick quantum while the tick is masked. Closes the “enforcement remains periodic-tick granularity” caveat that auto-nohz made load-bearing; the task ledger is `docs/tasks/done/2026-06-04/scheduler-deadline-driven-budget-accounting.md`. depends_on: Phase E budget enforcement (landed), `scheduler-lapic-oneshot-subtick-firing-precision` (done), `scheduler-monotonic-clocksource-subtick-discipline` (done).
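
To make the granularity difference concrete, the sketch below charges a budget from monotonic timestamps and derives the absolute deadline a one-shot timer would need, next to a helper that computes how far past the budget a tick-only check would land. All names (`BudgetState`, `charge`, `exhaustion_deadline_ns`, `tick_granularity_overshoot_ns`) are hypothetical and the model ignores everything except elapsed time.

```rust
/// Simplified budget record; field names are illustrative only.
pub struct BudgetState {
    pub remaining_ns: u64,
    pub last_charged_ns: u64,
}

impl BudgetState {
    /// Charges the monotonic time elapsed since the last charge.
    /// Returns true once the budget is exhausted.
    pub fn charge(&mut self, now_ns: u64) -> bool {
        let elapsed = now_ns.saturating_sub(self.last_charged_ns);
        self.remaining_ns = self.remaining_ns.saturating_sub(elapsed);
        self.last_charged_ns = now_ns;
        self.remaining_ns == 0
    }

    /// Absolute time at which a one-shot timer should fire so the
    /// thread is stopped exactly when the budget runs out.
    pub fn exhaustion_deadline_ns(&self) -> u64 {
        self.last_charged_ns.saturating_add(self.remaining_ns)
    }
}

/// Overshoot when exhaustion is only noticed at the next periodic tick
/// (ticks assumed aligned to the start of the budget; `tick_ns` must be > 0).
pub fn tick_granularity_overshoot_ns(budget_ns: u64, tick_ns: u64) -> u64 {
    let ticks_needed = (budget_ns + tick_ns - 1) / tick_ns; // ceiling division
    ticks_needed * tick_ns - budget_ns
}
```

For a 2.5 ms budget and a 1 ms tick, the tick-only check in this model notices exhaustion at 3 ms, 0.5 ms late, while the deadline stays fixed at start + 2.5 ms no matter how often `charge` runs in between.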
Cleanup: Retire Benchmark-Driven Scaffolding Before Phase E
This section captures simplification work identified during the post-thread-scale
SMP/threading architecture review on 2026-05-01 23:20 EEST. None of these items
are regressions: the affected code is correct, gated behind the measure
feature where it should be, and was added intentionally during attribution and
placement slices that closed the In-Process Threading Scalability milestone.
They are recorded here so the next selected scheduler milestone does not extend
or formalize speculative SMP scaffolding that the current per-CPU WFQ scheduler
does not need.
The cleanup is subordinate to the current selected milestone and to
already-open review-finding task records. Pick it up as Phase E preflight work
before SchedulingContext claims the scheduler surface. Each removal must
preserve the documented runnable-ownership invariants from
docs/architecture/scheduling.md (single dispatch owner per live ThreadRef
across per-CPU current/handoff_current slots, the per-CPU WFQ run queues,
and the direct IPC target; scheduler-lock-contained migration; allocation-free
timer/unblock/direct-IPC-fallback/requeue/steal-requeue paths) and the recorded
benchmark-only counter policy. The 2026-05-02 per-CPU run-queue collapse and
the accepted 2026-05-10 Phase D WFQ reintroduction are now both historical
evidence: the single-global-queue shape had accepted 1-to-2 evidence but a
1-to-4 diagnostic gap (capOS 1.566x/1.538x vs Linux 3.963x/3.858x),
and Phase D manually accepted the 2026-05-10 per-CPU WFQ 1-to-4 diagnostic
(capOS 3.088x/2.700x; matching Linux 3.974x/3.850x on the same pin
set) after the harness-enforced 1-to-2 gates stayed green.
Grounding read before any slice:
docs/architecture/scheduling.mddocs/proposals/scheduler-evolution-proposal.mddocs/proposals/smp-proposal.mddocs/backlog/smp-phase-c.mdkernel/src/sched.rskernel/src/process.rskernel/src/measure.rskernel/src/arch/x86_64/{smp.rs,lapic.rs,percpu.rs,tlb.rs}
Acceptance rule for every slice below: each removal must land with a host or QEMU test that fails without it, so a future reintroduction is explicit authority work rather than silent regression of an undocumented feature.
-
2026-05-02 08:07 UTC: Retired the timer continuation fast path, its per-CPU skip budget, and the slow-path-required mirror flags. Deleted
try_continue_current_on_timer_tick,mark_timer_slow_path_required,reset_current_cpu_timer_fast_path_skip_count,note_timer_slow_path_completed_locked(both feature variants),scheduler_has_hard_timer_slow_path_work_locked_excluding_endpoint_queue,scheduler_timer_slow_path_reasons_locked, theTimerBlockedWaiterKind/blocked_thread_*helpers, and the four atomic mirrorsTIMER_SLOW_PATH_REQUIRED,TIMER_FAST_PATH_SKIP_COUNTS,CURRENT_NON_IDLE_CPUS, andTIMER_FAST_PATH_MAX_CONSECUTIVE_SKIPS.set_current_thread_lockedno longer publishesCURRENT_NON_IDLE_CPUS. The timer interrupt entry inkernel/src/arch/x86_64/context.rsnow always callscrate::sched::schedule(context)instead of trying the lock-free fast path. Eightmark_timer_slow_path_required()call sites inkernel/src/sched.rs(run-queue publish, pending process drop, park-with-deadline, process termination queue, direct-IPC handoff, timer sleep enqueue, cap-enter-with-deadline, pending thread stack release, pending endpoint cancellation push) also dropped — they are no-ops once the fast path no longer exists. Verified thatmake run-spawnexits cleanly ([init] Spawn cap-table exhaustion check ok.,proc: process 2 exited with code 0,sched: last process exited, halting) andmake run-smokeruns the scripted login flow to operator session.cargo build --features qemuis warning-free (project rule). 
Reintroduce the fast path only if a future Phase D or Phase F slice ships an evidence pair where it measurably reduces scheduler-lock hold time on a contended SMP run.

Follow-up partial 2026-05-02 08:39 UTC: `kernel/src/measure.rs` lost the eight public API entry points (`timer_fast_path_attempt`, `timer_fast_path_continue`, `timer_fast_path_slow_required_fallback`, `timer_fast_path_skip_budget_fallback`, `timer_fast_path_pending_reschedule_fallback`, `timer_fast_path_no_current_non_idle_fallback`, `timer_fast_path_inactive_invalid_cpu_fallback`, and `timer_slow_summary`) plus the now-orphaned `TimerSlowSummaryReasons` struct and its `requires_slow_path` impl. `cargo build --features qemu,measure` is back to warning-free.

Follow-up complete 2026-05-02 21:00 UTC: the deeper deletion slice removed the seven `TIMER_FAST_PATH_*` static counters, the `TimerCounter::FastPath*` enum variants, the `TimerSlowSummaryCounter` enum, the `TIMER_SLOW_SUMMARY_*` counter arrays (`TIMER_SLOW_SUMMARY_COUNTER_VALUES`, `CASE_START_TIMER_SLOW_SUMMARY_COUNTERS`, `PREVIOUS_TIMER_SLOW_SUMMARY_COUNTERS`, `PHASE_TIMER_SLOW_SUMMARY_COUNTERS`), the `(TimerSlowSummaryCounter, &str)` reporting table, the `Snapshot.timer_slow_summary_counters` field, and the matching reset/diff/print helpers and accessors. `TIMER_COUNTER_COUNT` shrank from 11 to 4 (interrupts, user_scheduler, kernel_only, bsp_tick_advances). The `measure: timer ...` line is now compact and the `measure: timer_slow_summary ...` line is no longer emitted at all. `tools/qemu-thread-scale-harness.sh` dropped the `fast_path_*` clauses and the `timer_slow_summary` aggregate / per-phase grep checks in the same slice, satisfying the "removal must land with a host or QEMU test that fails without it" acceptance rule. 
Verified with `make fmt-check`, `cargo build --features qemu` (warning-free), `cargo build --features qemu,measure` (warning-free), `cargo test-lib` (171 passed), `make run-spawn`, and `make run-measure` (proof line emitted, exit 0). A local one-iteration `CAPOS_THREAD_SCALE_RUNS=1 CAPOS_THREAD_SCALE_GUEST_MEASURE=1 make run-thread-scale` was used solely as functional verification of the harness parser against the new measure-output shape (no CPU pinning, single iteration; the run reported `qemu taskset cpus: none` and the resulting medians/speedups are diagnostic only). This slice is a measure-output cleanup, not a scheduler-structure change, so it does not require controlled benchmark-VM timing evidence under the Phase A "before/after each scheduler structure change" rule; the harness fail-without-the-kernel-change pairing is the acceptance gate. -
2026-05-01 22:01 UTC: Collapsed the asymmetric scheduler CPU sizing.
`MAX_SCHEDULER_CPUS = 64` was deleted, `MAX_SCHEDULER_CLEANUP_CPUS = 4` was renamed to a single `SCHEDULER_CPUS = 4`, and `SchedulerDispatch.current[]` was resized from 64 to `SCHEDULER_CPUS` to match `run_queues`, `handoff_current`, `idle_pids`, `idle_threads`, `pending_thread_stack_release`, `TIMER_FAST_PATH_SKIP_COUNTS`, and `SCHEDULER_CPU_MASK`. The dual `current_cpu_slot()`/`current_cleanup_slot()` helpers collapsed into a single `current_cpu_slot()` that bounds-checks against `SCHEDULER_CPUS` and panics on overflow with `"scheduler: CPU id {} exceeds scheduler-owned mask"`. `scheduler_cpu_slot(cpu_id) -> Option<usize>` is retained for the non-panicking lookup. The earlier “raw CPU id 0..63 vs scheduler slot 0..3” indexing distinction is gone. Reintroduce a wider id-to-slot mapping only when a Phase D/F slice grows the scheduler-owned mask beyond the current four. Verified with `cargo build --features qemu` and `cargo build --features qemu,measure` (both warning-free) plus `make run-smoke` and `make run-spawn` on 2026-05-01. -
2026-05-02 09:26 UTC: Replaced the per-CPU run-queue array with a single global
`run_queue: VecDeque<ThreadRef>`. `SchedulerDispatch` keeps `run_queue_live_reservations` as a single counter; the `reserve_run_queue_capacity_for_thread_locked` / `release_run_queue_capacity_reservations_locked` / `push_reserved_run_queue_locked` triple still bounds growth but operates on the single queue. `enqueue_ready_thread_on_cpu_locked`, `run_queue_target_cpu_locked`, the `created_thread_target_cpu_locked` placement chain (`active_ready_scheduler_cpu_mask`, `non_idle_dispatch_load_locked`, `least_loaded_scheduler_cpu_*`, `caller_current_scheduler_cpu_slot_locked`), the `CreatedThreadPublishPolicy` / `CreatedThreadTarget` types, the `scheduler_cpu_scan_order` helper, and the `crate::measure::thread_placement_publish_caller_*` reporting surface are all gone. `WakePolicy::QueueCpu(usize)` collapsed to `WakePolicy::QueueAny`. `wake_idle_scheduler_cpus_locked` walks eligible idle scheduler CPUs and stops only after the first one that accepts a fresh reschedule IPI; CPUs that already have a pending IPI (or that fail LAPIC delivery) are skipped without breaking, so a burst of ready work cross-wakes more than one neighbor for both queue and direct-target wakes. `publish_created_thread` no longer takes a `caller_thread` argument and no longer emits a per-CPU placement record: under the single global queue there is no per-CPU publish target, and hard-coding CPU0 misclassified normal worker publishes as single-owner-CPU0. Phase D later reintroduced the per-CPU split without restoring those publish counters; reintroduce them only through a separate operator-observability slice.

Verified with `cargo build --features qemu` and `cargo build --features qemu,measure` (both warning-free) plus `make run-spawn` and `make run-smoke`. 
A post-collapse 3-run diagnostic `make run-thread-scale` on the benchmark VM (`taskset 0,1,2,3`, enforcement disabled) on 2026-05-02 10:42 UTC measured 1-to-2 work/total `1.890x`/`1.792x` (slight improvement over the pre-collapse 1-to-2) and 1-to-4 work/total `1.504x`/`1.436x` (clear regression vs the pre-collapse 1-to-4): single-queue scheduler-lock contention dominates at 4 workers. The numbers live in `docs/benchmarks.md` as diagnostic. Phase D later brought per-CPU queues back with a fair-share enqueue policy and formal accepted evidence (capOS plus Linux baseline, full enforcement, multiple runs, recorded host caveats). -
2026-05-02 07:00 UTC: Lifted endpoint-cancellation retry storage out of the scheduler lock. The
`pending_endpoint_cancellations: VecDeque` field is gone from `Scheduler`; it now lives in a dedicated `static PENDING_ENDPOINT_CANCELLATIONS: Lazy<Mutex<VecDeque<...>>>` with bounded `try_reserve_exact(MAX_PENDING_ENDPOINT_CANCELLATIONS)` reservation, eagerly forced in `init_idle` via `Lazy::force` so the allocation never lands in a timer/exit cleanup path. The queue’s `len()` under its own mutex is the single source of truth for `pending_endpoint_cancellations` non-emptiness. Producers (`queue_pending_endpoint_cancellation`, `remove_pending_endpoint_cancellations_for_pid`, `remove_pending_endpoint_cancellations_for_thread`) and the drain (`drain_pending_endpoint_cancellations`) take only the queue mutex; the scheduler lock is acquired only briefly inside `queue_pending_endpoint_cancellation` to validate the target thread is live and has a ring scratch. `defer_endpoint_cancellation` previously re-acquired the scheduler lock just to push to the fallback queue; that re-acquisition is gone.

`note_timer_slow_path_completed_locked` (consumer) holds the queue mutex across both the `!is_empty()` check and the `TIMER_SLOW_PATH_REQUIRED.store`, and the producer `queue_pending_endpoint_cancellation` stores `TIMER_SLOW_PATH_REQUIRED = true` inside the queue lock alongside its push, so a concurrent producer cannot push between the consumer's read and store and have its slow-path mark be overwritten. The functional contract is preserved: a cancellation that cannot deliver immediately because the target ring scratch is contended still falls back to the bounded retry queue, still raises `TIMER_SLOW_PATH_REQUIRED`, and is still drained on the next scheduler tick. 
Bound is unchanged (`MAX_PENDING_ENDPOINT_CANCELLATIONS = MAX_CAP_SLOTS * MAX_ENDPOINT_CANCELLATION_OBJECT_SWEEPS * MAX_ENDPOINT_CANCEL_NOTIFICATIONS_PER_ENDPOINT * SCHEDULER_CPUS`); the open size-tightening question (whether the `SCHEDULER_CPUS` multiplier is still load-bearing now that producers no longer hold the scheduler lock) is deferred to a future slice with bench evidence. A possible follow-on slice would move retry storage to per-endpoint bounded slots so each endpoint object owns its own queue, but that requires reshaping the `(thread, user_data)` payload to be addressable from an endpoint object and is non-trivial. The current move is sufficient to get the storage out of the scheduler lock and unblock future scheduler-lock-hold-time analysis. Verified with `cargo build --features qemu` and `cargo build --features qemu,measure` (both warning-free) plus `make run-spawn` and `make run-smoke` on 2026-05-02. Review found and fixed a Lazy-init in interrupt paths and a slow-path-clearing race against producer publication. -
2026-05-01 21:38 UTC: Feature-gated the first
ThreadCpuAccountingexperiment end-to-end behindcfg(feature = "measure"). That slice temporarily compiled the whole accounting record, its accessors, and scheduler call sites only when the feature was enabled. Phase D later superseded this temporary shape:runtime_ns,virtual_runtime_ns, andlast_started_nsare now unconditional normal-build fields because WFQ ordering,SchedulingPolicyCap.snapshot, andSchedulingContextbudget charging depend on them. The remaining diagnostic counters (context_switches,preemptions,voluntary_blocks,migrations,last_cpu, blocked/exited stability observations, placement buckets, and per-phase attribution counters) stay behindcfg(feature = "measure"). The 2026-05-01 slice was verified withcargo build --features qemuandcargo build --features qemu,measure(both warning-free) plusmake run-spawn(non-measure default) on 2026-05-01.make run-measurewas broken onmainat the time of this slice for unrelated reasons; that regression was repaired on2026-05-02 20:23 UTC(seedocs/backlog/scheduler-evolution.mdand thedocs/changelog.mdMeasure Mode Repair entry). -
2026-05-01 21:02 UTC: Retired the
RUNNABLE_PROCESS_EXIT_CLEANUP_PROOF_PRINTED,RUNNABLE_THREAD_EXIT_CLEANUP_PROOF_PRINTED, andCPU_ACCOUNTING_PROOF_PRINTEDonce-flag log lines along with theirAtomic*gating booleans, the threeprint_*_once/maybe_print_*_for_thread_lockedhelpers inkernel/src/sched.rs, and their four call sites. The runnable-cleanup invariants remain enforced by the unconditionalassert_no_runnable_pid_entry_lockedandassert_no_runnable_thread_entry_lockedpanics already inkernel/src/sched.rs; a regression that leaves stale runnable owner state still panics the kernel and failsmake run-spawn. Thetools/qemu-spawn-smoke.shharness lost its three matchinggrep -Fqlines for the same reason. The orphanedProcess::account_thread_exited_stable_observed/ThreadCpuAccounting::observe_exited_stablehelpers were deleted with the print; the remainingThreadCpuAccountingwrites stay untouched for the upcoming feature-gate slice. Thepub fn thread_cpu_accountingaccessor moved behindcfg(feature = "measure")because its only remaining caller is the measure-gatedaccount_thread_selected_lockedplacement counter bridge. -
Cache the active CPU id in the per-CPU GS-relative slot.
arch::percpu::current_cpu_idreads the LAPIC ID MMIO register and then linearly scansCPU_LAPIC_IDS[0..64]on every call. The timer fast-path consumer was retired on 2026-05-02 (see the “Retired the timer continuation fast path” entry above), but the function still runs from the syscall path and from non-syscall kernel contexts:arch::context::advance_bsp_tick, the scheduler’s CPU-slot accounting and dispatch lookups insched.rs,arch::tlb::flush_pending_for_current_cpu, andmem::paginginvalidation paths. The hot caller is the syscall entry path; the non-syscall callers are why a drop-in GS-relative replacement is harder than the cleanup item first suggested. The single-movlookup conceptually wantsmov %gs:offset, %eax, but the slice is blocked on a kernel-mode GS-base invariant: today the kernel setsKernelGsBaseviaset_kernel_gs_baseand only the syscall assembly doesswapgsto makegs:0..16resolve at PerCpu while handling a syscall. In normal kernel context (timer ISR, scheduler from non-syscall paths, paging init, AP bring-up), the active GS base is whatever Limine left, not the PerCpu address. A drop-in replacement ofcurrent_cpu_idwithgs:[offset]therefore faults outside syscall context (verified 2026-05-02: reorderinginit_bspto setKernelGsBasebeforeset_kernel_entry_stackis necessary but not sufficient because the active GS base is still not the PerCpu address). The enabling work is establishing a kernel-mode invariant that GS_BASE = PerCpu in CPL0 (typically byswapgs-ing on every kernel entry/exit, including interrupt handlers), or by adopting a hybrid: GS-relative read in the syscall path plus the existing LAPIC-based path everywhere else. Both paths are larger than a single retirement slice and should land with their own gates. Until then this item stays open andcurrent_cpu_idkeeps the LAPIC MMIO +CPU_LAPIC_IDSscan. -
Reassess the scheduler-lock-site instrumentation breadth.
`SchedulerLockSite`, the `SchedulerLockGuard`/`measured_lock` wrappers, the dual `cfg(feature = "measure")` `scheduler_lock`/`scheduler_lock_site` paths, and the eight per-site counter axes in `kernel/src/measure.rs` were added when the global scheduler lock was the suspected scaling bottleneck. After the runqueue/dispatch split landed and the documented per-CPU ownership invariants stabilized, decide which sites still justify dedicated counters and which should fold back into the aggregate `scheduler_lock` line. Keep the `cfg(feature = "measure")` gating; reduce the surface so the scheduler still reads as one lock acquisition path under non-measure builds. -
Reassess
`single_cpu_owner_pids`, `direct_ipc_target`, and `handoff_current` before Phase E starts. The single-owner pinning policy, the one-slot direct-IPC handoff, and the per-CPU handoff guard each special-case a small subset of the dispatch flow; document or delete each one against the accepted Phase D fair-policy behavior before `SchedulingContext` work depends on it. Do not delete them speculatively: the cross-process IPC and process/thread exit cleanup proofs depend on the current direct-IPC and handoff invariants. -
Keep an honest scaling proof when scheduler work resumes. Completed
`2026-05-02 21:38 UTC` on the benchmark VM against `main` commit `374f8556`. Five-run controlled paired evidence, both runs pinned to physical-core logical CPUs `0,1,2,3` on a 4-core/8-thread `n2-highcpu-8` host with KVM:

| Comparison | capOS | Linux pthread | capOS gate | capOS verdict |
| --- | ---: | ---: | ---: | --- |
| 1→2 work | `1.883x` | `1.988x` | ≥ `1.6x` | accepted |
| 1→2 total | `1.787x` | `1.987x` | ≥ `1.6x` | accepted |
| 1→4 work | `1.566x` | `3.963x` | ≥ `1.6x` | diagnostic |
| 1→4 total | `1.538x` | `3.858x` | ≥ `1.6x` | diagnostic |

Linux scales near-linearly on the same physical CPU set (1-to-2 `1.99x`, 1-to-4 `3.96x`), so the workload shape is sound and the capOS 1-to-4 gap is a scheduler bottleneck, not a benchmark artifact. The 1-to-2 result was the formal accepted gate against the single-global-queue scheduler. The 1-to-4 result became the bottleneck-attribution diagnostic that justified Phase D's fair-share enqueue policy; Phase D later manually accepted the `2026-05-10` WFQ 1-to-4 diagnostic pair recorded above while the harness-enforced gates remained the 1-to-2 work/total speedups.

Benchmark shape: blocking parent join, 262,144 blocks (16 MiB), `work_rounds=64`, 5 runs per case (the capOS harness default is 3 runs; this collection explicitly set `CAPOS_THREAD_SCALE_RUNS=5` for parity with the Linux baseline default).

Host caveats: internal benchmark VM in a single GCP zone, status `RUNNING` during collection, machine `n2-highcpu-8` with nested virtualization enabled, `/dev/kvm` readable+writable without sudo, SSH operator account, kernel `Linux 6.17.0-1012-gcp x86_64`, CPU `Intel(R) Xeon(R) CPU @ 2.80GHz`, distinct physical-core layout (logical CPUs 0-3 are core IDs 0-3 thread 0; logical CPUs 4-7 are the SMT siblings), `qemu-system-x86_64 8.2.2`, `rustc 1.97.0-nightly (c935696dd 2026-04-29)`.
Exact commands:

```sh
# capOS
PATH="$HOME/.cargo/bin:$PATH" \
CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
CAPOS_THREAD_SCALE_RUNS=5 \
CAPOS_THREAD_SCALE_REQUIRE_SPEEDUP=1 \
CAPOS_THREAD_SCALE_REQUIRE_TOTAL_SPEEDUP=1 \
CAPOS_THREAD_SCALE_TIMESTAMP=20260502T213544Z \
make run-thread-scale

# Linux pthread baseline
PATH="$HOME/.cargo/bin:$PATH" \
LINUX_THREAD_SCALE_TASKSET_CPUS=0,1,2,3 \
LINUX_THREAD_SCALE_RUNS=5 \
LINUX_THREAD_SCALE_TIMESTAMP=20260502T213445Z \
make run-linux-thread-scale-baseline
```

Raw artifacts on the benchmark VM at `target/thread-scale/20260502T213544Z/` and `target/linux-thread-scale/20260502T213445Z/`. The instance was stopped after collection.
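
One way to read the recorded speedups is to convert them to parallel efficiency and, purely as an interpretive aid, to the effective serial fraction implied by Amdahl's law. The sketch below is not part of the harness: the function names are hypothetical, and fitting a single Amdahl fraction is an assumption that lumps lock contention and every other overhead into one number.

```rust
/// Speedup divided by worker count; 1.0 would be perfect scaling.
pub fn parallel_efficiency(speedup: f64, workers: u32) -> f64 {
    speedup / f64::from(workers)
}

/// True when a measured speedup meets a gate such as the 1.6x threshold.
pub fn passes_gate(speedup: f64, gate: f64) -> bool {
    speedup >= gate
}

/// Effective serial fraction f from Amdahl's law, S = 1 / (f + (1 - f) / N),
/// solved for f: f = (N / S - 1) / (N - 1). Requires workers >= 2.
pub fn amdahl_serial_fraction(speedup: f64, workers: u32) -> f64 {
    let n = f64::from(workers);
    (n / speedup - 1.0) / (n - 1.0)
}
```

Applied to the 1-to-4 work speedups recorded above, `amdahl_serial_fraction(1.566, 4)` is roughly 0.52 for capOS and `amdahl_serial_fraction(3.963, 4)` is roughly 0.003 for the Linux baseline, which is a compact way of restating why the gap is attributed to the scheduler and not to the workload.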