Scheduling

Scheduling decides which thread runs, preserves CPU state across preemption and blocking, and integrates capability-ring progress with process-owned execution resources.

Resource Semantics

The scheduler exposes distinct mechanisms that must not be collapsed into one “CPU quota” claim:

WFQ weight is relative share among runnable work.
SchedulingContext budget/period is generation-bound spend authority and a hard throttle at the current accounting granularity.
CpuIsolationLease controls CPU placement, exclusivity, and nohz eligibility; it does not grant CPU time.
Fairness is relative arbitration. A reservation or SLA additionally requires aggregate per-CPU feasibility admission and accounting for kernel, IRQ, SQPOLL, housekeeping, recovery, and bounded tick overshoot.

No general aggregate reservation admission exists today, so the current contexts and leases must not be described as minimum-service guarantees. See Resource Governance.

The scheduler stores shared process/thread metadata in Scheduler::processes: BTreeMap<Pid, Process>. Dispatch-owned runnable state lives in SchedulerDispatch: a per-CPU run_queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS] array ordered ascending by Thread.virtual_finish_ns, per-CPU current and handoff_current slots, idle-thread slots, the direct-IPC target preference, run-queue reservation accounting, and deferred drop/stack release slots. Each live thread has at most one queued owner across all per-CPU queues combined, and every per-CPU queue reserves capacity up to the live runnable-capable thread count, so later timer, unblock, requeue, and steal-requeue paths do not allocate. The reservation is charged when the thread record is created – the same lock hold that makes it count toward live_thread_count() – not when the thread is later published as runnable, so whole-process teardown (which releases exactly live_thread_count() reservations) stays balanced even for a created-but-not-yet-published thread. The shared live-reservation count is released when processes or threads exit or when an unpublished created thread is rolled back. Reserving each queue to the full live-thread count is required because the bounded steal path may migrate every live thread into a single sibling queue between two scheduler passes.

Phase D accepted its Task 6 diagnostic closeout at commit 77caafc0 (2026-05-10 19:39 UTC, docs(scheduler): record phase d thread-scale gate) and closed in docs commit 1a08ec23 (2026-05-10 21:47 UTC, docs(scheduler): close phase d). The accepted state is the WFQ scheduler described here: per-thread weights and latency classes are mutated only through SchedulingPolicyCap, each per-CPU runnable queue is ordered by freshly derived virtual_finish_ns, migration preserves virtual_runtime_ns, and bounded stealing selects the most-overdue runnable sibling candidate. The controlled Task 6 benchmark pair on capos-bench recorded capOS 1-to-4 work/total speedups 3.088x / 2.700x versus the previous single-global-queue baseline 1.566x / 1.538x; the matching Linux pthread baseline on the same host and physical-core logical CPUs 0,1,2,3 recorded 3.974x / 3.850x. The host harness enforced the configured 1-to-2 work/total gates; the 1-to-4 row was manually accepted from recorded diagnostics. Phase E SchedulingContext is the next scheduler authority phase; EEVDF is a follow-on ordering-policy evaluation rather than a Phase D blocker.

Phase D Task 3 (2026-05-07) restored the per-CPU runnable queues that the 2026-05-02 collapse retired and gave them the WFQ ordering Task 2’s virtual_finish_ns was prepared for. Newly created processes and threads prefer the creating scheduler CPU’s per-CPU queue; post-block wakes and preemption requeues likewise prefer the current CPU. Before the common queue-publication boundary computes the destination WFQ tag, it resolves that preference against static CPU eligibility. A process carrying process-wide endpoint or launch authority remains on CPU0, and the ordered-insert boundary asserts that no queue owner is published on an ineligible CPU. The resolved slot, rather than the preferred slot, is carried into WakePolicy::QueueCpu. Ordinary work therefore retains the local-first policy, while constrained work cannot become a permanent RetryLater entry on a sibling queue. The bounded steal path balances ordinary queues when other CPUs run out of local work. A more sophisticated caller-aware spread or least-loaded scan is a milestone-gate follow-up, not a Task 3 acceptance requirement. Wake policy carries WakePolicy::QueueCpu(u32) for endpoint, timer, park, process-wait, thread-join, and process-spawn completions so the wake target matches the queue placement, and DirectTarget keeps its original direct-IPC handoff role. The transitional CAPOS_SCHED_DISABLE_WFQ=1 / WakePolicy::QueueAny fallback has been removed before Phase E SchedulingContext schema work.

wake_idle_scheduler_cpus_locked first probes the placement target when the policy is QueueCpu, then walks eligible idle scheduler CPUs and wakes the first that accepts a fresh reschedule IPI. It skips CPUs that already have a pending IPI, so a burst of ready work cross-wakes more than one neighbor instead of stranding the rest behind one already-targeted CPU.

Ring SQ Consumer Ownership

Each ring endpoint has kernel-owned SQ-consumer metadata outside the writable userspace ring page. cap_enter and the bounded timer-side current-thread ring service both acquire a syscall-mode owner lease before calling process_ring(). The lease carries a nonzero generation and owner identity; process_ring() verifies that generation before flushing deferred ring work or advancing SQ head, and stale owners return StaleSqConsumer without consuming the head SQE. Duplicate owners fail closed as a retryable busy cap_enter status.
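
The generation check can be pictured with a small model. This is a minimal sketch, not the kernel code: the SqConsumer and OwnerLease types, the Busy label, and the method names are hypothetical stand-ins, and only StaleSqConsumer is a name taken from the text above.

```rust
/// Hypothetical error labels; only StaleSqConsumer is named in the text above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SqError {
    Busy,
    StaleSqConsumer,
}

/// A lease handed to one SQ consumer: owner identity plus a nonzero generation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct OwnerLease {
    owner: u32,
    generation: u64,
}

/// Kernel-side metadata kept outside the writable ring page (sketch only).
#[derive(Debug)]
struct SqConsumer {
    next_generation: u64,
    owner: Option<OwnerLease>,
    sq_head: u32,
}

impl SqConsumer {
    fn new() -> Self {
        // Generations start at 1 so a zeroed lease can never match.
        Self { next_generation: 1, owner: None, sq_head: 0 }
    }

    /// A second owner fails closed as a retryable busy status.
    fn acquire(&mut self, owner: u32) -> Result<OwnerLease, SqError> {
        if self.owner.is_some() {
            return Err(SqError::Busy);
        }
        let lease = OwnerLease { owner, generation: self.next_generation };
        self.next_generation += 1;
        self.owner = Some(lease);
        Ok(lease)
    }

    /// Only the matching live lease clears ownership.
    fn release(&mut self, lease: OwnerLease) {
        if self.owner == Some(lease) {
            self.owner = None;
        }
    }

    /// Verify the lease before touching SQ head; a stale owner consumes nothing.
    fn process_ring(&mut self, lease: OwnerLease, sqes: u32) -> Result<u32, SqError> {
        if self.owner != Some(lease) {
            return Err(SqError::StaleSqConsumer);
        }
        self.sq_head = self.sq_head.wrapping_add(sqes);
        Ok(self.sq_head)
    }
}
```

In this model a released lease keeps its old generation, so a late process_ring() call from the previous owner is rejected before SQ head moves.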

CQ publication remains independent of SQ ownership. Already accepted completions stay visible through CQ head/tail even after the SQ owner releases, and thread/process teardown releases any live SQ owner before ring unmapping or record drop without clearing accepted CQEs.

Bounded SQPOLL ring mode

Phase F adds a bounded SQPOLL mode for the caller thread’s ring through CpuIsolationLease with allowedMode = kernelSqpoll and namedRing = callerThread. The transition is explicit: syscall-owned dispatch may request SQPOLL start while it still owns the SQ, then releases its generation-checked owner; the poller finalizes into SqpollRunning, may publish NEED_WAKEUP and enter SqpollSleeping, wakes back to running when a producer publishes a new SQ tail, and stops or rolls back on lease revoke, cap release, teardown, or failed start. Timer-side syscall-mode ring service fails closed while SQPOLL owns the same endpoint, so no second SQ consumer can advance the SQ head.

The Phase F poller runs from the periodic scheduler service path and from a bounded current-thread syscall service entry used for SQPOLL producer wakes and explicit syscall kicks. Both entries borrow the SQPOLL owner lease rather than acquiring syscall SQ ownership. The current default admits two SQEs per selected SQPOLL worker, and a worker is not reselected again in the same periodic service pass or syscall service entry. Poller elapsed time is charged to the admitted scheduler ledger or scheduling-context target. The wake/sleep protocol uses a shared ring flag: the poller publishes NEED_WAKEUP, performs a full ordering barrier, and rechecks SQ tail before sleeping; producers publish initialized SQEs, store SQ tail with a barrier, and enter the kernel if NEED_WAKEUP is visible. A cap_enter producer wake that finds SQPOLL already owns SQ head can run one bounded SQPOLL batch, return visible CQ availability when the requested threshold is satisfied, preserve ordinary blocked-current-thread and thread-owned-head results, and otherwise fail closed as a retryable busy result. Stale owner generations fail before deferred ring work or SQE start. If teardown requests stop after a live owner has already accepted a SQE, the poller still publishes SQ head for that accepted SQE before releasing ownership, preserving accepted CQEs without leaving work replayable by syscall mode. The focused make run-scheduler-generic-sqpoll-nohz proof admits this explicit ring-coupled shape into SQPOLL nohz, drives producer wake and bounded service progress without depending on a periodic tick, then rolls back on stale owner/lease revoke. The userspace AutoNoHz policy daemon now issues bounded compute leases from manifest-declared profiles. Cross-process target discovery, broader userspace-poller/device-queue admission, and production realtime admission remain future work.
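
The wake/sleep handshake described above can be sketched with standard atomics. This is an illustrative model under stated assumptions: the Ring struct, the NEED_WAKEUP bit value, and both function names are hypothetical, and the real ring layout and memory orderings may differ.

```rust
use std::sync::atomic::{fence, AtomicU32, Ordering};

/// Hypothetical flag bit; the real value lives in the ring ABI.
const NEED_WAKEUP: u32 = 1;

/// Hypothetical shared ring words visible to both poller and producer.
struct Ring {
    sq_tail: AtomicU32,
    flags: AtomicU32,
}

/// Poller side: publish NEED_WAKEUP, issue a full barrier, then recheck SQ tail.
/// Returns true when it is safe to sleep, false when new work raced in.
fn poller_try_sleep(ring: &Ring, sq_head: u32) -> bool {
    ring.flags.fetch_or(NEED_WAKEUP, Ordering::Relaxed);
    fence(Ordering::SeqCst);
    if ring.sq_tail.load(Ordering::Relaxed) != sq_head {
        // A producer published work before we slept: stay running.
        ring.flags.fetch_and(!NEED_WAKEUP, Ordering::Relaxed);
        return false;
    }
    true
}

/// Producer side: publish the new SQ tail, issue a barrier, then check the flag.
/// Returns true when the producer must enter the kernel to wake the poller.
fn producer_publish(ring: &Ring, new_tail: u32) -> bool {
    ring.sq_tail.store(new_tail, Ordering::Release);
    fence(Ordering::SeqCst);
    (ring.flags.load(Ordering::Relaxed) & NEED_WAKEUP) != 0
}
```

The two full fences are what rule out the lost-wakeup interleaving: either the poller's recheck sees the new tail, or the producer sees NEED_WAKEUP and enters the kernel.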

Per-CPU run queue ordering structure

Each per-CPU VecDeque<ThreadRef> is kept ordered ascending by Thread.virtual_finish_ns. Enqueue performs an ordered insert via a linear scan from the front; selection scans the queue by index for the first destination-Runnable entry (via pop_first_runnable_local_locked), removes Drop entries it walks past, and leaves RetryLater entries undisturbed for the next scheduler pass. Because the queue is ordered ascending, the first Runnable hit is also the lowest-virtual_finish_ns candidate the destination CPU can accept (the most overdue against fair share that this CPU is allowed to run). Linear-scan insert is O(n) per enqueue; with SCHEDULER_CPUS = 4 and bounded thread counts in this slice the constant is small enough to defer a smarter structure (sorted bucket arrays, intrusive trees) until benchmark evidence shows it dominates scheduler-lock hold time. Promoting to a smarter structure is a follow-up under this plan if the Task 6 milestone gate proves the need.
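
The ordered insert and index scan can be sketched on a plain VecDeque. This is a simplified model, not the kernel implementation: Entry, Verdict, and both function names are hypothetical, the classification callback stands in for the destination CPU's Runnable/RetryLater/Drop decision, and placing equal tags after existing entries is an assumed tie rule.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for ThreadRef plus the tag the queue is ordered by.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Entry {
    tid: u32,
    virtual_finish_ns: u64,
}

/// Hypothetical per-entry verdict reached by the destination CPU.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Runnable,
    RetryLater,
    Drop,
}

/// Ordered insert: linear scan from the front, landing before the first
/// entry with a strictly larger tag (assumed tie rule: equal tags go after).
fn ordered_insert(queue: &mut VecDeque<Entry>, entry: Entry) {
    let pos = queue
        .iter()
        .position(|e| e.virtual_finish_ns > entry.virtual_finish_ns)
        .unwrap_or(queue.len());
    queue.insert(pos, entry);
}

/// Index scan: remove Drop entries in place, step over RetryLater entries,
/// and return the first Runnable entry (the lowest tag this CPU may run).
fn pop_first_runnable(
    queue: &mut VecDeque<Entry>,
    classify: impl Fn(&Entry) -> Verdict,
) -> Option<Entry> {
    let mut i = 0;
    while i < queue.len() {
        match classify(&queue[i]) {
            Verdict::Runnable => return queue.remove(i),
            Verdict::RetryLater => i += 1,
            Verdict::Drop => {
                // Removal shifts later entries down, so the index stays put.
                queue.remove(i);
            }
        }
    }
    None
}
```

Unlike the kernel queues, this sketch does not pre-reserve capacity, so insert may allocate here.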

virtual_finish_ns is recomputed on every enqueue from the thread’s current virtual_runtime_ns, weight, and latency_class; it is never carried as committed state across blocking, and migrations between per-CPU queues recompute it at the destination so the destination’s view of fair-share progress applies. The derivation rule per latency class is documented in capos-abi/src/scheduler.rs and the “Latency-class semantics for Phase D” section of docs/proposals/scheduler-evolution-proposal.md.

Bounded steal path

When a CPU’s local queue has no immediately runnable entry, the scheduler walks sibling per-CPU queues. For each sibling queue, the scan walks indices in ascending order and selects that queue’s first entry that the destination CPU considers Runnable; because each queue is ordered ascending by virtual_finish_ns, the first Runnable hit is also the lowest virtual_finish_ns candidate available to the destination on that source queue. The steal then picks the source queue whose first-Runnable candidate has the lowest virtual_finish_ns overall, with ties broken by lower CPU id. The chosen entry is removed from its current position in the source queue (not necessarily the head: a RetryLater or single-CPU-owner thread may sit at the source’s front and stay there), the WFQ tag is recomputed at the destination, and the entry is inserted at the destination’s ordered position. The destination queue is reserved to the full live-thread count, so the steal-requeue is allocation-free. In the worst case the scan walks SCHEDULER_CPUS * max_queue_len entries, but in practice each sibling scan stops at the first Runnable candidate per queue.
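
The source-queue choice can be sketched as follows. This is a minimal model under assumptions: Entry, its runnable_on_dest field (a precomputed stand-in for the destination CPU's Runnable verdict), and pick_steal_candidate are hypothetical names, and SCHEDULER_CPUS = 4 mirrors the value quoted earlier in this document.

```rust
use std::collections::VecDeque;

const SCHEDULER_CPUS: usize = 4;

/// Hypothetical queue entry; runnable_on_dest stands in for the
/// destination CPU's per-entry Runnable check.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Entry {
    tid: u32,
    virtual_finish_ns: u64,
    runnable_on_dest: bool,
}

/// Returns (source_cpu, index_in_source_queue) of the entry to steal, if any.
fn pick_steal_candidate(
    queues: &[VecDeque<Entry>; SCHEDULER_CPUS],
    dest_cpu: usize,
) -> Option<(usize, usize)> {
    let mut best: Option<(u64, usize, usize)> = None;
    for (cpu, queue) in queues.iter().enumerate() {
        if cpu == dest_cpu {
            continue;
        }
        // Queues are ordered ascending, so the first Runnable hit is this
        // queue's lowest-tag candidate; the scan stops there.
        let first = queue.iter().enumerate().find(|(_, e)| e.runnable_on_dest);
        if let Some((idx, e)) = first {
            // Strict `<` while iterating CPUs in ascending order breaks
            // ties toward the lower CPU id.
            let better = match best {
                None => true,
                Some((tag, _, _)) => e.virtual_finish_ns < tag,
            };
            if better {
                best = Some((e.virtual_finish_ns, cpu, idx));
            }
        }
    }
    best.map(|(_, cpu, idx)| (cpu, idx))
}
```

The returned index need not be 0: a non-Runnable entry at the source's front is stepped over and left in place.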

RetryLater semantics in the local scan

The local pop scan walks the per-CPU queue by index instead of popping the front and re-pushing RetryLater candidates. Re-pushing a RetryLater entry whose virtual_finish_ns has not changed would ordered-insert it back at the same head position, so a naive pop-then-requeue loop would re-pop the same RetryLater head every iteration and starve runnable entries behind it. The index scan removes Drop entries in place, leaves RetryLater entries undisturbed for the next scheduler pass to re-evaluate, and returns the first Runnable candidate it finds. The bounded steal path uses the same index scan on the destination queue after a steal so a stolen RetryLater entry does not get re-popped in the same dispatch pass.

Phase E preflight fallback cleanup

The one-bisect-cycle CAPOS_SCHED_DISABLE_WFQ=1 opt-out has been removed. Enqueues always target the selected per-CPU WFQ queue, and wake-up sites always carry WakePolicy::QueueCpu(slot) for queued work. Phase E SchedulingContext work therefore starts from the accepted Phase D WFQ behavior rather than from a source-level single-global-queue fallback.

Phase E Task 1: scheduling-context object shape

The first SchedulingContext slice is info-only: schema, config, runtime, and kernel code expose SchedulingContext.info() and a bootstrap grant shape, but add no dispatcher enforcement, replenishment, donation/return, depletion notification, realtime island, SQPOLL, or nohz behavior. SchedulingContextSpec.cpuMask uses the canonical little-endian bitset defined in schema/capos.capnp: CPU n maps to bit n % 8 of byte n / 8, with bit 0 as the least-significant bit of that byte. Empty data means no CPUs are selected rather than all CPUs. Producers omit trailing zero bytes, so the all-zero set’s canonical form is empty and any non-empty canonical mask ends with a nonzero byte.
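
The cpuMask encoding rule can be illustrated directly. The helper names below are hypothetical and are not taken from schema/capos.capnp or the runtime; the sketch only follows the bit-placement and canonical-form rules stated above.

```rust
/// Encode CPU indices as a little-endian bitset:
/// CPU n sets bit (n % 8) of byte (n / 8), bit 0 being the least significant.
fn encode_cpu_mask(cpus: &[u32]) -> Vec<u8> {
    let mut bytes: Vec<u8> = Vec::new();
    for &cpu in cpus {
        let byte = (cpu / 8) as usize;
        if bytes.len() <= byte {
            bytes.resize(byte + 1, 0);
        }
        bytes[byte] |= 1u8 << (cpu % 8);
    }
    // The highest byte always has a bit set, so the result is already canonical.
    bytes
}

/// Trim trailing zero bytes so an arbitrary mask reaches canonical form.
fn canonicalize(mask: &mut Vec<u8>) {
    while mask.last() == Some(&0) {
        mask.pop();
    }
}

/// Canonical masks are empty or end with a nonzero byte.
fn is_canonical(mask: &[u8]) -> bool {
    mask.last().map_or(true, |&b| b != 0)
}

/// Bytes beyond the end of the mask are treated as zero (CPU not selected).
fn contains_cpu(mask: &[u8], cpu: u32) -> bool {
    mask.get((cpu / 8) as usize)
        .map_or(false, |&b| (b & (1u8 << (cpu % 8))) != 0)
}
```

For example, CPUs 0 and 3 encode to the single byte 0x09, and CPU 9 alone encodes to the two bytes 0x00 0x02.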

Phase E Task 2: bind, revoke, and generation identity

The second SchedulingContext slice adds the first bounded authority lifecycle. SchedulingContext.create() creates a same-interface result cap for a validated spec, bindCallerThread() records one caller-thread binding for the current context generation, and revoke() advances the generation and clears the matching thread metadata binding. Bootstrap-granted contexts and contexts returned by create() use the same non-wrapping context-id allocator; the binding identity remains (contextId, generation), but distinct cap objects no longer share bootstrap ids. Stale caps report staleGeneration and cannot create, bind, or revoke scheduler metadata for a new generation; already-revoked contexts report revoked. Release cleanup clears only a thread metadata binding that matches the released cap identity.

Phase E: SchedulingContext budget enforcement

make run-scheduling-context is the focused Phase E QEMU proof. It starts one process with two independently granted bootstrap contexts, verifies their identities cannot alias, adopts a created result cap, drives bind/revoke and stale-generation calls, confirms release cleanup by rebinding after the released cap drops, and now checks the first dispatcher budget behavior. bindCallerThread() installs a fixed budget ledger in the caller thread’s scheduler metadata. Runtime charge decrements that ledger at the same scheduler-lock-contained points that update per-thread runtime/vruntime. Runnable selection replenishes elapsed periods and treats exhausted bound contexts as RetryLater until their next period, leaving the queued owner in place rather than allocating or moving emergency-path state. Stale or revoked contexts still fail closed before mutating scheduler metadata or accounting.

The current enforcement granularity is the existing periodic scheduler tick: a running thread may overshoot its budget by the current tick quantum before the next dispatch charge throttles it. The smoke therefore proves bounded dispatcher behavior, not nohz/SQPOLL activation or hard realtime admission. It prints dispatch_effect=budgetEnforced, visible budget charge, replenishment to full budget after a period, and a throttled wall-clock window.
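
The charge, replenish, and throttle behavior can be modeled with a small ledger. This is a sketch under assumptions, not the kernel ledger: BudgetLedger, its field names, and the method names are hypothetical, replenishment is assumed to restore the full budget at each period boundary, and the charge saturates at zero to model tick overshoot.

```rust
/// Hypothetical fixed budget ledger installed by a bind (sketch only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BudgetLedger {
    budget_ns: u64,
    period_ns: u64,
    remaining_ns: u64,
    next_replenish_ns: u64,
}

impl BudgetLedger {
    fn new(budget_ns: u64, period_ns: u64, now_ns: u64) -> Self {
        assert!(period_ns > 0, "a zero period would never replenish");
        Self {
            budget_ns,
            period_ns,
            remaining_ns: budget_ns,
            next_replenish_ns: now_ns + period_ns,
        }
    }

    /// Charge runtime at a dispatch boundary. A thread that ran past its
    /// budget within one tick saturates at zero instead of going negative.
    fn charge(&mut self, ran_ns: u64) {
        self.remaining_ns = self.remaining_ns.saturating_sub(ran_ns);
    }

    /// Skip over every elapsed period and restore the full budget.
    fn replenish(&mut self, now_ns: u64) {
        if now_ns >= self.next_replenish_ns {
            let periods = (now_ns - self.next_replenish_ns) / self.period_ns + 1;
            self.next_replenish_ns += periods * self.period_ns;
            self.remaining_ns = self.budget_ns;
        }
    }

    /// Selection-time check: false corresponds to treating the queued
    /// owner as RetryLater until its next period.
    fn selectable(&mut self, now_ns: u64) -> bool {
        self.replenish(now_ns);
        self.remaining_ns > 0
    }
}
```

With a 2 ms budget and a 10 ms period, a thread charged 4 ms in total is throttled until the 10 ms boundary, which is the overshoot-then-throttle shape the smoke observes at tick granularity.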

Phase F: CpuIsolationLease and automatic nohz activation

CpuIsolationLease is a separate authority surface from SchedulingContext CPU-time budget enforcement. The scaffold records owner identity, allowed CPU set, allowed isolation mode, live accounting target reference, housekeeping exclusions, maximum revocation latency, and generation identity. It rejects stale generations, duplicate or overlapping active leases, fabricated or stale SchedulingContext accounting targets, malformed CPU masks, and lease sets that would leave no online scheduler housekeeping CPU outside the globally admitted active lease CPUs.

The scheduler-side preflight reports a bounded nohz activation/deactivation decision surface: lease identity, target CPU mask, target runnable entity count, active housekeeping CPU availability after subtracting all active lease CPUs, selected housekeeping CPU mask, deferred cleanup, timer/deadline, network polling, IRQ-affinity, accounting-target, monotonic clocksource/accounting readiness, one-SQ-consumer, revocation latency, rollback, and periodic-fallback labels. The accepted QEMU proof uses -smp 4 so an active lease can report ready housekeeping CPUs outside the target CPU, selected housekeeping placement, and exactly one runnable caller on that target CPU.

The clockevent/deadline substrate uses a calibrated TSC-backed monotonic clocksource on normal QEMU/x86_64, with the periodic LAPIC tick disciplining the TSC epoch so QEMU guest halt windows cannot stall wall-clock progress. Timer.sleep, finite cap_enter, and park timeouts store absolute monotonic deadline_ns values, and the LAPIC clockevent backend can program a bounded one-shot deadline and restore periodic mode.

Automatic nohz activation state machine

When the preflight finds every proof obligation satisfied – a single runnable entity on the target CPU, a ready housekeeping CPU outside the lease, no local deferred-cleanup/timer dependency, a valid accounting target, a live monotonic clocksource, a non-stale one-SQ-consumer when a ring is named, a bounded revocation latency, and the lease’s allowedCpuMask naming exactly one scheduler-owned CPU – it performs real per-CPU periodic-tick suppression for that narrow single-runnable window. The target CPU may be the CPU running the preflight call (local activation) or a different scheduler CPU (remote-CPU activation via a reschedule IPI – see Remote-CPU activation below). The single-runnable shape differs by target: a local activation requires the caller itself to be that single entity (exactly-one-runnable-caller); a remote activation requires the target CPU’s single runnable entity to be some thread pinned there, not the caller (which runs on a different CPU – exactly-one-runnable-remote-target).

Admission gates. Two lease shapes can be admitted for tick suppression: a pure namedRing = none compute lease, and a ring-coupled allowedMode = kernelSqpoll lease whose bound ring is being actively driven by a live SQPOLL consumer.
- Compute lease (namedRing = none). Declares no local network/IRQ dependency, so the read-only network-polling and IRQ-affinity admission gates pass.
- Ring-coupled SQPOLL lease (allowedMode = kernelSqpoll, namedRing = callerThread). The lease’s declared kernel-polled work IS the bounded SQPOLL ring poller, which the scheduler keeps progressing through cap_enter/producer-wake even while the periodic tick is masked. The preflight admits it only when the bound ring is in SQPOLL running/sleeping mode with a non-stale Sqpoll owner; the one-SQ-consumer label is then blocked-sqpoll-owner (the worker owns the ring). The preflight ring-state read is a best-effort hint – it never takes the per-ring lock inside the scheduler lock (it uses try_lock, and a contended snapshot does not admit activation). The decisive disqualifier is the IPI/timer re-check below.
- A namedRing = callerThread lease that is not kernelSqpoll (compute-with-ring) keeps the conservative refusal until network polling and IRQ affinity are routed to a housekeeping CPU, as does any device-owning mode. The kernel still services virtio RX/TX and Interrupt waiters inline from the periodic scheduler path.
Activate. The preflight masks the periodic LAPIC timer on the current CPU and arms a one-shot deadline at min(nearest pending timer wakeup, now + max revocation latency). The CPU now runs on a bounded one-shot deadline instead of the periodic tick. The eligible lease generation is registered so revoke/cleanup paths can stale it.
Re-check. On every timer interrupt and on every reschedule IPI the handler re-checks the activation window before the scheduler picks the next thread. The reschedule-IPI handler also drains any pending remote-CPU activation request parked for this CPU (the IPI vector is shared with the remote-activation path – see Remote-CPU activation below), and the periodic timer handler drains it too as a backstop. An unchanged eligible window re-arms the bounded one-shot deadline; a reschedule IPI (the prompt signal that another CPU woke runnable work onto this CPU) drives an immediate rollback. The re-check runs in interrupt context and uses try_lock to avoid deadlocking against a held scheduler lock. Armed-timer invariant: the masked-periodic one-shot does not auto-rearm, so a timer-interrupt re-check NEVER returns leaving a tickless CPU without an armed timer – on scheduler-lock contention it arms a bounded minimum-delta fallback one-shot (or restores the periodic tick) before returning. A lock-free per-CPU nohz-active bitmask lets the contention path distinguish a tickless CPU (the consumed timer was the nohz one-shot and must be replaced) from a normal CPU (the periodic tick auto-rearms). A reschedule IPI does not consume the one-shot, so its contention skip is safe – the still-armed one-shot bounds the next re-check.
Rollback. Any disqualifying change rolls the CPU back to the periodic LAPIC tick first, before any further ordinary work: a stale lease generation (explicit revoke, process exit, service replacement, session logout), a second runnable entity or stealable sibling work on the target CPU, a local deferred-cleanup dependency, a direct-IPC target becoming runnable, a target-CPU mismatch, or a one-shot backend that can no longer arm a deadline. For a ring-coupled SQPOLL activation the re-check also carries a sqpoll-ring-mode-changed-or-owner-staled disqualifier (the bound ring leaving SQPOLL running/sleeping mode or its owner staling); that re-check runs under the scheduler lock and uses try_lock on the per-ring lock, so a contended ring is treated as disqualifying (fail-closed – restore the periodic tick rather than keep a CPU tickless on an unverifiable ring). That SQPOLL ring-mode branch is defense-in-depth, currently subsumed by lease-generation staling: every reachable SQPOLL-stop path today (stop_sqpoll_for_lease / stop_sqpoll_if_owned) is a revoke/cleanup-path caller that also stales the lease, and stale-lease-generation is checked first – so the lease-generation stale is the load-bearing SQPOLL rollback trigger in practice. The SQPOLL ring-mode branch becomes independently load-bearing, and would then need its own proof, only if a future change introduces a SQPOLL-stop path that keeps the lease live. Runtime accounting stays boundary/counter driven and monotonic, so suppressing the tick never strands SchedulingContext budget charging.
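
The one-shot deadline rule from the Activate step reduces to a small computation. The function name and parameter names below are hypothetical; the sketch assumes absolute monotonic nanosecond timestamps and represents "no pending timer wakeup" as None.

```rust
/// Deadline for the nohz one-shot:
/// min(nearest pending timer wakeup, now + max revocation latency).
fn nohz_one_shot_deadline_ns(
    now_ns: u64,
    nearest_timer_wakeup_ns: Option<u64>,
    max_revocation_latency_ns: u64,
) -> u64 {
    // Saturate so a huge latency bound cannot wrap into the past.
    let latency_bound = now_ns.saturating_add(max_revocation_latency_ns);
    match nearest_timer_wakeup_ns {
        Some(wakeup) => wakeup.min(latency_bound),
        None => latency_bound,
    }
}
```

The latency term guarantees that a tickless CPU re-checks its activation window within the lease's maximum revocation latency even when no timer is pending.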

Remote-CPU activation

Masking the periodic LAPIC tick and arming the one-shot deadline are per-CPU operations – only the target CPU can program its own LAPIC timer. When the preflight runs on CPU A but the lease’s single-CPU allowedCpuMask targets a different CPU B, the kernel does not refuse: it parks a bounded remote-activation request in CPU B’s per-CPU slot and sends a reschedule-style IPI to CPU B. CPU B drains the request from its IPI handler (and from its periodic timer handler as a backstop), re-runs the full disqualification check locally under its own scheduler-lock acquisition, and only then arms its own one-shot deadline. A remote activation is never trusted blind – the preflight’s eligibility snapshot was taken on a different CPU and may be stale by the time the IPI is drained, so the target CPU re-checks before committing. The relevant invariants:

Bounded request slot, no nesting. The pending-request store is a fixed [Option<_>; SCHEDULER_CPUS] array – one single-entry slot per CPU, so it can never grow unbounded. If a slot already holds an undrained request, a new preflight fails closed (rejected) rather than queuing behind it. The IPI-context drain never nests the scheduler lock: it takes only the small per-CPU slot mutex, then calls the activation in try_lock mode.
Contention retry. If the IPI-context drain finds the scheduler lock contended, it leaves the request parked and returns; the target CPU’s next periodic timer tick (still live – the tick has not been suppressed) retries the drain. Progress is bounded by the periodic tick the same way the existing local re-check contention path is.
Fail-closed IPI ordering. A remote rollback (rollback_nohz_for_lease) stales the lease generation before clearing the activation record. The drain re-checks the generation before arming, so a rollback that races the drain fails closed (the request is dropped, the periodic tick stays live). If the drain already committed before the rollback cleared the record, the target CPU’s next nohz_recheck sees the nohz-active bit set with no record and restores its periodic tick. Either ordering converges on the periodic tick.
Compute-only. Remote-CPU activation is limited to namedRing = none compute leases in this slice. A ring-coupled SQPOLL lease whose target differs from its ring owner’s CPU is not an admitted shape; it fails closed.
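
The bounded slot, contention retry, and generation re-check can be sketched together. This is an illustrative model only: RemoteRequest, RemoteSlots, DrainOutcome, and the method names are hypothetical, std::sync::Mutex stands in for the kernel's locks, and the live lease generation is passed in as a plain value rather than read from lease state.

```rust
use std::sync::Mutex;

const SCHEDULER_CPUS: usize = 4;

/// Hypothetical parked request: which lease generation asked for activation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RemoteRequest {
    lease_id: u64,
    lease_generation: u64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DrainOutcome {
    Empty,
    LeftParked,
    DroppedStale,
    Armed(RemoteRequest),
}

/// One single-entry slot per CPU, so the store can never grow.
struct RemoteSlots {
    slots: [Mutex<Option<RemoteRequest>>; SCHEDULER_CPUS],
}

impl RemoteSlots {
    fn new() -> Self {
        Self { slots: std::array::from_fn(|_| Mutex::new(None)) }
    }

    /// Park a request; an occupied slot rejects instead of queuing.
    fn park(&self, cpu: usize, request: RemoteRequest) -> Result<(), RemoteRequest> {
        let mut slot = self.slots[cpu].lock().unwrap();
        if slot.is_some() {
            return Err(request);
        }
        *slot = Some(request);
        Ok(())
    }

    /// Drain on the target CPU: never block on the scheduler lock, and
    /// re-check the lease generation before committing to arm.
    fn drain(&self, cpu: usize, scheduler_lock: &Mutex<()>, live_generation: u64) -> DrainOutcome {
        let mut slot = self.slots[cpu].lock().unwrap();
        let Some(request) = *slot else {
            return DrainOutcome::Empty;
        };
        let Ok(_guard) = scheduler_lock.try_lock() else {
            // Contended: leave the request parked for the next periodic tick.
            return DrainOutcome::LeftParked;
        };
        *slot = None;
        if request.lease_generation == live_generation {
            DrainOutcome::Armed(request)
        } else {
            DrainOutcome::DroppedStale
        }
    }
}
```

A rollback that advances the generation before the drain runs therefore lands in DroppedStale, and the target CPU keeps its periodic tick.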

Generic full-nohz admission for ordinary budgeted compute threads is available only through an explicit SchedulingContext-targeted compute lease and the same fail-closed placement gates described above. The SQPOLL nohz state machine now admits explicitly leased caller-thread rings when the SQPOLL worker is live, single-consumer, and bounded by producer wake/deadline rollback. Broader userspace-poller/device-queue admission, automatic CPU-isolation issuance, and production realtime island admission remain future work; auto_nohz stays disabled. Timeout-based auto-revoke landed 2026-05-30 15:22 UTC: a CpuIsolationLease created with leaseLifetimeNs > 0 records an absolute expiry deadline, auto-revokes through the existing generation-advancing cleanup on first observation past it (reason=lease-expired), and the nohz activation record carries the lifetime deadline so a tickless CPU rolls back at the next timer/IPI recheck (lease-lifetime-expired disqualifier), bounded by maxRevocationLatencyNs. A leaseLifetimeNs of 0 preserves the prior revoke/cleanup-only lifecycle. The current SQPOLL-driven activation is the bounded case: tick suppression for a ring-coupled kernelSqpoll lease on the CPU running the preflight, rolled back through lease-generation staling on revoke/cleanup, with the SQPOLL ring-state re-check as defense-in-depth for any future SQPOLL-stop path that does not stale the lease.

Lease revocation and cleanup are generation-aware. Explicit revoke, process exit, service replacement through process termination, and session logout stale the matching generation so old caps cannot keep isolation eligibility alive, and rolling the matching lease’s active nohz window back to the periodic tick is part of the same cleanup path. make run-scheduler-cpu-isolation-lease is the broad QEMU proof for grant, info, revoke, cleanup, real nohz activation and fail-closed rollback, bounded SQPOLL start/sleep/stop, rollback labels, generic full-nohz, and SQPOLL nohz. make run-scheduler-generic-sqpoll-nohz is the focused SQPOLL proof for eligible ring admission, producer wake, SQPOLL service, rollback, and stale owner rejection.

Phase E: endpoint donation and return

Synchronous endpoint delivery now carries a bounded internal donation token when a caller thread with a bound active SchedulingContext delivers a CALL to a receiver thread that has no scheduling context of its own. Donation is strictly passive-server shaped: receivers that already have a scheduling context keep their own authority, unbound callers donate nothing, and callers that receive a donation token are blocked from returning to userspace until the in-flight endpoint call returns or is canceled.

At delivery, the scheduler charges pre-donation caller runtime before moving the context ledger to the receiver. While the receiver handles the endpoint message, normal dispatcher runtime charging decrements the donated context. When endpoint RETURN commits the caller completion, the scheduler first charges receiver runtime since dispatch, then returns the remaining budget and next-replenishment state to the caller’s thread metadata and rebinds the SchedulingContext record to the caller. Return preflight failures leave the in-flight donation in place, while application-exception RETURN, invalid-result RETURN errors, delivery failure, return cancellation, endpoint teardown, process/thread exit, and stale-caller cleanup return or clear the donation before waking the caller and without allocating new emergency-path storage. Nested donation of an already donated context is rejected; supporting stacked donation is deferred until it has an explicit return-token stack design.
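
The donation and return bookkeeping can be modeled on per-thread metadata. This is a simplified sketch, not the kernel path: Ledger, ThreadMeta, Donation, and both function names are hypothetical, threads are addressed by slice index, and nested donation is modeled as a rejected donation attempt since the exact rejection surface is not specified here.

```rust
/// Hypothetical context ledger carried in thread metadata.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Ledger {
    remaining_ns: u64,
    next_replenish_ns: u64,
}

/// Hypothetical per-thread scheduler metadata (sketch only).
#[derive(Debug, Default)]
struct ThreadMeta {
    ledger: Option<Ledger>,
    /// Set on a receiver that currently runs on a donated context.
    donor: Option<usize>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Donation {
    Donated,
    NotDonated,
    NestedRejected,
}

/// CALL delivery: charge the caller's pre-donation runtime, then move the ledger.
fn deliver_call(
    threads: &mut [ThreadMeta],
    caller: usize,
    receiver: usize,
    caller_ran_ns: u64,
) -> Donation {
    if threads[caller].donor.is_some() {
        // The caller itself runs on a donated context: no stacked donation.
        return Donation::NestedRejected;
    }
    if threads[receiver].ledger.is_some() {
        // Receivers with their own context keep their own authority.
        return Donation::NotDonated;
    }
    let Some(mut ledger) = threads[caller].ledger.take() else {
        // Unbound callers donate nothing.
        return Donation::NotDonated;
    };
    ledger.remaining_ns = ledger.remaining_ns.saturating_sub(caller_ran_ns);
    threads[receiver].ledger = Some(ledger);
    threads[receiver].donor = Some(caller);
    Donation::Donated
}

/// RETURN commit: charge receiver runtime, then hand the remainder back.
/// Returns the donor index when a donation was in flight.
fn commit_return(threads: &mut [ThreadMeta], receiver: usize, receiver_ran_ns: u64) -> Option<usize> {
    let donor = threads[receiver].donor.take()?;
    let mut ledger = threads[receiver].ledger.take()?;
    ledger.remaining_ns = ledger.remaining_ns.saturating_sub(receiver_ran_ns);
    threads[donor].ledger = Some(ledger);
    Some(donor)
}
```

The caller's ledger is absent for the whole in-flight window in this model, which corresponds to the donor being blocked from running until RETURN or cancellation.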

make run-scheduling-context proves the behavior with a same-process endpoint round trip. The caller binds a fresh context, burns CPU immediately before CALL, the passive server burns CPU while servicing the endpoint CALL and again immediately before RETURN, and after RETURN the caller observes the reduced budget restored. The same smoke covers application-exception RETURN, oversized-result RETURN under donation, and deterministic rejection of A-to-B-to-C nested donation. It also submits a delivered donated CALL and then uses cap_enter(0, 0) while the server delays RETURN, proving the donor cannot continue outside the donated ledger. A fast-return variant covers the race where the receiver returns before the caller commits to the donation-blocked scheduler state. The smoke prints endpoint_donation=ok, endpoint_return=ok, endpoint_exception_return=ok, endpoint_invalid_return=ok, endpoint_nested_rejected=ok, endpoint_donor_block=ok, endpoint_donor_fast=ok, endpoint_donation_server, endpoint_donation_after, endpoint_exception_return_after, endpoint_invalid_return_after, endpoint_nested_after, endpoint_donor_block_elapsed_ns, endpoint_donor_block_after, endpoint_donor_fast_elapsed_ns, and endpoint_donor_fast_after.

Phase E: SchedulingContext notifications

Every SchedulingContext now owns fixed notification storage allocated at context creation or bootstrap. The storage has two coalescing slots: budgetDepleted and deadlineOrTimeout. Each slot records context id/generation, a saturating sequence, a saturating coalesced-event count, the last holder thread, remaining budget, the next replenishment/deadline timestamp, and whether the holder was using an endpoint-donated context. Runtime charge records depletion when remaining budget transitions to zero and records deadline/timeout expiry against the same context generation. Failed bind attempts do not arm a new budget/deadline window.
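
A coalescing slot of this kind can be sketched as a fixed cell with saturating counters. The struct, its field names, and the drain semantics below are assumptions for illustration: in particular, the sketch assumes the coalesced count resets on drain while the sequence keeps growing, which the text above does not specify.

```rust
/// Hypothetical fixed notification cell (one per event kind, sketch only).
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
struct NotificationSlot {
    pending: bool,
    sequence: u64,
    coalesced: u64,
    last_holder: u32,
    remaining_budget_ns: u64,
}

impl NotificationSlot {
    /// Record an event by overwriting the cell; nothing is queued or allocated.
    fn record(&mut self, holder: u32, remaining_budget_ns: u64) {
        self.sequence = self.sequence.saturating_add(1);
        self.coalesced = self.coalesced.saturating_add(1);
        self.last_holder = holder;
        self.remaining_budget_ns = remaining_budget_ns;
        self.pending = true;
    }

    /// Drain the cell: return a snapshot if anything was recorded since the
    /// last drain (assumed: coalesced count resets, sequence does not).
    fn drain(&mut self) -> Option<NotificationSlot> {
        if !self.pending {
            return None;
        }
        let snapshot = *self;
        self.pending = false;
        self.coalesced = 0;
        Some(snapshot)
    }
}
```

Because every event overwrites the same cell, a burst of depletions costs constant storage and the observer learns how many events were folded together from the coalesced count.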

SchedulingContext.drainNotifications() returns typed observer results: ok drains the matching fixed cells, revoked reports the current revoked generation, and staleGeneration reports an old observer generation without draining the current record. Explicit revoke() records an explicitRevoke lifecycle event. These notifications explain already-enforced scheduler state; they do not donate budget, reorder runnable entities, bypass throttling, publish result caps, append unbounded queues, allocate on scheduler hard paths, or imply auto-nohz/SQPOLL/tickless behavior. A pre-armed observer waiter/wakeup path remains a future extension.

make run-scheduling-context proves the notification slice by repeatedly draining a depleted context after coalescing, observing deadline expiry, recording explicit revoke and stale-observer labels, and confirming that endpoint-donated runtime records notification state on the donated context. The smoke prints notification_coalescing=ok, deadline_notification=ok, revoke_notification=explicitRevoke, stale_notification=staleGeneration, and endpoint_donated_notification=ok.

Phase E: session logout lifecycle hook

UserSession.logout() now notifies the scheduler after the session liveness cell transitions from live to logged out. That covers explicit UserSession.logout() calls, including the remote DTO gateway logout command and connection-teardown path, because those paths already call the same kernel UserSession.logout() method. The hook scans scheduler-owned process/thread metadata for live processes whose immutable SessionContext shares the logged-out liveness cell, removes each non-donated matching thread binding from the scheduler ledger, and asks the bound SchedulingContext record to advance its generation and mark itself revoked. Old ordinary SchedulingContext grants therefore report stale generation through info() with zero visible remaining budget and InfoOnlyNoDispatchChange. The focused session-context smoke also proves stale bindCallerThread() does not rebind, stale create() does not publish a result cap, stale revoke() does not mutate the current metadata generation, and stale notification draining reports a stale observer result.

The hook intentionally does not use session code as a second scheduling-context ledger: session lifecycle code only flips liveness and notifies the scheduler, and the scheduler owns the scan and binding removal. The scan takes one binding at a time under the scheduler lock, drops that lock, then calls the SchedulingContextExitCleanup record hook so it does not invert the existing SchedulingContext record-lock to scheduler-lock order used by bindCallerThread().
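The one-binding-at-a-time scan can be sketched with ordinary mutexes. `Ledger`, `ContextRecord`, and `logout_scan` are hypothetical stand-ins for this example. The real hook runs under kernel locks and calls the SchedulingContextExitCleanup record hook rather than mutating fields directly.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the scheduler ledger of thread bindings.
pub struct Ledger {
    /// (thread id, context record index) pairs owned by the logged-out session.
    pub bindings: Vec<(u32, usize)>,
}

/// Hypothetical stand-in for a SchedulingContext record.
pub struct ContextRecord {
    pub generation: u64,
    pub revoked: bool,
}

/// Removes matching bindings one at a time. The scheduler lock is never held
/// while a record lock is taken, so record-lock -> scheduler-lock remains the
/// only nesting order.
pub fn logout_scan(sched: &Mutex<Ledger>, records: &[Mutex<ContextRecord>]) -> usize {
    let mut stale_marked = 0;
    loop {
        // Step 1: take one binding under the scheduler lock only. The guard is
        // a temporary, so the lock is released at the end of this statement.
        let next = sched.lock().unwrap().bindings.pop();
        let Some((_thread, record_index)) = next else { break };

        // Step 2: run the record-side cleanup with no scheduler lock held.
        let mut record = records[record_index].lock().unwrap();
        record.generation += 1;
        record.revoked = true;
        stale_marked += 1;
    }
    stale_marked
}
```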

In-flight endpoint donation uses a conservative counted/skipped logout policy. If the logged-out session owns a receiver thread that currently holds a donated context, the logout hook records that the donated binding was skipped rather than returning donor budget while the endpoint call remains in flight. The focused session-context smoke proves the donor remains blocked in cap_enter(0, 0) until the receiver returns, the hook reports donation_inflight_skipped=1, and endpoint RETURN removes the receiver binding while restoring only the reduced remaining budget to the donor. This does not add a new logout-triggered cancellation semantic. Local owner-shell exit now calls the held UserSession.logout() before clean shell process exit, so the same scheduler hook observes shell logout with stale_marked=0 donation_inflight_skipped=0 in the shell smoke. The ordinary bound-context stale proof remains the focused session-context smoke, because the normal shell does not hold a bound SchedulingContext. Process and thread exit cleanup already have their own stale-context coverage and are unchanged.

Realtime islands, SQPOLL, auto-nohz, and CPU placement enforcement remain future Phase F/G work.

Phase D Task 4: migration fairness invariants

Phase D Task 4 (2026-05-08) made three migration-fairness invariants explicit:

virtual_runtime_ns travels with the thread. It lives on Thread.cpu_accounting, not on a per-CPU slot, so a migration from CPU A to CPU B preserves the thread’s accumulated weighted-fair share. The accounting field was promoted out of cfg(measure) in Task 2 and continues to advance through charge_runtime regardless of which CPU charges the quantum.
virtual_finish_ns is derived per enqueue, never committed. Every enqueue site – the initial publish in enqueue_ready_thread_on_slot_locked, the post-block requeue in enqueue_unblocked_thread_on_slot_locked, and the steal-insert in steal_from_sibling_queues_locked – routes through refresh_virtual_finish_ns_locked, which reads thread.weight, thread.latency_class, and thread.cpu_accounting.virtual_runtime_ns fresh and recomputes the WFQ ordering tag. The field is never carried as committed state across blocking and is never carried with the thread on migration; the destination CPU’s view of weight, latency class, and quantum decides the new tag.
Steal recomputes at the destination. The pop-from-source step in steal_from_sibling_queues_locked is followed by refresh_virtual_finish_ns_locked against the destination slot before the ordered insert, so a SchedulingPolicyCap.setWeight that landed between source enqueue and steal takes effect at the steal itself.
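A minimal sketch of the three invariants follows. The tag formula (current vruntime plus one weight-scaled quantum), the value of `REFERENCE_WEIGHT`, and the simplified `Thread` are assumptions for illustration. The kernel's refresh_virtual_finish_ns_locked also reads latency_class, and its queues are pre-reserved rather than allocating on insert.

```rust
use std::collections::VecDeque;

/// Assumed reference weight; the real constant lives in capos-abi and may differ.
pub const REFERENCE_WEIGHT: u64 = 64;
/// Assumed quantum, matching the 10 ms TICK_NS cited in this document.
pub const QUANTUM_NS: u64 = 10_000_000;

pub struct Thread {
    pub weight: u16,
    /// Travels with the thread across CPUs.
    pub virtual_runtime_ns: u64,
    /// Ordering tag: rewritten on every enqueue, never trusted across one.
    pub virtual_finish_ns: u64,
}

/// Hypothetical derivation that reads weight and vruntime fresh on each call.
pub fn refresh_virtual_finish_ns(thread: &mut Thread) {
    let weight = u64::from(thread.weight.max(1));
    let slice = QUANTUM_NS * REFERENCE_WEIGHT / weight;
    thread.virtual_finish_ns = thread.virtual_runtime_ns.saturating_add(slice);
    debug_assert!(thread.virtual_finish_ns >= thread.virtual_runtime_ns);
}

/// Steal: pop from the source, recompute the tag, then insert in tag order.
pub fn steal(source: &mut VecDeque<Thread>, dest: &mut VecDeque<Thread>) -> bool {
    let Some(mut thread) = source.pop_front() else { return false };
    refresh_virtual_finish_ns(&mut thread);
    let at = dest.partition_point(|t| t.virtual_finish_ns <= thread.virtual_finish_ns);
    dest.insert(at, thread);
    true
}
```

In this model a weight change that lands after the source enqueue is picked up by `steal` because the stale tag is overwritten before the ordered insert.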

Sleeper-credit min-vruntime floor

virtual_runtime_ns advances only while a thread runs and is frozen across any block, and a fresh thread starts at virtual_runtime_ns == 0. Without a floor, a thread returning from a long voluntary block (or a just-spawned thread) re-enqueues with a virtual_finish_ns far below every runnable sibling, so pop_first_runnable_local_locked keeps selecting it until its vruntime catches up – an unbounded, sleep-duration-proportional catch-up burst that starves CPU-bound siblings sharing the run queue. This is the classic CFS sleeper problem, bounded by CFS’s place_entity min-vruntime floor.

Process::refresh_thread_virtual_finish_ns takes a min_vruntime_floor_ns and clamps the thread’s virtual_runtime_ns up to it (persisting the clamp, matching place_entity’s write-back of se->vruntime) before deriving the WFQ tag. The scheduler computes the floor in run_queue_min_vruntime_floor_locked as the minimum virtual_runtime_ns among the runnable siblings already competing for the target CPU – every thread on that per-CPU run queue plus the thread currently running there, excluding the enqueuing thread itself and the idle threads – less a bounded sleeper credit (WFQ_SLEEPER_CREDIT_NS, one scheduler tick). A target CPU with no other runnable non-idle work yields a floor of 0 (no clamp), so a thread waking onto an idle CPU is never penalized. Persisting the clamp is required: a non-persisting clamp would leave the woken thread as the run queue’s perpetual minimum, so it would re-qualify for the floor on every requeue and never catch up.

The floor is applied only on the spawn and post-block wake enqueues (enqueue_ready_thread_on_slot_locked and enqueue_unblocked_thread_on_slot_locked pass apply_sleeper_floor = true). The preemption-requeue path (enqueue_ready_thread_locked, the sole caller from the timer preemption arm) and the steal/migration path pass false: a still-runnable thread’s virtual_runtime_ns is legitimate accrued fair-share state – a heavy-weight thread is intentionally below its siblings – so clamping it there would erase its weighted-fair lead. This distinction mirrors CFS applying place_entity only on ENQUEUE_WAKEUP/fork, never on a plain requeue. Steady-state accrual on run is unchanged; only the placement of an under-floor woken or fresh thread moves.
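The floor computation and its conditional application can be sketched as two small functions. The function names and the 10 ms credit are assumptions for the example (the text only says the credit is one scheduler tick). The caller is expected to pass sibling vruntimes that already exclude the enqueuing thread and the idle threads.

```rust
/// Assumed credit of one scheduler tick (10 ms in this sketch).
pub const WFQ_SLEEPER_CREDIT_NS: u64 = 10_000_000;

/// Floor = minimum sibling vruntime minus the sleeper credit.
/// With no competing runnable work the floor is 0, which means no clamp.
pub fn min_vruntime_floor(sibling_vruntimes_ns: &[u64]) -> u64 {
    sibling_vruntimes_ns
        .iter()
        .copied()
        .min()
        .map_or(0, |min| min.saturating_sub(WFQ_SLEEPER_CREDIT_NS))
}

/// Clamps the thread's vruntime up to the floor and persists the result.
/// Preemption requeue and steal pass `apply_sleeper_floor = false`.
pub fn place_vruntime(vruntime_ns: &mut u64, floor_ns: u64, apply_sleeper_floor: bool) {
    if apply_sleeper_floor && *vruntime_ns < floor_ns {
        *vruntime_ns = floor_ns;
    }
}
```

For example, with siblings at 50 ms and 70 ms of vruntime, a fresh thread at 0 is placed at 40 ms, so its catch-up burst is bounded by the credit instead of by how long it slept.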

The synchronous direct-IPC handoff (ready_thread_for_direct_ipc / WakePolicy::DirectTarget) is intentionally outside the floor’s scope: it dispatches the just-unblocked callee straight onto the CPU, bypassing the run queue entirely, as a scheduling-context donation from the caller rather than fair-queue contention. There is no run-queue placement to floor, and the callee’s next ordinary enqueue goes through the floored wake path.

Direct IPC timing and priority inversion

The measure-only scheduler counter direct_ipc_ready_to_select records guest TSC cycles from ready_thread_for_direct_ipc publishing a direct target to choose_next_locked selecting it. It deliberately excludes endpoint work and context restore. The five-run nested-QEMU record, including its retained outlier and environment limits, is in Direct IPC Handoff Timing.

Direct selection is a latency optimization, not a priority-inheritance policy. The reachable inversions and their current bounds are:

Risk and reachable interleaving	Current bound	Bound required before policy expansion
WFQ bypass on direct CALL delivery. In `kernel/src/sched.rs`, a low-weight caller wakes a receiver through `ready_thread_for_direct_ipc` while a higher-weight runnable thread is already queued. `choose_next_locked` consumes the direct target before consulting the WFQ queue, without comparing weight or latency class.	The scheduling bypass lasts for the selected receiver’s current dispatch. With another runnable thread present, the normal periodic timer limits that dispatch to one `TICK_NS` quantum (currently 10 ms) before ordinary WFQ requeue. This bounds the initial scheduler bypass, not endpoint completion.	Direct-handoff admission must compare effective scheduling priority or carry explicit, bounded donation metadata. Any inherited state needs deterministic restoration and a finite chain bound.
FIFO endpoint head-of-line blocking. In `kernel/src/cap/endpoint.rs`, `Endpoint::endpoint_call` appends unmatched calls and `Endpoint::endpoint_recv` removes from the front. A high-priority caller can therefore queue behind a lower-priority call, or wait while the receiver services a lower-priority in-flight call.	`ResourceProfile` limits are clamped by `MAX_QUEUED_CALLS = 32` and `MAX_IN_FLIGHT_CALLS = 32` in the same file, bounding storage and the number of calls. No method deadline or service-time bound limits how long the head call can hold the receiver.	A future policy needs bounded cancellation/deadline semantics and priority-aware admission or ordering while preserving fixed storage bounds.
Budget donation without WFQ inheritance. In `kernel/src/sched.rs`, `donate_scheduling_context_for_endpoint_call` moves an active context’s budget, replenishment, and deadline state to a passive receiver. `Process::install_thread_scheduling_context_for_endpoint_donation` in `kernel/src/process.rs` does not change the receiver’s `Thread` weight or latency class. After the initial direct dispatch, the receiver competes under its own WFQ policy even when the caller has higher weight.	Budget depletion and deadline notifications constrain or report use of the donated ledger, but they do not bound service completion or force the receiver to run ahead of competing work.	Scheduling-context donation metadata must explicitly define whether and how weight/latency are inherited, capped, restored, and composed with budget enforcement.
Receiver-owned context suppresses donation. `Process::install_thread_scheduling_context_for_endpoint_donation` in `kernel/src/process.rs` rejects an already-bound receiver. `donate_scheduling_context_for_endpoint_call` in `kernel/src/sched.rs` then restores the donor binding, so the receiver continues on its own budget and WFQ settings while the caller remains blocked on the endpoint result.	The receiver’s own budget and replenishment period bound its admitted CPU use, but there is no end-to-end completion bound for the caller.	Policy must explicitly choose rejection, ceiling inheritance, or another bounded rule for receiver-owned contexts; silent override is not acceptable.

RETURN does not repeat the direct-target WFQ bypass. CAP_OP_RETURN in kernel/src/cap/ring.rs reaches wake_cap_waiter_if_satisfied; that function calls ready_thread in kernel/src/sched.rs, which places the caller on the ordinary WFQ queue through enqueue_unblocked_thread_on_slot_locked.
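The selection order behind the first table row can be sketched as follows. `Dispatch`, `choose_next`, and the `is_live` callback are hypothetical simplifications. The real choose_next_locked also handles CPU eligibility and handoff slots.

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef {
    pub id: u32,
    pub generation: u32,
}

pub struct Dispatch {
    /// Single direct-IPC preference slot.
    pub direct_ipc_target: Option<ThreadRef>,
    /// Queue ordered ascending by finish tag: (tag, thread).
    pub run_queue: VecDeque<(u64, ThreadRef)>,
}

/// The direct target is consumed before the WFQ queue is consulted, and its
/// weight or latency class is never compared against the queue head.
pub fn choose_next(d: &mut Dispatch, is_live: impl Fn(ThreadRef) -> bool) -> Option<ThreadRef> {
    if let Some(target) = d.direct_ipc_target.take() {
        if is_live(target) {
            return Some(target);
        }
    }
    d.run_queue.pop_front().map(|(_, thread)| thread)
}
```

Because the slot is taken on use, the bypass applies to one selection only. A caller woken by RETURN is queued normally and never enters this slot.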

Two more severe shapes are not reachable in the current implementation. Donor and receiver cannot both spend one donated ledger: the scheduler removes the binding from the donor, marks the donor donation-blocked, and restores or clears the binding through return_scheduling_context_from_endpoint_call and the cancellation paths in kernel/src/sched.rs. Unbounded nested donation is also rejected: Process::take_thread_scheduling_context_for_endpoint_donation in kernel/src/process.rs refuses a binding whose endpoint_donation_return is already populated. The run-scheduling-context smoke proves both donor blocking and deterministic A-to-B-to-C rejection. These fail-closed rules prevent double spending and unbounded donation chains, but do not solve the reachable inversions above.
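The two fail-closed rules can be modeled in a few lines. `Binding`, `Thread`, `DonationError`, and `take_for_donation` are illustrative names, and the `u32` donor id stored in `endpoint_donation_return` is an assumed representation.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DonationError {
    NoBinding,
    AlreadyDonated,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Binding {
    pub remaining_budget_ns: u64,
    /// Populated while this binding is itself a donation awaiting return.
    pub endpoint_donation_return: Option<u32>,
}

pub struct Thread {
    pub binding: Option<Binding>,
    pub donation_blocked: bool,
}

/// Moves the binding out of the donor so donor and receiver never both spend
/// it, and refuses to re-donate a binding that is already a donation.
pub fn take_for_donation(donor: &mut Thread) -> Result<Binding, DonationError> {
    let nested = match donor.binding.as_ref() {
        None => return Err(DonationError::NoBinding),
        Some(binding) => binding.endpoint_donation_return.is_some(),
    };
    if nested {
        return Err(DonationError::AlreadyDonated);
    }
    donor.donation_blocked = true;
    Ok(donor.binding.take().expect("binding presence checked above"))
}
```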

Proof for the sleeper-credit min-vruntime floor: make run-thread-fairness-sleeper-floor (manifest system-thread-fairness-sleeper-floor.cue, demo mode sleeper-floor in demos/thread-fairness/). On a single CPU, two CPU-bound siblings run while a third worker is both freshly spawned and returning from a long Timer.sleep; the demo measures each worker’s runtime over the sleeper’s post-wake window and self-asserts the sleeper stays within a fair-share ceiling while the CPU-bound siblings stay above a starvation floor. Without the min-vruntime floor the sleeper monopolizes the whole window and starves the siblings; the smoke trips on either the monopoly or the starvation line.

Migrations counter shape

ThreadCpuAccounting.migrations is cfg(feature = "measure")-gated and remains a benchmark-only operator-observability counter; it is not load-bearing for ordering and is not exposed through SchedulingPolicyCap.snapshot. Phase D Task 4 moved the increment from the dispatch-time scheduled_measure path to two enqueue-time arms in kernel/src/sched.rs:

Placement-time spread (record_placement_spread_migration_locked) fires from push_reserved_run_queue_locked when the enqueue target slot differs from the thread’s previously dispatched CPU (ThreadCpuAccounting.last_cpu). A thread that has never been dispatched (last_cpu == None) does not register a migration on first publish; otherwise placement spread is counted exactly once per enqueue.
Steal (record_steal_migration_locked) fires from steal_from_sibling_queues_locked after the source-queue removal and before the destination-queue insert. The steal scan skips the destination slot, so the counter increments unconditionally each time the steal arm is reached.

scheduled_measure still maintains last_cpu so the placement-spread check has the previous CPU available; only the migrations++ moved. The pre-collapse counter shape is preserved in steady state – a thread that runs on a different CPU than its previous run still records exactly one migration – but the increment is now attributed to the enqueue decision (placement spread or steal) rather than the dispatch that follows it.
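The enqueue-time counter shape can be sketched as below. The struct mirrors only the two fields discussed here, and the free functions are simplified stand-ins for the `_locked` helpers named above.

```rust
#[derive(Default)]
pub struct ThreadCpuAccounting {
    /// CPU of the previous dispatch; None until the first dispatch.
    pub last_cpu: Option<usize>,
    pub migrations: u64,
}

/// Placement-spread arm: counts only when the enqueue target differs from the
/// previously dispatched CPU. A never-dispatched thread records nothing.
pub fn record_placement_spread(acct: &mut ThreadCpuAccounting, target_cpu: usize) {
    if let Some(last) = acct.last_cpu {
        if last != target_cpu {
            acct.migrations += 1;
        }
    }
}

/// Steal arm: the scan skips the destination slot, so every steal counts.
pub fn record_steal(acct: &mut ThreadCpuAccounting) {
    acct.migrations += 1;
}

/// Dispatch still maintains last_cpu but no longer increments migrations.
pub fn on_dispatch(acct: &mut ThreadCpuAccounting, cpu: usize) {
    acct.last_cpu = Some(cpu);
}
```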

The aggregate process-wide thread_placement counter family in kernel/src/measure.rs (migrations, migration_to_cpu0..3, consumed by tools/qemu-thread-scale-harness.sh) is a separate measurement device. It is incremented from account_thread_selected_locked at dispatch time and continues to observe “thread ran on a different CPU than its previously dispatched CPU” rather than the per-thread Task 4 enqueue-time shape, so the thread-scale harness regex does not need to change. The per-thread ThreadCpuAccounting.migrations field and the aggregate thread_placement counter intentionally measure different events at different points in the scheduling pipeline; both stay behind cfg(feature = "measure").

Phase H: per-thread saturation status surface

The Phase H AutoNoHz placement heuristic, implemented by demos/autonohz-policy-daemon, reads per-thread saturation observation in the normal dispatch build, not only under cfg(feature = "measure"). The non-measure per-thread saturation status surface (2026-05-30) promoted the inputs it consumes into ordinary ThreadCpuAccounting state and exports them through SchedulingPolicyCap.snapshot @2:

voluntary_blocks and preemptions moved out of cfg(feature = "measure"). They are charged at the same sites as before – voluntary_blocks when a thread blocks itself (cap_enter wait, park, endpoint scheduling-context donation) and preemptions when the timer requeues a still-runnable running thread – so the measure build’s counts are unchanged; only the cfg gate was removed. A low voluntary_blocks count distinguishes a CPU-saturating thread from an IPC/IO-bound one.
runnable_accumulated_ns is a new always-built cumulative counter of runnable-but-not-running time. It is charged at the scheduler-lock-held enqueue/select boundary: push_reserved_run_queue_locked stamps a monotonic runnable_since_ns when a thread is published to a per-CPU run queue without being selected (idempotent across re-publish, so the whole runnable span is counted once), and account_thread_scheduled accumulates the monotonic delta and clears the stamp when the thread is next selected. The stamp/accumulate pair nets to zero for a thread selected at the same monotonic instant it becomes runnable. The clock is monotonic_ns() only (no wall-clock, no rewind), matching charge_runtime’s discipline, and the stamp respects the runnable-ownership rules above (a thread holds a live stamp only between enqueue and selection).
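The stamp/accumulate pair for runnable_accumulated_ns can be sketched as a small state machine. `RunnableAccounting` and its method names are hypothetical. Timestamps are passed in so the example does not depend on a clock source.

```rust
#[derive(Default)]
pub struct RunnableAccounting {
    /// Live only between enqueue and selection.
    pub runnable_since_ns: Option<u64>,
    pub runnable_accumulated_ns: u64,
}

impl RunnableAccounting {
    /// Enqueue side: idempotent, so a re-publish keeps the original stamp and
    /// the whole runnable span is counted once.
    pub fn on_published(&mut self, now_ns: u64) {
        if self.runnable_since_ns.is_none() {
            self.runnable_since_ns = Some(now_ns);
        }
    }

    /// Select side: accumulate the monotonic delta and clear the stamp.
    pub fn on_selected(&mut self, now_ns: u64) {
        if let Some(since) = self.runnable_since_ns.take() {
            self.runnable_accumulated_ns += now_ns.saturating_sub(since);
        }
    }
}
```

A thread selected at the instant it becomes runnable adds zero, and a selection without a live stamp adds nothing.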

migrations stays measure-gated; it is a placement diagnostic, not a saturation input. The surface exports raw cumulative counters only – windowing, smoothing, and the saturation decision are userspace policy-daemon choices, never kernel state (see docs/proposals/tickless-realtime-scheduling-proposal.md). Proof: make run-thread-fairness reads the extended snapshot on the weighted workers and asserts the CPU-bound hog reports high runtime_ns with voluntary_blocks at or near zero while at least one preempted lower-weight worker reports nonzero preemptions and runnable_accumulated_ns.

The daemon reads bounded per-target profiles from initConfig.init.autonohzPolicy, validates each declared pool-grant binding, and applies the configured observation window, smoothing count, saturation thresholds, lease bounds, and renewal budget. It drives multiple account/pool leases through issuance, renewal, stopped-renewal expiry, and explicit revocation while the kernel remains the admission, accounting, timeout, and periodic-fallback authority. make run-scheduler-autonohz-policy-service proves that lifecycle. Cross-process observation remains future because SchedulingPolicyCap.snapshot is caller-thread scoped.

The daemon admits finite lease lifetimes through a 119-second policy ceiling. After a stop-renewal decision, the managed worker remains command-idle under the bounded lease while the daemon polls authoritative kernel state during later target management and candidate selection. A terminal observation before the deadline triggers immediate worker, context, and candidate-ledger cleanup as an ended lease. At or after the deadline, userspace records only that the deadline was reached; the kernel release record supplies the exact termination cause. The final concurrent field is an observation: a valid policy whose targets are all denied still completes cleanly with concurrent=false; the focused QEMU harness, rather than the reusable runtime, requires the fixture’s two-live-lease window.

Weight-change-while-enqueued contract

SchedulingPolicyCap.setWeight writes the validated weight directly to Thread.weight through Process::set_thread_weight and does not clear Thread.virtual_finish_ns. A weight change observed while the thread is blocked, running, or already queued takes effect on the next dequeue and re-enqueue because every enqueue site refreshes virtual_finish_ns from current weight/latency_class/virtual_runtime_ns. The kernel proves the contract two ways:

By construction. Process::refresh_thread_virtual_finish_ns reads each input field fresh on every call; there is no cached derivation between enqueues. The function bears a doc-comment asserting the contract.
By debug_assert!. Inside the same function, a debug assertion verifies that the recomputed virtual_finish_ns is at or beyond the current virtual_runtime_ns – a future deadline, never a past one. The assertion catches any future regression where the formula could underflow or where a stale cache could drift below the current vruntime.

The focused QEMU smoke that drives setWeight and verifies the post-block dispatch picks up the new weight landed under Phase D Task 5: make run-thread-fairness-weight-change (manifest system-thread-fairness-weight-change.cue, demo demos/thread-fairness/). Two competing child threads run a fixed wallclock window: a baseline worker stays at DEFAULT_WEIGHT, while a heavy worker self-calls SchedulingPolicyCap.setWeight(weight=128) and then blocks on Timer.sleep so it leaves the run queue before the contention window opens. Each worker snapshots its scheduler state at wake and at window end via SchedulingPolicyCap.snapshot, and the parent verifies three independent properties: (1) the heavy snapshot reads weight == 128 and the baseline snapshot reads weight == DEFAULT_WEIGHT; (2) the observed runtime_ns ratio matches the weight ratio inside a configured tolerance; (3) the heavy worker’s virtual_runtime_ns advances at roughly half the rate of its runtime_ns (vruntime/runtime ~= 0.5 for weight=128, ~= 1.0 for DEFAULT_WEIGHT). A scheduler that re-enqueued or dispatched the heavy worker using a stale virtual_finish_ns derived from DEFAULT_WEIGHT would not show the weight-proportional CPU share, and a scheduler that held a stale weight inside charge_runtime would yield heavy vruntime/runtime ~= 1.0 instead of ~= 0.5; the smoke trips on either regression. The capability is bound to CapCallContext::caller_thread (Phase D Task 2 decision), so same-thread self-mutation is the only authorized shape for this proof; cross-thread weight authority remains a Phase H privileged scheduler-policy service concern.

The thread-scale benchmark was repaired before accepting the milestone. The old 1 MiB/spinning-parent shape was not a valid four-core reference because the matching Linux pthread baseline also failed at four workers. The accepted benchmark shape uses a blocking parent join, 262,144 blocks (16 MiB), and work_rounds=64. The formal accepted-evidence pair is the capos-bench 2026-05-02 21:38 UTC 5-run pair pinned to physical-core logical CPUs 0,1,2,3 against main commit 374f8556: capOS work 1.883x and total 1.787x clear the configured 1.6x gates, while the matching Linux pthread baseline records 1.988x/1.987x.

Its 1-to-4 row became the diagnostic that justified Phase D’s fair-share enqueue policy: capOS 1.566x/1.538x versus Linux 3.963x/3.858x, a clear bottleneck in the then-current single-global-queue scheduler. Phase D’s WFQ evidence on 2026-05-10 manually accepted the recorded 1-to-4 diagnostic with capOS 3.088x/2.700x and matching Linux 3.974x/3.850x on the same host/CPU pin set. The harness still enforced only the configured 1-to-2 work/total speedup gates.

Historical pre-collapse 1-to-2 (1.828x/1.687x) and the post-collapse 3-run diagnostic on capos-bench 2026-05-02 10:42 UTC (1.890x/1.792x, 1.504x/1.436x) remain in docs/benchmarks.md for reference. Four-worker capOS scaling was a follow-up rather than a completed claim under the pre-collapse model: the unsuppressed diagnostic recorded 1-to-4 work/total speedups 3.029x/2.386x, while suppressing scheduler switch logs recorded 3.272x/2.303x; remaining guest-measure evidence pointed at global Scheduler lock contention plus exit/join/block/schedule overhead, and normal scheduler-owned execution is still capped at temporary CPU slots 0-3.

Each process currently owns one or more Thread records; each thread owns its saved CPU context, kernel stack, FS base, block state, and – since Phase D Task 2 – the WFQ ordering inputs weight: u16, latency_class: LatencyClass, and virtual_finish_ns: u64. The Phase D constants in capos-abi/src/scheduler.rs set the defaults weight = DEFAULT_WEIGHT and latency_class = LatencyClass::Normal, so unmodified workloads observe no behavior change versus the pre-Phase-D scheduler. virtual_finish_ns is recomputed on every enqueue (Task 2 ships the derivation; Task 3 will consume it for ordered insertion) and is not meaningful while the thread is blocked.

Phase D Task 2 split the per-thread CPU accounting record so the WFQ-load-bearing fields are available in the normal qemu build: runtime_ns, virtual_runtime_ns, and last_started_ns are unconditional; context_switches, preemptions, voluntary_blocks, migrations, last_cpu, and the *_runtime_stable_observed and blocked/exited bookkeeping stay behind the measure feature because they are pure operator-observability counters that do not participate in dispatch ordering and need a separate operator snapshot path. runtime_ns advances 1:1 with elapsed CPU time, while virtual_runtime_ns advances by elapsed_ns * REFERENCE_WEIGHT / weight so per-thread weight changes the cumulative WFQ share rather than only the enqueue tag. The runtime-charge path is invoked when a current thread stops running through timer preemption, blocking cap_enter or park, thread/process exit, or direct switch/handoff paths that select another current thread; the wrapping helpers in kernel/src/sched.rs route through Process::charge_thread_runtime / Process::account_thread_scheduled unconditionally now.
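The charge rule elapsed_ns * REFERENCE_WEIGHT / weight can be worked through in a short sketch. `REFERENCE_WEIGHT = 64` is an assumption inferred from the weight-change smoke (weight 128 yields a vruntime/runtime ratio near 0.5, DEFAULT_WEIGHT near 1.0). The widening to `u128` and the saturation are choices made for this example.

```rust
/// Assumed value; the real constant lives in capos-abi/src/scheduler.rs.
pub const REFERENCE_WEIGHT: u64 = 64;

pub struct CpuAccounting {
    pub runtime_ns: u64,
    pub virtual_runtime_ns: u64,
}

/// Charges one run interval: runtime advances 1:1, vruntime advances scaled
/// by REFERENCE_WEIGHT / weight.
pub fn charge_runtime(acct: &mut CpuAccounting, elapsed_ns: u64, weight: u16) {
    let weight = u128::from(weight.max(1));
    acct.runtime_ns = acct.runtime_ns.saturating_add(elapsed_ns);
    // Widen so the multiplication cannot overflow before the division.
    let scaled = u128::from(elapsed_ns) * u128::from(REFERENCE_WEIGHT) / weight;
    let delta = u64::try_from(scaled).unwrap_or(u64::MAX);
    acct.virtual_runtime_ns = acct.virtual_runtime_ns.saturating_add(delta);
}
```

Under these assumptions a 10 ms quantum charges 10 ms of vruntime at weight 64 and 5 ms at weight 128, so the heavier thread falls behind its siblings in vruntime half as fast and is selected roughly twice as often.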

The SchedulingPolicyCap cap surface mutates these per-thread fields through the caller-thread fallback binding selected in Phase D Task 2: every method (setWeight, setLatencyClass, snapshot) routes to CapCallContext::caller_thread, so a holder can only mutate or observe its own running thread. Cross-thread or cross-process authority is reserved for the Phase H privileged scheduler policy service. The SchedulingPolicyCap.snapshot reply intentionally exposes only the four fields promoted out of the measure feature gate; context_switches/preemptions/voluntary_blocks/migrations are benchmark-only and a future operator-observability slice may add them through a separate cap.

The BSP scheduler tick normally arrives through the local APIC timer on vector 48 with LAPIC EOI after calibrating the LAPIC initial count against PIT channel 2; if LAPIC setup or calibration is unavailable, the kernel falls back to the legacy PIT/PIC IRQ0 path on vector 32. On each user-mode timer tick (kernel-mode ticks bypass the scheduler entirely through kernel_timer_interrupt_handler, as described under Design), the kernel wakes timed-out or satisfied cap_enter and park waiters, processes the current thread’s ring endpoint in timer mode, saves the current thread context, picks the next ready thread from the single global run queue (the earlier per-CPU local-first / steal scan was retired with the queue collapse), switches CR3 when needed, updates the current CPU’s kernel-entry stack through the per-CPU hook, restores FS base, mirrors the next ThreadRef into the current PerCpu, and returns to the next user context.

When APs are online and their LAPIC timers start, scheduler CPU slots 0-3 can temporarily own scheduler/user execution. The earlier AP-owner proof kept the BSP in kernel idle; the current same-process scaling slice allows sibling threads with distinct ring endpoints to run on different scheduler CPUs while processes that hold broad launch/authority caps or live endpoint objects remain pinned to the legacy single-owner CPU. Additional APs beyond CPU 3 stay in kernel idle until a later scheduler-owner policy replaces the temporary CPU mask. The runnable queues are a per-CPU array of VecDeque<ThreadRef> shared by the scheduler-owned CPUs under the global scheduler lock and ordered ascending by virtual_finish_ns; process/thread metadata remains shared under that lock. A bounded steal path migrates the most overdue sibling candidate (each sibling queue’s first entry that the destination CPU considers Runnable) when a CPU’s local queue has no runnable entry.
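The candidate choice in the bounded steal path can be sketched as a pure function. `pick_steal_source`, the `(tag, thread id)` queue entries, and the `runnable_on_dest` callback are assumptions for the example. The real path works on ThreadRef entries under the scheduler lock.

```rust
use std::collections::VecDeque;

/// Returns (source cpu, index in that queue) of the most overdue candidate.
/// Each sibling contributes its first entry the destination considers
/// runnable, and the smallest finish tag among those candidates wins.
pub fn pick_steal_source(
    queues: &[VecDeque<(u64, u32)>],
    dest: usize,
    runnable_on_dest: impl Fn(u32) -> bool,
) -> Option<(usize, usize)> {
    queues
        .iter()
        .enumerate()
        .filter(|(cpu, _)| *cpu != dest) // never steal from the destination itself
        .filter_map(|(cpu, queue)| {
            queue
                .iter()
                .position(|&(_, thread)| runnable_on_dest(thread))
                .map(|index| (queue[index].0, cpu, index))
        })
        .min_by_key(|&(tag, _, _)| tag)
        .map(|(_, cpu, index)| (cpu, index))
}
```

The scan inspects at most one candidate per sibling queue, which is what keeps the steal bounded in this model.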

Syscall entry initializes kernel GS with swapgs, saves the user RSP through the GS-relative PerCpu.user_rsp slot, and switches to the GS-relative PerCpu.kernel_rsp slot. Normal syscall returns swap back before sysretq. Blocking cap_enter, process exit, and ThreadControl.exitThread paths that leave through scheduler iretq restore use restore_context_after_syscall so GS ownership is returned to userspace before the next user context resumes.

Timer.sleep records a bounded scheduler waiter keyed by caller ThreadRef, user data, and an absolute monotonic deadline_ns. Due sleeps validate the thread generation, post an empty completion directly to the caller’s CQ, and then flow through the same blocked cap_enter wake scan as other completions. Each process has a separate sleep waiter quota, so one Timer holder cannot fill the global sleep queue by itself.
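The bounded queue with a per-process quota can be sketched as follows. All names are hypothetical, the waiter is keyed by a bare pid instead of a generation-checked ThreadRef, and this model uses heap-backed collections where the kernel uses bounded storage.

```rust
use std::collections::BTreeMap;

pub struct SleepWaiter {
    pub pid: u32,
    pub user_data: u64,
    /// Absolute monotonic deadline.
    pub deadline_ns: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SleepError {
    ProcessQuota,
    QueueFull,
}

pub struct SleepQueue {
    pub waiters: Vec<SleepWaiter>,
    pub global_capacity: usize,
    pub per_process_quota: usize,
    pub per_process_count: BTreeMap<u32, usize>,
}

impl SleepQueue {
    /// Admits a waiter only if both the process quota and global bound allow it.
    pub fn register(&mut self, waiter: SleepWaiter) -> Result<(), SleepError> {
        let used = self.per_process_count.get(&waiter.pid).copied().unwrap_or(0);
        if used >= self.per_process_quota {
            return Err(SleepError::ProcessQuota);
        }
        if self.waiters.len() >= self.global_capacity {
            return Err(SleepError::QueueFull);
        }
        self.per_process_count.insert(waiter.pid, used + 1);
        self.waiters.push(waiter);
        Ok(())
    }

    /// Removes and returns waiters whose deadline has passed, releasing quota.
    pub fn take_due(&mut self, now_ns: u64) -> Vec<SleepWaiter> {
        let (due, pending): (Vec<_>, Vec<_>) = std::mem::take(&mut self.waiters)
            .into_iter()
            .partition(|w| w.deadline_ns <= now_ns);
        self.waiters = pending;
        for waiter in &due {
            if let Some(count) = self.per_process_count.get_mut(&waiter.pid) {
                *count = count.saturating_sub(1);
            }
        }
        due
    }
}
```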

ThreadControl.setFsBase validates runtime-provided FS bases as user-canonical addresses, updates the caller thread’s saved FS base, and writes the CPU FS base immediately when the caller is the running thread. There is no process-global FS base; context switch treats FS base as per-thread state.
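A minimal sketch of the validation and update order follows. It assumes 48-bit virtual addresses with user space in the lower canonical half. `USER_CANONICAL_END`, `FsBaseError`, and the `write_cpu_fs_base` callback (standing in for the privileged FS-base write) are hypothetical, and the kernel's actual bound may be tighter.

```rust
/// Exclusive end of the lower canonical half with 48-bit virtual addresses
/// (an assumption of this sketch).
pub const USER_CANONICAL_END: u64 = 0x0000_8000_0000_0000;

#[derive(Debug, PartialEq, Eq)]
pub enum FsBaseError {
    NonCanonicalUserAddress,
}

pub struct Thread {
    /// Per-thread state; restored on context switch.
    pub saved_fs_base: u64,
}

pub fn set_fs_base(
    thread: &mut Thread,
    fs_base: u64,
    caller_is_running: bool,
    mut write_cpu_fs_base: impl FnMut(u64),
) -> Result<(), FsBaseError> {
    // Reject before touching any state so a bad value leaves no trace.
    if fs_base >= USER_CANONICAL_END {
        return Err(FsBaseError::NonCanonicalUserAddress);
    }
    thread.saved_fs_base = fs_base;
    if caller_is_running {
        write_cpu_fs_base(fs_base);
    }
    Ok(())
}
```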

The initial thread still uses the compatibility ring at RING_VADDR, while each spawned child thread receives a kernel-chosen ring mapping in the process ring arena. Run queues, per-CPU current, direct IPC handoff, Timer sleep waiters, process/terminal waiters, endpoint caller/receiver records, and deferred cancellation CQEs store generation-checked ThreadRef values and route completions to the target thread’s ring endpoint. Process-owned thread and kernel-stack ledger limits are enforced by ThreadSpawner.create before additional thread records become runnable. The frozen contract is in In-Process Threading. Park wait uses a separate Blocked(Park { ... }) reason and park timeout/wake completions use reserved CQE credits before marking generation-checked waiter threads runnable. The authority and ABI contract is in Park Authority.

cap_enter(min_complete, timeout_ns) processes pending SQEs immediately. If the requested completion count is not available and the timeout permits blocking, the current thread enters Blocked(CapEnter { ... }) and the syscall entry path switches to another runnable thread.

The LAPIC user-timer path enters sched::schedule() unconditionally on every tick. An earlier slice carried a bounded user-mode continuation fast path with a per-CPU one-skip budget and a release/acquire slow-path-required summary; that path has been retired (see docs/backlog/scheduler-evolution.md “Cleanup: Retire Benchmark-Driven Scaffolding Before Phase D”). The fast path saved at most one scheduler entry every other tick on an uncontended single-CPU-effective scheduler while paying for shadow-state publication on every slow-path exit, so the simpler always-schedule shape is preferred until a future Phase D or Phase F slice ships an evidence pair where the fast path measurably reduces scheduler-lock hold time on a contended SMP run.

When endpoint delivery satisfies a blocked server RECV, the scheduler can set a direct IPC target. The next scheduling decision runs that server before ordinary WFQ-queued work when it is ready and its ThreadRef generation still matches the captured direct target. When the direct slot is unavailable, endpoint completions fall back to the queued path with WakePolicy::QueueCpu(slot) targeting the resolved eligible per-CPU queue, so the wake scan probes the placed CPU first. If an ineligible CPU consumes the global direct preference, its RetryLater fallback uses the same placement resolver and wakes the resolved owner CPU.

Design

The implementation keeps ring dispatch outside the global scheduler lock. Timer dispatch extracts ring/cap/scratch handles, releases the scheduler lock, processes bounded SQEs, then reacquires the scheduler lock to choose the next thread. This prevents Cap’n Proto decode, serial output, and capability method bodies from running under the global scheduler lock.

There is no longer a slow-path-required summary or a per-CPU skip budget for the user-mode timer path. Every user-mode LAPIC timer tick enters sched::schedule(), which services run-queue entries, direct IPC targets, deferred process termination/drop and thread-stack cleanup, Timer sleep waiters, and blocked threads with timer-backed cap_enter or Park timeouts under the scheduler lock. Those timeout paths compare absolute monotonic deadlines, but periodic ticks still decide when the checks run. Ring SQEs and ordinary cap waiters run on the same per-tick cadence. Kernel-mode timer ticks (e.g., on AP cores parked in the kernel idle loop) still go through kernel_timer_interrupt_handler, which sends EOI without entering the scheduler. The shared advance_bsp_tick helper still increments the compatibility TICK_COUNT only on CPU 0; normal runtime accounting and timeout comparisons use monotonic_ns() instead. Future per-CPU fair-share slices may reintroduce a continuation path under explicit Phase D or Phase F authority; until then the always-schedule shape keeps the scheduler’s authority over thread metadata and runnable ownership single-source.

The runnable queues keep a single-owner contract behind the global scheduler lock. A live generation-checked ThreadRef may have at most one runnable dispatch owner across per-CPU current/handoff_current slots, the per-CPU run queues, and the single direct_ipc_target preference slot. Blocked waiters, sleep waiters, park waiters, endpoint state, process waiters, and join waiters are not runnable owners; they may make a thread ready only after liveness and generation checks succeed.
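The single-owner contract can be expressed as a checkable count over the dispatch slots. `Dispatch` and `runnable_owner_count` are simplified, hypothetical shapes. The invariant is that the count never exceeds one for any live ThreadRef.

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef {
    pub id: u32,
    pub generation: u32,
}

pub struct Dispatch {
    pub current: Vec<Option<ThreadRef>>,
    pub handoff_current: Vec<Option<ThreadRef>>,
    pub run_queues: Vec<VecDeque<ThreadRef>>,
    pub direct_ipc_target: Option<ThreadRef>,
}

/// Counts runnable dispatch owners of `thread` across every owner slot.
pub fn runnable_owner_count(d: &Dispatch, thread: ThreadRef) -> usize {
    let slots = d
        .current
        .iter()
        .chain(d.handoff_current.iter())
        .chain(std::iter::once(&d.direct_ipc_target))
        .filter(|slot| **slot == Some(thread))
        .count();
    let queued = d
        .run_queues
        .iter()
        .flatten()
        .filter(|queued| **queued == thread)
        .count();
    slots + queued
}
```

A debug build could assert `runnable_owner_count(..) <= 1` after each publish. Waiter lists are deliberately absent from the count because they are not runnable owners.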

Migration between per-CPU queues is represented as a scheduler-lock-contained transfer, not as a second published owner. The source owner is removed or popped first and the ThreadRef is then inserted in the destination queue at the position determined by a freshly recomputed virtual_finish_ns, or selected as the next running thread. virtual_runtime_ns travels with the thread; virtual_finish_ns is recomputed at every enqueue and never carried as committed state, so weight or class mutations applied while the thread was blocked take effect on the next dequeue and re-enqueue. Retry paths requeue the candidate after dropping duplicate queued copies and resolving the target queue against static CPU eligibility. Direct IPC keeps its preference slot only while the target remains live and runnable; if the direct target cannot run immediately, it falls back through the normal queued-owner path on an eligible per-CPU queue.

Idle-to-runnable wake targeting reuses the same ownership boundary. A thread that becomes ready through endpoint completion, timer sleep, park wake, process wait, or thread join is pushed to the placement target’s per-CPU run queue, and wake_idle_scheduler_cpus_locked first probes the placement target when the policy is QueueCpu, then walks eligible idle scheduler CPUs to wake the first that accepts a fresh reschedule IPI; CPUs that already have a pending IPI (or that fail LAPIC delivery) are skipped without breaking the scan, so a burst of ready work cross-wakes more than one neighbor instead of stranding the rest behind one already-targeted CPU. Direct IPC uses the same path. Measurement builds expose aggregate and per-phase counters for wake scans, eligible idle CPUs, targeted CPUs, IPIs sent, already-pending IPI skips, not-ready target skips, missing LAPIC targets, and send failures.
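The scan order and skip rules above can be illustrated with a small model. This is a sketch under stated assumptions: CpuWakeState, WakeStats, and wake_one_idle_cpu are hypothetical names, the lapic_ok flag stands in for real LAPIC delivery, and the scan visits non-target CPUs in index order, which the text does not specify.

```rust
/// Hypothetical per-CPU view used by the wake scan.
#[derive(Clone, Copy, Debug)]
pub struct CpuWakeState {
    pub eligible: bool,    // static CPU eligibility for the woken thread
    pub idle: bool,        // CPU is parked on its idle thread
    pub ipi_pending: bool, // a reschedule IPI is already in flight
    pub lapic_ok: bool,    // stand-in for LAPIC delivery succeeding
}

/// Two of the counters a measurement build might keep (names assumed).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct WakeStats {
    pub pending_skips: u32,
    pub send_failures: u32,
}

/// Returns the CPU that accepted a fresh reschedule IPI, if any.
pub fn wake_one_idle_cpu(
    cpus: &mut [CpuWakeState],
    placement_target: Option<usize>,
    stats: &mut WakeStats,
) -> Option<usize> {
    // Probe the placement target first, then the remaining CPUs in index order.
    let order = placement_target
        .into_iter()
        .chain((0..cpus.len()).filter(move |c| Some(*c) != placement_target));
    for cpu in order {
        let Some(state) = cpus.get_mut(cpu) else { continue };
        if !state.eligible || !state.idle {
            continue;
        }
        if state.ipi_pending {
            stats.pending_skips += 1; // already targeted: keep scanning
            continue;
        }
        if !state.lapic_ok {
            stats.send_failures += 1; // delivery failed: keep scanning
            continue;
        }
        state.ipi_pending = true; // models sending the reschedule IPI
        return Some(cpu);
    }
    None
}
```

Because a pending or failed CPU only increments a counter and continues, repeated calls during a burst of wakeups reach different idle neighbors instead of re-targeting the same one.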

Each per-CPU run queue is reserved up to the live runnable-capable thread count before publication; the shared live reservation count is released on process/thread exit or pre-publication rollback. Reserving each queue to the full live-thread count is required because the bounded steal path may migrate every live thread into a single sibling queue between two scheduler passes. Timer preemption, unblock, direct-IPC fallback, requeue, and steal-requeue paths therefore must not allocate while the thread is already live.

Process and thread exit cleanup proves the removal side of that ownership contract at the cleanup site. After removing queued owners and clearing a matching direct IPC target, the scheduler lock remains held while the kernel scans every per-CPU runnable queue and the direct target slot; any stale exiting process or thread reference is a kernel assertion failure. The focused spawn smoke asserts the corresponding serial proof markers on exercised process and thread exit paths.

The Phase C migration order is constrained by hardware state, not only by scheduler data structures. The first gate moved syscall entry/exit off BSP-symbol-relative PerCpu fields and onto KernelGsBase/swapgs on user syscall paths, including blocking cap_enter, exit, and ThreadControl.exitThread paths that leave through iretq rather than the normal sysretq epilogue. The second gate added xAPIC initialization, a PIT-calibrated BSP LAPIC timer tick, LAPIC EOI routing, AP LAPIC initialization, a LAPIC spurious-vector handler, and an IPI vector plus bounded vector-49-only fixed IPI send primitive. The third gate added address-space resident CPU masks, per-CPU pending full-TLB flush generations, completion waits, and a vector-49 TLB shootdown handler for user page-table map, unmap, and protect. The fourth gate split current-thread tracking into per-CPU slots, registered AP PerCpu records for current-thread and syscall stack mirrors, added AP TSS.RSP0 updates on context switches, and handed the single scheduler-owner role to AP cpu=1 when it is online with a programmed LAPIC timer.

The LAPIC slice replaces the BSP-oriented PIT/PIC scheduler tick on supported QEMU and hardware paths. kernel/src/arch/x86_64/idt.rs keeps vector 32 for the PIT/PIC fallback, reserves vector 48 for LAPIC timer delivery plus vector 49 for cross-CPU requests, and installs vector 255 for LAPIC spurious interrupts. pic.rs can remap and mask all legacy IRQs once LAPIC ticks are active, and context.rs sends LAPIC EOI or PIC EOI according to the active timer source. The IPI vector now handles TLB shootdown requests and bounded reschedule requests for AP idle-to-runnable handoff.

The TLB slice wraps user page-table mutations that can affect an address space resident on another CPU. AddressSpace::map, AddressSpace::unmap, and AddressSpace::protect still perform the local x86_64 mapper flush, then call the architecture shootdown helper with the address space’s resident CPU mask. The helper records pending full-TLB flush generations for online resident CPUs other than the caller, sends vector-49 IPIs, and returns a completion token. Capability handlers drop the address-space guard and enqueue completion work; cap_enter and timer polling drain that queue after ring dispatch releases the cap-table and scratch locks. This keeps a remote syscall that is contending on the same process locks from blocking maskable IPI delivery forever. Capability handlers reserve fixed-size deferred queue slots before page-table mutation, so full queues fail closed as capability overload errors instead of surfacing after rollback, unmap, or protect has already changed state. Drains flush the current CPU before waiting so a CPU that is itself in the target mask cannot wait on its own pending generation. Target CPUs drain the generation in the IPI handler, at syscall entry, or before returning to userspace from syscall, timer, and scheduler restore paths. Generation counters avoid losing overlapping shootdowns while a target CPU is already draining a prior request. This relies on kernel user-buffer access continuing through address-space-locked HHDM copy/read helpers rather than raw user virtual addresses while a delayed flush generation exists. Callers include VirtualMemoryCap dispatch through parse_map, parse_unmap, and parse_protect, plus MemoryObjectCap::{map,unmap,protect} in kernel/src/cap/frame_alloc.rs. Scheduler CR3 handoff now marks the selected address space resident on the current CPU, including AP cpu=1 during the AP scheduler-owner proof.
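The reason generation counters tolerate overlapping shootdowns can be shown with a small model. This is a sketch only: FlushSlot and its methods are hypothetical names, the actual flush is elided, and the real kernel's per-CPU layout and memory orderings may differ.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-CPU flush bookkeeping for one target CPU.
pub struct FlushSlot {
    requested: AtomicU64, // highest generation any requester has asked for
    completed: AtomicU64, // highest generation this CPU has flushed through
}

impl FlushSlot {
    pub const fn new() -> Self {
        Self {
            requested: AtomicU64::new(0),
            completed: AtomicU64::new(0),
        }
    }

    /// Requester side: publish a new generation and return it as the token.
    pub fn request(&self) -> u64 {
        self.requested.fetch_add(1, Ordering::AcqRel) + 1
    }

    /// Target side: read the requested generation, flush, then record it.
    /// A request that arrives after the read gets a higher generation and
    /// therefore stays incomplete until the next drain.
    pub fn drain(&self) -> u64 {
        let seen = self.requested.load(Ordering::Acquire);
        // A real kernel would perform the full TLB flush here.
        self.completed.fetch_max(seen, Ordering::AcqRel);
        seen
    }

    /// Waiter side: a token is complete once the target flushed through it.
    pub fn is_complete(&self, token: u64) -> bool {
        self.completed.load(Ordering::Acquire) >= token
    }
}
```

One drain can satisfy several outstanding tokens at once, and a token issued during or after a drain is never reported complete by that drain, which is the property a simple pending flag would not give.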

Idle paths

There are two distinct idle paths, and both run genuine CPL0 (kernel-mode) idle. There is no user-mode idle process: when no real work is runnable a CPU runs the kernel idle code at CPL0 on the kernel PML4. The two paths differ only in how the CPU got there.

The cooperative CPL0 kernel-mode idle path is the boot/AP path. start (BSP), start_ap (APs), and the start_current_cpu loop call next_start_context; when that returns no real runnable work they fall into idle_current_cpu_once, which hlts at CPL0 on the per-CPU kernel stack with interrupts enabled (no CpuContext, no restore_context — the same way start_current_cpu itself runs). A kernel-CPL timer tick or reschedule IPI taken during that hlt runs the kernel-mode handler (kernel_timer_interrupt_handler / handle_reschedule_ipi, both of which call nohz_recheck), so the nohz one-shot deadline is preserved and re-armed across the hlt; control then returns to the loop, which re-checks for work. idle_current_cpu_once increments the KERNEL_IDLE_HLT_ENTRIES counter and emits a bounded cpu-isolation: kernel-idle hlt cpu=… idle_path=cooperative-cpl0 … nohz_active=… timer_source=… log line so this path is observable from the kernel log; the run-scheduler-cpu-isolation-lease smoke asserts it is reached. Once any dispatch path restore_contexts into a real thread, the start_current_cpu frame is abandoned.

The steady-state CPL0 idle-thread path is reached from the four interrupt/syscall-return dispatch call sites — schedule() (timer), capos_block_current_syscall, exit_current, and exit_current_thread. When choose_next_locked falls through to this CPU’s idle thread, each site builds the dispatch tuple from the per-CPU CPL0 idle-thread context. The dispatch call sites hand a CpuContext to assembly that restore_contexts (or, for the timer path, return a context pointer plus a CR3 the timer handler loads), so they need a schedulable context when no real work is runnable; the CPL0 idle context is that context.

CPL0 idle-thread context infrastructure. arch::smp::init_idle_kernel_stacks allocates one dedicated CPL0 idle kernel stack per scheduler CPU slot from fresh contiguous frame ranges, so they do not overlap the boot kernel stacks, the per-thread kernel stacks, or the IST slots. CpuContext::new_cpl0_idle builds a kernel-shaped context (kernel-code/kernel-data selectors, rip = kernel_idle_entry, rsp into the idle kernel stack). sched::sched_init, called from kmain, constructs and stores one CpuContext per CPU slot in CPL0_IDLE_CONTEXTS and then calls register_idle_process_locked to seed the slot-0 synthetic idle Process record before the scheduler runs (this keeps the BSP idle process’s low PID and the init-process PID ordering stable); the remaining per-CPU slots are registered lazily by current_cpu_idle_thread_locked the first time their CPU reaches idle. sched_init panics on OOM, as does the lazy path: the CPL0 idle contexts and the synthetic idle records are scheduler idle infrastructure and there is no fallback idle path, so a failure to build them is unrecoverable. The idle kernel stack is sized as a full per-thread kernel stack (PROCESS_THREAD_KERNEL_STACK_PAGES), not an IST slot, because kernel_idle_entry runs the deep service_periodic_work() call chain on it (see periodic-service parity below).

Synthetic idle process records. The idle thread is never a runnable user-mode process. The synthetic idle Process (Process::new_idle) maps no user code, no user stack, and no cap ring, and carries an empty cap table. It exists only so the idle ThreadRef resolves through sched.processes and the scheduler’s ThreadRef-centric bookkeeping — set_thread_state, account_thread_selected_locked, current-thread tracking, and the is_idle_thread guard predicate used pervasively across the scheduler — keeps working unchanged. Its address_space is a bare page-table root with nothing user-mapped; it is required by the Process struct but is never loaded as CR3. Every idle dispatch site routes the CPU onto the kernel PML4 via the CPL0 idle context, so the synthetic idle AddressSpace is never made resident and never participates in resident_cpu_mask or TLB-shootdown idle-residency handling.

Dispatch-tuple rewire. After choose_next_locked returns, when the chosen thread is idle_threads[current_cpu_slot()], each dispatch site builds the dispatch tuple from the CPL0 context pointer, the dedicated idle kernel stack top, the kernel PML4 CR3, and the current FS base (no FS-base change). sched_init builds one CPL0 idle context per scheduler CPU slot or panics, so cpl0_idle_context(slot) is infallible at every dispatch site. The schedule() timer path does not route through a dedicated CR3-loading restore helper: the existing timer_interrupt_handler already loads the tuple’s CR3 with write_cr3 before the privilege-agnostic five-element iretq. The three syscall-path sites (capos_block_current_syscall, exit_current, exit_current_thread) keep their restore_context_after_syscall restore tail: they are entered via syscall_entry (which already executed swapgs), so the exit swapgs is required to leave the CPL0 idle thread running with the user GS base — the same GS-base state the timer path’s CPL0 idle thread runs with. Each site emits a distinct marker: sched: dispatch idle cpu=N idle_path=cpl0-dispatch-timer (timer), …cpl0-dispatch-block (blocking syscall), and …cpl0-dispatch-exit (both exit_current and exit_current_thread). debug_assert!s guard the CPL0 dispatch tuple: context cs/ss are the kernel selectors and their RPL bits are 0.

CPL0 idle periodic-service parity. schedule()’s timer Phase 2 runs periodic service work on every tick — deferred process drops, pending terminations, wake_cap_waiters, service_sqpoll_workers(), drain_pending_endpoint_cancellations(), terminal_session::poll_input(), virtio::poll_scheduler(), and the network / pipe / interrupt poll_waiters() calls. A CPL0 idle thread’s timer ticks are kernel-mode and go through kernel_timer_interrupt_handler, which never enters schedule() — so without explicit parity handling that servicing would be stranded whenever a CPU is parked on the CPL0 idle thread. That work is factored into a single service_periodic_work() function with one lock discipline: the scheduler lock is taken only for the bounded deferred-drop / thread-stack-release / wake_cap_waiters / pending-termination extraction, then dropped before drop_pending_process / finish_terminated_process and the lock-free poll block. schedule() calls it after ring dispatch; kernel_idle_entry is its own cooperative loop that, each iteration, runs service_periodic_work(), then next_start_context(false) to re-dispatch a real runnable thread the moment one appears (allow_idle = false so it never re-selects the idle thread), then idle_current_cpu_once() to hlt. The re-dispatch is required: without it a kernel-mode timer tick taken during the idle hlt returns through kernel_timer_interrupt_handler, which does not re-enter schedule(), so the CPU would be stranded. service_periodic_work() and next_start_context() run with interrupts disabled in that loop — the CPL0 idle context is built IF=1 so the periodic tick can preempt the hlt, so the loop must cli before the deep service call; otherwise a CPL0 timer tick taken during service_periodic_work() nests a kernel_timer_interrupt_handler frame onto the idle kernel stack (same-privilege interrupts do not switch stacks). idle_current_cpu_once re-enables interrupts only across its enable_and_hlt and disables them again before returning. 
There is no double-service: a CPU running a real thread gets the service block via schedule(), a CPU on the CPL0 idle thread gets it via the kernel_idle_entry loop, and a given tick on a given CPU is CPL3 (schedule()) xor CPL0-idle (the loop). nohz cadence stays honest because the loop iterates at the timer/IPI cadence — when the periodic tick is suppressed the re-armed one-shot still wakes the hlt, so service_periodic_work() still runs.
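The per-iteration order of the idle loop (service, then re-dispatch check, then hlt) can be sketched with the kernel functions abstracted behind a trait. The IdleOps trait and idle_loop function are hypothetical; in particular, the kernel loop would restore into the chosen thread rather than return it, and interrupt masking is only described in comments here.

```rust
/// Hypothetical hooks standing in for the kernel functions named above.
pub trait IdleOps {
    /// Runs the periodic service block (interrupts disabled in the kernel loop).
    fn service_periodic_work(&mut self);
    /// Returns a real runnable thread id if one exists. The loop passes
    /// allow_idle = false so the idle thread itself is never re-selected.
    fn next_start_context(&mut self, allow_idle: bool) -> Option<u32>;
    /// Enables interrupts, halts until the next interrupt, disables them again.
    fn idle_current_cpu_once(&mut self);
}

/// Runs idle iterations until real work appears and returns that thread id.
pub fn idle_loop<O: IdleOps>(ops: &mut O) -> u32 {
    loop {
        // 1. Service work that schedule() would otherwise have run on a tick.
        ops.service_periodic_work();
        // 2. Re-dispatch as soon as anything real is runnable.
        if let Some(thread) = ops.next_start_context(false) {
            return thread;
        }
        // 3. Nothing to run: halt until the next timer tick or IPI.
        ops.idle_current_cpu_once();
    }
}
```

Placing the re-dispatch check between the service call and the hlt means work made runnable by the service block itself is picked up in the same iteration instead of waiting for another interrupt.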

iretq CPL0 restoration invariant and CPL0 idle-thread prerequisites

This subsection records the load-bearing x86-64 architectural invariant that the CPL0 idle-thread context depends on, along with the prerequisites that the implementation, and any future change to it, must meet.

Authoritative reference: Intel 64 and IA-32 Architectures Software Developer’s Manual (SDM), Volume 2A, IRET/IRETQ instruction reference, “Operation” pseudocode (the IF OperandSize = 64 / 64-bit-mode path), and Volume 3A, Section 6.14.3 “Returning from an Exception or Interrupt Procedure.” The description below applies to IRETQ in 64-bit long mode; the legacy 32-bit IRET paths behave differently and are called out explicitly where it matters.

iretq frame layout and the 64-bit unconditional five-element pop. iretq in 64-bit long mode unconditionally pops five 64-bit (8-byte) values from the top of the current kernel stack, in order: RIP, CS, RFLAGS, RSP, SS. This is true regardless of whether the privilege level changes — both a CPL0→CPL3 return and a CPL0→CPL0 return consume the same five-element frame and load RSP:SS from it. AMD deliberately removed the legacy conditional stack switch for long mode: the “skip SS:ESP on a same-privilege return” behavior exists only in the legacy 32-bit IRET operand-size paths, never in IRETQ.

CPL0 → CPL3 (privilege change, ring exit): The target CS has RPL=3, which differs from the current CPL=0. The CPU installs RIP, CS, and RFLAGS from the frame, then loads RSP and SS from the same frame and transfers to the user-space instruction at RIP on the user stack.
CPL0 → CPL0 (same-privilege, no ring change): The target CS has RPL=0, matching the current CPL=0. iretq still pops all five elements: it installs RIP, CS, and RFLAGS, and also loads RSP and SS from the frame, exactly as in the CPL3 case. There is no same-privilege short-circuit in 64-bit mode. The practical consequence for a CPL0 restore is the opposite of the legacy intuition: the frame’s rsp and ss fields are load-bearing and must carry a valid kernel stack pointer and a valid RPL=0 stack selector, because the CPU will load them.
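The five-element frame and the same-privilege requirement can be written down as a small check. This is an illustrative sketch: IretqFrame and is_kernel_shaped are hypothetical names rather than the kernel's CpuContext, and the canonical-address test assumes 48-bit virtual addresses with the kernel in the upper half.

```rust
/// The five 64-bit values iretq pops in long mode, lowest address first.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct IretqFrame {
    pub rip: u64,
    pub cs: u64,
    pub rflags: u64,
    pub rsp: u64,
    pub ss: u64,
}

/// Requested privilege level: the low two bits of a segment selector.
pub const fn rpl(selector: u64) -> u64 {
    selector & 0b11
}

/// Assumption: 48-bit virtual addresses, kernel in the upper canonical half.
pub const fn is_upper_half_canonical(addr: u64) -> bool {
    (addr >> 47) == 0x1_FFFF
}

/// Checks what a CPL0 -> CPL0 iretq target needs, given that rsp and ss
/// are loaded from the frame unconditionally in 64-bit mode.
pub fn is_kernel_shaped(frame: &IretqFrame) -> bool {
    rpl(frame.cs) == 0
        && rpl(frame.ss) == 0
        && is_upper_half_canonical(frame.rip)
        && is_upper_half_canonical(frame.rsp)
}
```

A frame that carries kernel selectors but a zero or user-half rsp fails the check, which mirrors the point that rsp and ss are load-bearing even when the privilege level does not change.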

Current code. restore_context (kernel/src/arch/x86_64/context.rs lines 311–328) sets RSP to the supplied CpuContext pointer, pops all fifteen caller-saved and callee-saved GPRs (lines 315–327), and executes iretq (line 328). The CpuContext struct (context.rs lines 133–155) places rip, cs, rflags, rsp, and ss at the high end of the struct (lines 150–154), matching the hardware interrupt-frame layout that the CPU pushes when it enters the timer interrupt handler. The comment at line 149 (“Pushed by CPU on interrupt from Ring 3”) reflects how every CpuContext is populated today, but the five-element iretq frame itself is not CPL3-specific — iretq consumes the same five elements for any target CPL.

User-thread contexts. Every user-thread CpuContext is built by Thread::new_user (kernel/src/process.rs), which sets cs = sel.user_code.0 as u64 (RPL=3, value 0x23) and ss = sel.user_data.0 as u64 (RPL=3, value 0x1B). Every iretq issued by restore_context or restore_context_after_syscall into a user thread is therefore a CPL0→CPL3 privilege change into a fully user-shaped context.

CPL0 idle contexts coexist with user contexts. The blocker for a CPL0 target is not iretq frame arithmetic: iretq pops the same five elements for a CPL0 target as for a CPL3 target, so a frame carrying kernel selectors and a valid kernel rsp iretqs correctly. The real requirements are in the surrounding dispatch plumbing, all of which the CPL0 idle path satisfies:

CR3. The dispatch call sites set CR3 to the kernel PML4 for the CPL0 idle path, not to any user AddressSpace page table. The synthetic idle Process’s AddressSpace is never loaded as CR3.
swapgs / GS-base. A CPL0 idle context was never entered through the syscall path. The schedule() timer path reaches it through the timer handler’s own CR3 load and the privilege-agnostic iretq tail (no swapgs in that path at all). The three syscall-path sites (capos_block_current_syscall, exit_current, exit_current_thread) keep their restore_context_after_syscall tail: those sites were entered via syscall_entry (which already swapgsed), so the exit swapgs is required to undo it — leaving the CPL0 idle thread running with the user GS base, the same state the timer path produces.
Kernel-code and kernel-data selectors. A CPL0 CpuContext uses cs = sel.kernel_code.0 as u64 (RPL=0, value 0x08) and ss = sel.kernel_data.0 as u64 (RPL=0, value 0x10). Because iretq loads ss unconditionally in 64-bit mode, ss must be a valid RPL=0 stack selector; the GDT data-selector privilege checks require an RPL=0 ss to be paired with an RPL=0 cs, so the whole context (cs, ss, rsp, CR3, GS base) is kernel-shaped together.
Idle kernel stack. Each CPL0 idle thread has its own dedicated kernel stack (arch::smp::init_idle_kernel_stacks) that does not overlap any IST slot, any per-thread kernel stack, or the BSP/AP boot stacks. Because iretq loads rsp from the frame, the context’s rsp points into this dedicated stack. It is sized as a full per-thread kernel stack because kernel_idle_entry runs the deep service_periodic_work() call chain on it.
No user AddressSpace residency. The synthetic idle Process’s AddressSpace is never made resident and never participates in resident_cpu_mask, so TLB shootdown never stalls waiting for an idle CPU.
No blocking, no exit. The idle thread never calls cap_enter, parks, blocks on any waiter, or exits. The Invariants section entry “The idle thread must never block in cap_enter or exit” carries forward unchanged.

CpuContext::new_cpl0_idle builds the kernel-shaped context, sched::kernel_idle_entry is the entry point, and sched::sched_init wires the per-CPU CPL0 idle contexts and seeds the slot-0 synthetic idle process record (the remaining slots’ records are registered lazily by current_cpu_idle_thread_locked). All four dispatch call sites — schedule(), capos_block_current_syscall, exit_current, exit_current_thread — route idle dispatch onto the CPL0 idle context: the timer path returns the CPL0 context pointer plus the kernel PML4 CR3 in its dispatch tuple and relies on the existing timer_interrupt_handler CR3-load; the three syscall-path sites keep their restore_context_after_syscall tail so the syscall-entry swapgs is undone. The CPL0 contexts are kernel-shaped across cs, ss, rsp, and CR3 together.

Measurement Policy

Design grounding for this policy: this document’s scheduler invariants, docs/backlog/scheduler-evolution.md, docs/proposals/scheduler-evolution-proposal.md, docs/research/future-scheduler-architecture.md, docs/research/out-of-kernel-scheduling.md, docs/research/nohz-sqpoll-realtime.md, and docs/research/completion-ring-threading.md. In particular, docs/research/future-scheduler-architecture.md keeps the always-on versus benchmark-only scheduler telemetry split as an open scheduler question, and the current answer is intentionally conservative.

The current kernel/src/measure.rs counters are benchmark instrumentation, not normal operator observability. They stay behind the measure feature and CAPOS_THREAD_SCALE_GUEST_MEASURE=1 because they add atomics, cycle-counter reads, phase bookkeeping, and in some cases sampled user RIP values to hot scheduler, timer, TLB, ring, and serial paths. Normal QEMU and dispatch builds must not depend on those counters being present.

The per-thread runtime-accounting ledger is split. The WFQ load-bearing core fields, runtime_ns, virtual_runtime_ns, and last_started_ns, are unconditional normal-build state on ThreadCpuAccounting: WFQ ordering, SchedulingPolicyCap.snapshot, and SchedulingContext budget charging depend on them outside cfg(feature = "measure"). The diagnostic fields (context_switches, preemptions, voluntary_blocks, migrations, last_cpu, blocked/exited stability probes, placement buckets, and per-phase attribution counters) stay behind the measure feature. Permanent operator observability is still separate work: it should expose low-rate, non-symbolic snapshots derived from the unconditional ledger plus event counters such as runnable queue depths or high-water marks, reschedule IPI sent/failed/pending counts, TLB shootdown request/failure counts, and scheduler policy admission or denial counts. Those counters must not allocate, log, read raw user PCs, or perform cycle-timing in timer, unblock, direct-IPC fallback, requeue, or steal-requeue paths.
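The unconditional core of the ledger can be sketched as a start/stop pair that charges only running intervals. The field names come from the text above, but everything else is an assumption for illustration: BASE_WEIGHT, the weight-scaling formula, the method names, and modeling last_started_ns as an Option.

```rust
/// Assumed reference weight: a thread at this weight accrues virtual time 1:1.
pub const BASE_WEIGHT: u64 = 1024;

/// Core accounting fields that exist in every build (diagnostics omitted).
#[derive(Debug, Default, Clone, Copy)]
pub struct ThreadCpuAccounting {
    pub runtime_ns: u64,
    pub virtual_runtime_ns: u64,
    /// None while the thread is not running (a modeling choice of this sketch).
    pub last_started_ns: Option<u64>,
}

impl ThreadCpuAccounting {
    /// Called when the scheduler selects the thread to run.
    pub fn on_selected(&mut self, now_ns: u64) {
        self.last_started_ns = Some(now_ns);
    }

    /// Charges the interval since selection and returns the nanoseconds charged.
    /// Time spent blocked or queued is never charged.
    pub fn on_descheduled(&mut self, now_ns: u64, weight: u64) -> u64 {
        let Some(start) = self.last_started_ns.take() else {
            return 0; // not running: nothing to charge
        };
        let delta = now_ns.saturating_sub(start);
        self.runtime_ns = self.runtime_ns.saturating_add(delta);
        // Heavier weights accrue virtual time more slowly (assumed formula).
        let scaled = (delta as u128 * BASE_WEIGHT as u128) / weight.max(1) as u128;
        let scaled = scaled.min(u64::MAX as u128) as u64;
        self.virtual_runtime_ns = self.virtual_runtime_ns.saturating_add(scaled);
        delta
    }
}
```

The now_ns values would come from the normal monotonic clocksource rather than a cycle counter, which keeps this path usable outside the measure feature.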

Benchmark-only attribution stays in measure: per-phase thread-scale checkpoints, guest cycle timings for ring/capnp/method/scheduler segments, scheduler-lock wait and hold cycles, scheduler-lock site attribution, serial byte attribution, timer-mode breakdown, CR3/TLB event totals, thread-placement selection/migration buckets, raw user-PC samples, logging-suppression A/B evidence, and workload/cacheline diagnostics. The publish-placement publish/caller-aware buckets were retired with the per-CPU run-queue collapse. Phase D shipped the fair-share enqueue policy but did not reintroduce those placement counters. A future branch may promote a specific event count only by adding the normal-build storage/API and proving the same emergency-path constraints; it should not simply remove the current cfg(feature = "measure") boundaries from the benchmark module.

The publish-placement publish/caller-aware buckets are still retired; Phase D Task 3 brought back per-CPU placement semantics but does not re-emit the publish counters. Re-instate them through a separate operator-observability slice that proves the same emergency-path constraints, not by removing the existing cfg(feature = "measure") boundary on the historical buckets.

Tickless idle is enabled only for true idle. A scheduler-owned CPU may mask the periodic LAPIC tick when it is running the CPL0 idle context, has no runnable non-idle work, has no active CpuIsolationLease nohz record, has no local deferred cleanup, has no cap-enter polling dependency, and the one-shot clockevent plus non-tick-derived monotonic clocksource are available. The replacement one-shot is bounded by the nearest Timer/ParkSpace deadline or a 100 ms idle housekeeping floor, and the scheduler restores periodic mode before non-idle dispatch, reschedule-IPI wake, or rollback. Cap-enter polling waiters, including the current terminal shell path, and ready threads paused in a SchedulingContext retry window keep the periodic tick until those dependencies move behind explicit deadlines or housekeeping placement.
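The eligibility conditions and the one-shot bound above can be expressed as two small functions. This is a sketch, not the kernel's state machine: IdleTickInputs and both function names are hypothetical, and reading the 100 ms housekeeping floor as the latest allowed wake time is an interpretation of the text.

```rust
/// Idle housekeeping bound from the text: 100 ms, in nanoseconds.
pub const IDLE_HOUSEKEEPING_FLOOR_NS: u64 = 100_000_000;

/// Hypothetical snapshot of the conditions checked before masking the tick.
#[derive(Clone, Copy, Debug)]
pub struct IdleTickInputs {
    pub running_cpl0_idle: bool,
    pub runnable_non_idle_work: bool,
    pub active_lease_nohz_record: bool,
    pub local_deferred_cleanup: bool,
    pub cap_enter_polling_dependency: bool,
    pub oneshot_clockevent_available: bool,
    pub monotonic_clocksource_available: bool,
}

/// True only for "true idle": every blocking dependency must be absent.
pub fn may_mask_periodic_tick(i: &IdleTickInputs) -> bool {
    i.running_cpl0_idle
        && !i.runnable_non_idle_work
        && !i.active_lease_nohz_record
        && !i.local_deferred_cleanup
        && !i.cap_enter_polling_dependency
        && i.oneshot_clockevent_available
        && i.monotonic_clocksource_available
}

/// Absolute deadline for the replacement one-shot: the nearest waiter
/// deadline, but never in the past and never later than now + 100 ms.
pub fn oneshot_deadline_ns(now_ns: u64, nearest_waiter_deadline_ns: Option<u64>) -> u64 {
    let housekeeping = now_ns.saturating_add(IDLE_HOUSEKEEPING_FLOOR_NS);
    match nearest_waiter_deadline_ns {
        Some(deadline) => deadline.clamp(now_ns, housekeeping),
        None => housekeeping,
    }
}
```

Clamping an already-expired deadline to now_ns makes the one-shot fire immediately instead of being skipped, so a late waiter is still serviced on the next wake.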

Generic full-nohz for ordinary budgeted compute threads carries the clockevent/deadline substrate into the CPU-isolation state machine and suppresses ticks only after network polling, IRQ affinity, accounting, deadline, lifetime, and rollback obligations pass. SQPOLL nohz applies the same substrate to explicitly leased caller-thread rings once the SQPOLL worker is live and the single-consumer, owner-lease, wake, and rollback gates pass. Automatic policy issuance and broader SQPOLL userspace-poller/device-queue admission remain separate later CPU-isolation features; see Tickless and Realtime Scheduling and NO_HZ, SQPOLL, and Realtime Scheduling.

Exit switches to the kernel PML4 before tearing down the exiting address space, releases capability authority, completes process waiters, defers final process teardown until the scheduler is running on another kernel stack, and then releases remaining thread kernel stacks through the scheduler-owned OffStackToken path before the Process value is dropped.

Invariants

The idle thread must never block in cap_enter or exit.
Ring dispatch must not hold the scheduler lock.
Timer dispatch copies current-process user buffers through that process’s locked AddressSpace; it must not rely on a raw current-CR3 validate/use window.
Blocked cap_enter waiters wake when enough CQEs are available or their finite timeout expires.
Timer sleep waiters must be bounded per process, tied to the caller ThreadRef generation, and removed when the caller process exits.
Runtime-controlled FS bases must stay in user canonical space.
Direct IPC handoff is a scheduling preference, not a bypass of process liveness, generation, or state checks.
The scheduler must update TSS.RSP0 and the per-CPU syscall kernel RSP through percpu::set_kernel_entry_stack on each switch.
Each PerCpu.current_thread mirrors that CPU’s scheduler current slot; the scheduler lock remains the authority for current-thread and queue ownership even though dispatch/runnable state is now separate from shared process and thread metadata.
Each live ThreadRef may appear in the per-CPU runnable queues at most once across all queues, and every per-CPU queue’s capacity must be reserved up to the live runnable-capable thread count before a new process or thread becomes runnable.
Every queued ThreadRef must be statically eligible to run on its owning CPU. The common publication path resolves single-owner processes to CPU0 before applying the destination sleeper-credit floor and WFQ tag, and the ordered-insert boundary rejects an ineligible slot. Wake targeting must use the resolved queue slot, not an earlier placement preference.
A live generation-checked ThreadRef must have at most one runnable dispatch owner across per-CPU current/handoff_current slots, the per-CPU runnable queues, and the direct IPC target.
Queue migration (including the bounded steal path) must be a scheduler-lock-contained remove-before-publish transfer; no path may publish the same ThreadRef twice into any queue or leave a stale direct target after exit. Migration must recompute virtual_finish_ns at the destination and never carry the source’s WFQ tag as committed state.
Each per-CPU run queue must remain ordered ascending by virtual_finish_ns after every enqueue, requeue, or steal-requeue. Local selection scans the queue by index for the first destination-Runnable entry; RetryLater entries are left in place for the next scheduler pass. The bounded steal path scans each sibling queue’s indices ascending for that queue’s first Runnable-for-destination entry (because each queue is ordered ascending, the first Runnable hit per queue is the lowest virtual_finish_ns candidate the destination can accept on that source), then picks the source queue whose first-Runnable candidate has the lowest virtual_finish_ns globally, with ties broken by lower CPU id. The chosen entry is removed from its actual position on the source queue (not necessarily the head).
Process and thread exit cleanup must assert, before releasing the scheduler lock, that the exiting process or thread has no remaining entry in any per-CPU runnable queue and no remaining direct IPC target slot.
Timer, unblock, direct-IPC fallback, requeue, and steal-requeue paths must use reserved run-queue capacity and avoid allocation.
Runtime accounting must use the normal monotonic clocksource, not benchmark-only cycle counters, and must charge only running intervals.
FS base is saved and restored across context switches for TLS.
Thread records remain generation-checked ThreadRef identities; exited records are retained only while a live handle, pending join, or unjoined status can still observe them.
The final teardown of an exiting process must not release thread kernel stacks until another kernel stack is active, and the implicit Thread::Drop path must not free kernel-stack frames.
A scheduler CPU must never run the same generation-checked ThreadRef twice at once; same-process siblings may run on different scheduler CPUs only when their completions route through distinct per-thread ring endpoints.
Park waiters must be keyed by generation-checked ThreadRef values, reserve one waiter CQE credit, and must not allocate in wait, wake, timeout, or process-exit cleanup paths.
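The steal-selection invariant above (first Runnable entry per sibling queue, lowest virtual_finish_ns globally, ties to the lower CPU id) can be sketched as follows. The Entry and Eligibility types and the steal_for function are hypothetical; in the kernel the eligibility would be evaluated against the destination CPU rather than stored in the entry.

```rust
use std::collections::VecDeque;

/// Whether the destination CPU can run the entry right now (modeled as data).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Eligibility {
    Runnable,
    RetryLater,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Entry {
    pub thread_id: u32,
    pub virtual_finish_ns: u64,
    pub on_destination: Eligibility,
}

/// Picks and removes the best steal candidate for CPU `dest`, if any.
/// Every queue is assumed to be ordered ascending by virtual_finish_ns.
pub fn steal_for(queues: &mut [VecDeque<Entry>], dest: usize) -> Option<Entry> {
    let mut best: Option<(u64, usize, usize)> = None; // (tag, cpu, index)
    for (cpu, queue) in queues.iter().enumerate() {
        if cpu == dest {
            continue;
        }
        // Ascending order makes the first Runnable hit this queue's best offer.
        let first_runnable = queue
            .iter()
            .position(|e| e.on_destination == Eligibility::Runnable);
        if let Some(idx) = first_runnable {
            let tag = queue[idx].virtual_finish_ns;
            // Strict `<` keeps the lower CPU id on ties: CPUs are scanned ascending.
            if best.map_or(true, |(t, _, _)| tag < t) {
                best = Some((tag, cpu, idx));
            }
        }
    }
    let (_, cpu, idx) = best?;
    // Remove from the actual position, which need not be the head.
    queues[cpu].remove(idx)
}
```

RetryLater entries ahead of the chosen candidate stay where they are, so the source queue remains ordered and the next scheduler pass sees them unchanged.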

Code Map

kernel/src/sched.rs - shared process table plus SchedulerDispatch ownership of the per-CPU runnable queues (ordered ascending by virtual_finish_ns), per-CPU current/handoff slots, idle-thread slots, direct IPC target, run-queue reservation accounting, pending drops, and pending stack releases; also blocking, wakeups, Timer sleep waiters, the bounded steal path, and exit.
kernel/src/arch/x86_64/context.rs - CPU context layout, timer entry/restore, tick counter.
kernel/src/arch/x86_64/idt.rs - timer and IPI interrupt handler wiring.
kernel/src/arch/x86_64/lapic.rs - xAPIC MMIO setup, PIT-calibrated LAPIC timer, LAPIC EOI, spurious-vector handling, and fixed-IPI send primitive.
kernel/src/arch/x86_64/tlb.rs - serialized vector-49 TLB shootdown request, pending flush generations, completion token, and interrupt/user-return drain path.
kernel/src/arch/x86_64/pic.rs and kernel/src/arch/x86_64/pit.rs - legacy PIC remap and PIT fallback setup.
kernel/src/arch/x86_64/gdt.rs - BSP/AP TSS and kernel stack storage.
kernel/src/arch/x86_64/syscall.rs - blocking syscall transition for cap_enter.
kernel/src/arch/x86_64/percpu.rs - per-CPU syscall stack registry, TSS.RSP0 update hook, and current thread storage.
kernel/src/arch/x86_64/tls.rs - FS base save/restore.
kernel/src/process.rs - process state, kernel stacks, the synthetic idle process record, and per-thread CPU accounting storage/accessors.

Validation

make run-smoke validates timer preemption, ring fairness, direct IPC handoff, blocked cap_enter wakeups, process exit, and clean halt.
make run-spawn validates process wait blocking and child exit completion through ProcessHandle.wait, Timer monotonic now/sleep completion through timer-smoke, per-process sleep quota isolation through timer-flood, and thread/park lifecycle behavior through thread-lifecycle.
make run-measure validates the post-thread park blocked/resume timing path and process exit while a park waiter is parked.
make run-cloud-gce-legacy-virtio-webui-serving boots the GCE-shaped legacy WebUI topology with two CPUs, asserts that CPU1 is online with its scheduler timer, ages the CPU0-only WebUI through repeated finite five-second cap_enter waits, and requires a proof-feature-scoped scheduler marker that identifies the WebUI process and records a CPU1-preferred wake resolving to CPU0. Three delayed external health requests must then progress before the harness byte-verifies the fixed bundle over the legacy datapath.
cargo build --features qemu verifies QEMU-only scheduler and halt paths.
QEMU smoke output for IPC includes direct handoff diagnostics when the server is woken from a blocked RECV.

Open Work

Prove SQPOLL/poller progress that does not depend on periodic scheduler ticks before automatic nohz activation. Then implement tickless idle only for no-runnable-work CPU idle. Keep runnable contention on periodic preemption until the activation proof closes the remaining network polling, IRQ affinity, and housekeeping dependencies.
Keep SMP behind per-CPU scheduler state and review of any path that needs page pinning beyond the AddressSpace-locked copy/read contract.
Implement the remaining SMP Phase C slices: split shared scheduler metadata, replace the temporary scheduler-owner mask, and collect accepted benchmark evidence.
Add priority or policy scheduling only once the current authority and IPC semantics have proven stable.
Add service restart policy outside the static boot graph.
