# Scheduling

Scheduling decides which thread runs, preserves CPU state across preemption and
blocking, and integrates capability-ring progress with process-owned execution
resources.


## Current Behavior

The scheduler stores shared process/thread metadata in
`Scheduler::processes: BTreeMap<Pid, Process>`. Dispatch-owned runnable state
lives in `SchedulerDispatch`: a per-CPU `run_queues: [VecDeque<ThreadRef>;
SCHEDULER_CPUS]` array ordered ascending by `Thread.virtual_finish_ns`,
per-CPU `current` and `handoff_current` slots, idle-thread slots, the
direct-IPC target preference, run-queue reservation accounting, and
deferred drop/stack release slots.
Each live thread has at most one queued owner across all per-CPU queues
combined, and every per-CPU queue reserves capacity up to the live
runnable-capable thread count before a new thread is published as
runnable, so later timer, unblock, requeue, and steal-requeue paths do
not allocate. The shared live-reservation count is released when
processes or threads exit or when pre-publication reservation is rolled
back. Reserving each queue to the full live-thread count is required
because the bounded steal path may migrate every live thread into a
single sibling queue between two scheduler passes.

Phase D accepted its Task 6 diagnostic closeout at commit `77caafc0`
(`2026-05-10 19:39 UTC`, `docs(scheduler): record phase d thread-scale gate`)
and closed in docs commit `1a08ec23` (`2026-05-10 21:47 UTC`,
`docs(scheduler): close phase d`). The accepted
state is the WFQ scheduler described here: per-thread weights and
latency classes are mutated only through `SchedulingPolicyCap`, each
per-CPU runnable queue is ordered by freshly derived
`virtual_finish_ns`, migration preserves `virtual_runtime_ns`, and
bounded stealing selects the most-overdue runnable sibling candidate.
The controlled Task 6 benchmark pair on `capos-bench` recorded capOS
1-to-4 work/total speedups `3.088x` / `2.700x` versus the previous
single-global-queue baseline `1.566x` / `1.538x`; the matching Linux
pthread baseline on the same host and physical-core logical CPUs
`0,1,2,3` recorded `3.974x` / `3.850x`. The host harness enforced the
configured 1-to-2 work/total gates; the 1-to-4 row was manually accepted from
recorded diagnostics. Phase E
`SchedulingContext` is the next scheduler authority phase; EEVDF is a
follow-on ordering-policy evaluation rather than a Phase D blocker.

Phase D Task 3 (2026-05-07) restored the per-CPU runnable queues that
the 2026-05-02 collapse retired and gave them the WFQ ordering that
Task 2's `virtual_finish_ns` was prepared for. Newly created processes
and threads publish onto the creating scheduler CPU's per-CPU queue; the
bounded steal path balances the queues when other CPUs run out of local
work. Publish-time placement is intentionally simple in this slice
("place locally, let steal balance"); a more sophisticated caller-aware
spread or least-loaded scan is a milestone-gate follow-up, not a Task 3
acceptance requirement. Wake policy carries
`WakePolicy::QueueCpu(u32)` for endpoint, timer, park, process-wait,
thread-join, and process-spawn completions so the wake target matches
the queue placement, and `DirectTarget` keeps its original direct-IPC
handoff role. The transitional `CAPOS_SCHED_DISABLE_WFQ=1` /
`WakePolicy::QueueAny` fallback was removed ahead of Phase E
`SchedulingContext` schema work.

`wake_idle_scheduler_cpus_locked` first probes the placement target
when the policy is `QueueCpu`, then walks eligible idle scheduler CPUs
and wakes the first one that accepts a fresh reschedule IPI. It skips
CPUs that already have a pending IPI, so a burst of ready work
cross-wakes more than one neighbor instead of stranding the rest behind
one already-targeted CPU.

### Ring SQ Consumer Ownership

Each ring endpoint has kernel-owned SQ-consumer metadata outside the writable
userspace ring page. `cap_enter` and the bounded timer-side current-thread ring
service both acquire a syscall-mode owner lease before calling
`process_ring()`. The lease carries a nonzero generation and owner identity;
`process_ring()` verifies that generation before flushing deferred ring work or
advancing SQ head, and stale owners return `StaleSqConsumer` without consuming
the head SQE. Duplicate owners fail closed as a retryable busy `cap_enter`
status.

CQ publication remains independent of SQ ownership. Already accepted
completions stay visible through CQ head/tail even after the SQ owner releases,
and thread/process teardown releases any live SQ owner before ring unmapping or
record drop without clearing accepted CQEs.
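
The generation-checked lease can be pictured with a minimal sketch. The type and method names below (`SqConsumer`, `SqLease`, `acquire`, `consume_one`) are hypothetical stand-ins, not the kernel's actual definitions; only the `StaleSqConsumer` label and the busy/fail-closed behavior come from the description above.

```rust
/// Hypothetical kernel-owned SQ-consumer metadata (illustrative only).
#[derive(Debug, Default)]
pub struct SqConsumer {
    generation: u64,    // bumped on every acquire, so a live lease is never 0
    owner: Option<u32>, // hypothetical owner identity (for example a thread id)
    sq_head: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SqLease {
    generation: u64,
    owner: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RingError {
    Busy,            // duplicate owner: retryable, fail closed
    StaleSqConsumer, // lease generation no longer matches
}

impl SqConsumer {
    /// Acquire the single SQ-consumer lease or fail closed as busy.
    pub fn acquire(&mut self, owner: u32) -> Result<SqLease, RingError> {
        if self.owner.is_some() {
            return Err(RingError::Busy);
        }
        self.generation += 1;
        self.owner = Some(owner);
        Ok(SqLease { generation: self.generation, owner })
    }

    /// Release only if the lease still names the live owner.
    pub fn release(&mut self, lease: SqLease) {
        if self.owner == Some(lease.owner) && self.generation == lease.generation {
            self.owner = None;
        }
    }

    /// Verify the lease before touching SQ head; a stale lease consumes nothing.
    pub fn consume_one(&mut self, lease: SqLease) -> Result<u32, RingError> {
        if self.owner != Some(lease.owner) || self.generation != lease.generation {
            return Err(RingError::StaleSqConsumer);
        }
        self.sq_head = self.sq_head.wrapping_add(1);
        Ok(self.sq_head)
    }

    pub fn sq_head(&self) -> u32 {
        self.sq_head
    }
}
```

In this sketch a lease that outlives its release is rejected before SQ head moves, which mirrors the "stale owners return `StaleSqConsumer` without consuming the head SQE" rule.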

### Bounded SQPOLL ring mode

Phase F adds a bounded SQPOLL mode for the caller thread's ring through
`CpuIsolationLease` with `allowedMode = kernelSqpoll` and `namedRing =
callerThread`. The transition is explicit: syscall-owned dispatch may request
SQPOLL start while it still owns the SQ, then releases its generation-checked
owner; the poller finalizes into `SqpollRunning`, may publish
`NEED_WAKEUP` and enter `SqpollSleeping`, wakes back to running when a producer
publishes a new SQ tail, and stops or rolls back on lease revoke, cap release,
teardown, or failed start. Timer-side syscall-mode ring service fails closed
while SQPOLL owns the same endpoint, so no second SQ consumer can advance the
SQ head.

The Phase F poller runs from the periodic scheduler service path and from a
bounded current-thread syscall service entry used for SQPOLL producer wakes and
explicit syscall kicks. Both entries borrow the SQPOLL owner lease rather than
acquiring syscall SQ ownership. The current default admits two SQEs per
selected SQPOLL worker, and a worker is not reselected in the same
periodic service pass or syscall service entry. Poller elapsed time is charged
to the admitted scheduler ledger or scheduling-context target. The wake/sleep
protocol uses a shared ring flag: the poller
publishes `NEED_WAKEUP`, performs a full ordering barrier, and rechecks SQ
tail before sleeping; producers publish initialized SQEs, store SQ tail with a
barrier, and enter the kernel if `NEED_WAKEUP` is visible. A `cap_enter`
producer wake that finds SQPOLL already owns SQ head can run one bounded SQPOLL
batch, return visible CQ availability when the requested threshold is
satisfied, preserve ordinary blocked-current-thread and thread-owned-head
results, and otherwise fail closed as a retryable busy result. Stale owner
generations fail before deferred ring work or SQE start. If teardown requests
stop after a live owner has already accepted a SQE, the poller still publishes
SQ head for that accepted SQE before releasing ownership, preserving accepted
CQEs without leaving work replayable by syscall mode. The focused
`make run-scheduler-generic-sqpoll-nohz` proof admits this explicit
ring-coupled shape into SQPOLL nohz, drives producer wake and bounded service
progress without depending on a periodic tick, then rolls back on stale
owner/lease revoke. Policy-service automatic nohz, broader
userspace-poller/device-queue admission, and production realtime admission
remain future work.
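
The wake/sleep handshake follows the usual "publish flag, fence, recheck" pattern. The sketch below models it with `std` atomics under stated assumptions: the `Ring` layout, the flag bit value, and the method names are hypothetical, and the real ring lives in a shared page rather than a Rust struct.

```rust
use std::sync::atomic::{fence, AtomicU32, Ordering};

/// Hypothetical bit value for the shared ring flag.
pub const NEED_WAKEUP: u32 = 1;

/// Illustrative ring header; the field layout is an assumption.
pub struct Ring {
    sq_head: AtomicU32,
    sq_tail: AtomicU32,
    flags: AtomicU32,
}

impl Ring {
    pub fn new() -> Self {
        Ring {
            sq_head: AtomicU32::new(0),
            sq_tail: AtomicU32::new(0),
            flags: AtomicU32::new(0),
        }
    }

    /// Poller side: publish NEED_WAKEUP, full barrier, then recheck SQ tail.
    /// Returns true only when it is safe to sleep.
    pub fn poller_try_sleep(&self) -> bool {
        self.flags.fetch_or(NEED_WAKEUP, Ordering::Relaxed);
        fence(Ordering::SeqCst);
        let pending =
            self.sq_tail.load(Ordering::Relaxed) != self.sq_head.load(Ordering::Relaxed);
        if pending {
            // Work arrived before we slept: retract the flag and keep running.
            self.flags.fetch_and(!NEED_WAKEUP, Ordering::Relaxed);
            return false;
        }
        true
    }

    /// Producer side: store the new SQ tail, barrier, then check the flag.
    /// Returns true when the producer must enter the kernel to wake the poller.
    pub fn producer_publish(&self, new_tail: u32) -> bool {
        self.sq_tail.store(new_tail, Ordering::Release);
        fence(Ordering::SeqCst);
        (self.flags.load(Ordering::Relaxed) & NEED_WAKEUP) != 0
    }

    pub fn needs_wakeup(&self) -> bool {
        (self.flags.load(Ordering::Relaxed) & NEED_WAKEUP) != 0
    }
}
```

With a full barrier on both sides, either the poller's recheck observes the new tail or the producer observes `NEED_WAKEUP`, so a published SQE is not left behind a sleeping poller.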

### Per-CPU run queue ordering structure

Each per-CPU `VecDeque<ThreadRef>` is kept ordered ascending by
`Thread.virtual_finish_ns`. Enqueue performs an ordered insert via a
linear scan from the front; selection scans the queue by index for
the first destination-Runnable entry (via
`pop_first_runnable_local_locked`), removes Drop entries it walks
past, and leaves RetryLater entries undisturbed for the next
scheduler pass. Because the queue is ordered ascending, the
first Runnable hit is also the lowest-`virtual_finish_ns` candidate
the destination CPU can accept (the most overdue against fair share
that this CPU is allowed to run). Linear-scan insert is O(n) per
enqueue. With `SCHEDULER_CPUS = 4` and bounded thread counts in this
slice, queues stay short enough to defer a smarter structure (sorted
bucket arrays, intrusive trees) until benchmark evidence shows the scan
dominates scheduler-lock hold time. Promoting to a smarter structure is
a follow-up under this plan if the Task 6 milestone gate proves the
need.

`virtual_finish_ns` is recomputed on every enqueue from the thread's
current `virtual_runtime_ns`, `weight`, and `latency_class`; it is
never carried as committed state across blocking, and migrations
between per-CPU queues recompute it at the destination so the
destination's view of fair-share progress applies. The derivation rule
per latency class is documented in `capos-abi/src/scheduler.rs` and
the "Latency-class semantics for Phase D" section of
`docs/proposals/scheduler-evolution-proposal.md`.
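
A minimal sketch of the ordered insert, assuming a simplified `ThreadRef` that carries its freshly derived tag directly. The struct fields and the function name `enqueue_ordered` are illustrative, and placing equal tags after existing entries is an assumption rather than documented kernel behavior.

```rust
use std::collections::VecDeque;

/// Simplified stand-in for the queued thread reference (hypothetical fields).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ThreadRef {
    pub tid: u32,
    pub virtual_finish_ns: u64,
}

/// Ordered insert by linear scan from the front, keeping the queue ascending
/// by `virtual_finish_ns`. Entries with equal tags keep arrival order
/// (assumption). With capacity reserved up front, the insert does not allocate.
pub fn enqueue_ordered(queue: &mut VecDeque<ThreadRef>, entry: ThreadRef) {
    let pos = queue
        .iter()
        .position(|queued| queued.virtual_finish_ns > entry.virtual_finish_ns)
        .unwrap_or(queue.len());
    queue.insert(pos, entry);
}
```

Reserving the queue with `VecDeque::with_capacity` before publishing threads is what lets later enqueues stay allocation-free in this model.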

### Bounded steal path

When a CPU's local queue has no immediately runnable entry the
scheduler walks sibling per-CPU queues. For each sibling queue the
scan walks indices ascending and selects that queue's first entry
that the destination CPU considers `Runnable`; because each queue is
ordered ascending by `virtual_finish_ns`, the first Runnable hit is
also the lowest `virtual_finish_ns` candidate available to the
destination on that source queue. The steal then picks the source
queue whose first-Runnable candidate has the **lowest**
`virtual_finish_ns` overall, with ties broken by lower CPU id. The
chosen entry is removed from its current position in the source
queue (not necessarily the head: a RetryLater or single-CPU-owner
thread may sit at the source's front and stay there), the WFQ tag is
recomputed at the destination, and the entry is inserted at the
destination's ordered position. The destination queue is reserved to
the full live-thread count, so the steal-requeue is allocation-free.
The scan walks at most `SCHEDULER_CPUS * max_queue_len`
entries, but in practice each sibling scan stops at the first
Runnable candidate per queue.
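
The source-queue selection can be sketched as follows. This is an illustrative model, not the kernel code: `Entry`, `pinned_cpu`, `runnable_on`, `pick_steal`, and `steal` are hypothetical names, "Runnable for the destination" is reduced to a pinning check, and the sketch reuses the existing tag instead of recomputing it at the destination.

```rust
use std::collections::VecDeque;

/// Hypothetical queue entry; `pinned_cpu` stands in for whatever makes an
/// entry non-Runnable from another CPU's point of view.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Entry {
    pub tid: u32,
    pub virtual_finish_ns: u64,
    pub pinned_cpu: Option<usize>,
}

fn runnable_on(entry: &Entry, dest: usize) -> bool {
    entry.pinned_cpu.map_or(true, |cpu| cpu == dest)
}

/// For each sibling queue take its first destination-Runnable entry, then
/// pick the lowest tag overall. Strict `<` while walking CPUs in ascending
/// order breaks ties toward the lower CPU id.
pub fn pick_steal(queues: &[VecDeque<Entry>], dest: usize) -> Option<(usize, usize)> {
    let mut best: Option<(u64, usize, usize)> = None;
    for (cpu, queue) in queues.iter().enumerate() {
        if cpu == dest {
            continue;
        }
        if let Some((idx, e)) = queue.iter().enumerate().find(|(_, e)| runnable_on(e, dest)) {
            let better = match best {
                None => true,
                Some((tag, _, _)) => e.virtual_finish_ns < tag,
            };
            if better {
                best = Some((e.virtual_finish_ns, cpu, idx));
            }
        }
    }
    best.map(|(_, cpu, idx)| (cpu, idx))
}

/// Remove the chosen entry from its position in the source queue and insert
/// it at the destination's ordered position.
pub fn steal(queues: &mut [VecDeque<Entry>], dest: usize) -> Option<Entry> {
    let (src, idx) = pick_steal(queues, dest)?;
    let entry = queues[src].remove(idx)?;
    let pos = queues[dest]
        .iter()
        .position(|q| q.virtual_finish_ns > entry.virtual_finish_ns)
        .unwrap_or(queues[dest].len());
    queues[dest].insert(pos, entry);
    Some(entry)
}
```

Note how an entry pinned to its own CPU can stay at the source queue's front while a later entry is stolen from behind it.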

### RetryLater semantics in the local scan

The local pop scan walks the per-CPU queue by index instead of
popping the front and re-pushing RetryLater candidates. Re-pushing a
RetryLater entry whose `virtual_finish_ns` has not changed would
ordered-insert it back at the same head position, so a naive
pop-then-requeue loop would re-pop the same RetryLater head every
iteration and starve runnable entries behind it. The index scan
removes Drop entries in place, leaves RetryLater entries undisturbed
for the next scheduler pass to re-evaluate, and returns the first
Runnable candidate it finds. The bounded steal path uses the same
index scan on the destination queue after a steal so a stolen
RetryLater entry does not get re-popped in the same dispatch pass.
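
The index scan can be sketched in a few lines. The function name echoes `pop_first_runnable_local_locked`, but the body, the `Verdict` enum, and the idea of storing a precomputed verdict on each entry are simplifying assumptions for illustration.

```rust
use std::collections::VecDeque;

/// How the destination CPU classifies a queued entry (names mirror the text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Runnable,
    RetryLater,
    Drop,
}

/// Hypothetical entry that carries a precomputed verdict.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Entry {
    pub tid: u32,
    pub virtual_finish_ns: u64,
    pub verdict: Verdict,
}

/// Walk the queue by index: remove Drop entries in place, step over
/// RetryLater entries without moving them, return the first Runnable entry.
pub fn pop_first_runnable(queue: &mut VecDeque<Entry>) -> Option<Entry> {
    let mut idx = 0;
    while idx < queue.len() {
        match queue[idx].verdict {
            Verdict::Runnable => return queue.remove(idx),
            Verdict::Drop => {
                // Removing shifts later entries down, so do not advance idx.
                queue.remove(idx);
            }
            Verdict::RetryLater => idx += 1,
        }
    }
    None
}
```

Because RetryLater entries are never popped and re-inserted, a RetryLater head cannot starve the runnable entries queued behind it.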

### Phase E preflight fallback cleanup

The one-bisect-cycle `CAPOS_SCHED_DISABLE_WFQ=1` opt-out has been
removed. Enqueues always target the selected per-CPU WFQ queue, and
wake-up sites always carry `WakePolicy::QueueCpu(slot)` for queued
work. Phase E `SchedulingContext` work therefore starts from the
accepted Phase D WFQ behavior rather than from a source-level
single-global-queue fallback.

### Phase E Task 1: scheduling-context object shape

The first `SchedulingContext` slice is info-only: schema, config,
runtime, and kernel code expose `SchedulingContext.info()` and a
bootstrap grant shape, but add no dispatcher enforcement, replenishment,
donation/return, depletion notification, realtime island, SQPOLL, or
nohz behavior. `SchedulingContextSpec.cpuMask` uses the canonical
little-endian bitset defined in `schema/capos.capnp`: CPU `n` maps to
bit `n % 8` of byte `n / 8`, with bit 0 as the least-significant bit
of that byte. Empty data means no CPUs are selected rather than all
CPUs. Producers omit trailing zero bytes, so the all-zero set's
canonical form is empty and any non-empty canonical mask ends with a
nonzero byte.
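
The bit layout can be checked with a small encoder. The helper names (`encode_cpu_mask`, `mask_contains`, `is_canonical`) are hypothetical; only the bit mapping and the canonical-form rule come from the text above.

```rust
/// Encode a CPU list into the canonical little-endian bitset:
/// CPU n maps to bit (n % 8) of byte (n / 8), bit 0 least significant.
/// The vector only grows to the byte that receives a bit, so the last byte
/// is always nonzero and the empty set encodes as empty data.
pub fn encode_cpu_mask(cpus: &[u32]) -> Vec<u8> {
    let mut bytes: Vec<u8> = Vec::new();
    for &cpu in cpus {
        let byte = (cpu / 8) as usize;
        if bytes.len() <= byte {
            bytes.resize(byte + 1, 0);
        }
        bytes[byte] |= 1u8 << (cpu % 8);
    }
    bytes
}

/// Test membership; bytes past the end of the mask count as zero.
pub fn mask_contains(mask: &[u8], cpu: u32) -> bool {
    mask.get((cpu / 8) as usize)
        .map_or(false, |b| (*b & (1u8 << (cpu % 8))) != 0)
}

/// Canonical masks are empty or end with a nonzero byte.
pub fn is_canonical(mask: &[u8]) -> bool {
    mask.last().map_or(true, |b| *b != 0)
}
```

For example, CPUs 0 and 3 encode as the single byte `0x09`, and CPU 9 alone encodes as `[0x00, 0x02]`.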

### Phase E Task 2: bind, revoke, and generation identity

The second `SchedulingContext` slice adds the first bounded authority
lifecycle. `SchedulingContext.create()`
creates a same-interface result cap for a validated spec, `bindCallerThread()`
records one caller-thread binding for the current context generation, and
`revoke()` advances the generation and clears the matching thread metadata
binding. Bootstrap-granted contexts and contexts returned by `create()` use the
same non-wrapping context-id allocator; the binding identity remains
`(contextId, generation)`, but distinct cap objects no longer share bootstrap
ids. Stale caps report `staleGeneration` and cannot create, bind, or revoke
scheduler metadata for a new generation; already-revoked contexts report
`revoked`. Release cleanup clears only a thread metadata binding that matches
the released cap identity.

### Phase E: SchedulingContext budget enforcement

`make run-scheduling-context` is the focused Phase E QEMU proof. It
starts one process with two independently granted bootstrap contexts, verifies
their identities cannot alias, adopts a created result cap, drives bind/revoke
and stale-generation calls, confirms release cleanup by rebinding after the
released cap drops, and now checks the first dispatcher budget behavior.
`bindCallerThread()` installs a fixed budget ledger in the caller thread's
scheduler metadata. Runtime charge decrements that ledger at the same
scheduler-lock-contained points that update per-thread runtime/vruntime.
Runnable selection replenishes elapsed periods and treats exhausted bound
contexts as `RetryLater` until their next period, leaving the queued owner in
place rather than allocating or moving emergency-path state. Stale or revoked
contexts still fail closed before mutating scheduler metadata or accounting.

The current enforcement granularity is the existing periodic scheduler tick:
a running thread may overshoot its budget by the current tick quantum before
the next dispatch charge throttles it. The smoke therefore proves bounded
dispatcher behavior, not nohz/SQPOLL activation or hard realtime admission. It
prints `dispatch_effect=budgetEnforced`, visible budget charge, replenishment
to full budget after a period, and a throttled wall-clock window.
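
A minimal sketch of the charge/replenish/throttle cycle, under assumptions: the `BudgetLedger` fields, the full-refill replenishment rule, and the method names are illustrative, not the kernel's ledger layout.

```rust
/// Hypothetical fixed budget ledger kept in a thread's scheduler metadata.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BudgetLedger {
    pub budget_ns: u64,
    pub period_ns: u64,
    pub remaining_ns: u64,
    pub next_replenish_ns: u64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Runnable,
    RetryLater,
}

impl BudgetLedger {
    pub fn new(budget_ns: u64, period_ns: u64, now_ns: u64) -> Self {
        assert!(period_ns > 0, "period must be nonzero");
        BudgetLedger {
            budget_ns,
            period_ns,
            remaining_ns: budget_ns,
            next_replenish_ns: now_ns.saturating_add(period_ns),
        }
    }

    /// Runtime charge. With tick granularity the charge can exceed what was
    /// left, so subtraction saturates at zero.
    pub fn charge(&mut self, ran_ns: u64) {
        self.remaining_ns = self.remaining_ns.saturating_sub(ran_ns);
    }

    /// Runnable selection: replenish elapsed periods, then throttle an
    /// exhausted context until its next period.
    pub fn select(&mut self, now_ns: u64) -> Verdict {
        if now_ns >= self.next_replenish_ns {
            let periods = (now_ns - self.next_replenish_ns) / self.period_ns + 1;
            self.next_replenish_ns = self
                .next_replenish_ns
                .saturating_add(periods.saturating_mul(self.period_ns));
            self.remaining_ns = self.budget_ns;
        }
        if self.remaining_ns == 0 {
            Verdict::RetryLater
        } else {
            Verdict::Runnable
        }
    }
}
```

A thread with a 2 ms budget that is charged a 3 ms tick ends at zero remaining budget rather than a negative value, which is one way to model the tick-quantum overshoot described above.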

### Phase F: CpuIsolationLease and automatic nohz activation

`CpuIsolationLease` is a separate authority surface from
`SchedulingContext` CPU-time budget enforcement. The scaffold records owner
identity, allowed CPU set, allowed isolation mode, live accounting target
reference, housekeeping exclusions, maximum revocation latency, and generation
identity. It rejects stale generations, duplicate or overlapping active leases,
fabricated or stale `SchedulingContext` accounting targets, malformed CPU masks,
and lease sets that would leave no online scheduler housekeeping CPU outside
the globally admitted active lease CPUs.

The scheduler-side preflight reports a bounded nohz activation/deactivation
decision surface: lease identity, target CPU mask, target runnable entity
count, active housekeeping CPU availability after subtracting all active lease
CPUs, selected housekeeping CPU mask, deferred cleanup, timer/deadline,
network polling, IRQ-affinity, accounting-target, monotonic
clocksource/accounting readiness, one-SQ-consumer, revocation latency,
rollback, and periodic-fallback labels. The accepted QEMU proof uses `-smp 4`
so an active lease can report ready housekeeping CPUs outside the target CPU,
selected housekeeping placement, and exactly one runnable caller on that
target CPU.

The clockevent/deadline substrate uses a calibrated TSC-backed monotonic
clocksource on normal QEMU/x86_64, with the periodic LAPIC tick disciplining
the TSC epoch so QEMU guest halt windows cannot stall wall-clock progress.
`Timer.sleep`, finite `cap_enter`, and park timeouts store absolute monotonic
`deadline_ns` values, and the LAPIC clockevent backend can program a bounded
one-shot deadline and restore periodic mode.

#### Automatic nohz activation state machine

When the preflight finds every proof obligation satisfied -- a single
runnable entity on the target CPU, a ready housekeeping CPU outside the lease,
no local deferred-cleanup/timer dependency, a valid accounting target, a live
monotonic clocksource, a non-stale one-SQ-consumer when a ring is named, a
bounded revocation latency, and the lease's `allowedCpuMask` naming exactly
one scheduler-owned CPU -- it performs **real per-CPU periodic-tick
suppression** for that narrow single-runnable window. The target CPU may be
the CPU running the preflight call (local activation) or a different
scheduler CPU (remote-CPU activation via a reschedule IPI -- see *Remote-CPU
activation* below). The single-runnable shape differs by target: a local
activation requires the caller itself to be that single entity
(`exactly-one-runnable-caller`); a remote activation requires the target
CPU's single runnable entity to be some thread pinned there, not the caller
(which runs on a different CPU -- `exactly-one-runnable-remote-target`).

- **Admission gates.** Two lease shapes can be admitted for tick suppression:
  a pure `namedRing = none` compute lease, and a ring-coupled
  `allowedMode = kernelSqpoll` lease whose bound ring is being actively driven
  by a live SQPOLL consumer.
  - *Compute lease (`namedRing = none`).* Declares no local network/IRQ
    dependency, so the read-only network-polling and IRQ-affinity admission
    gates pass.
  - *Ring-coupled SQPOLL lease (`allowedMode = kernelSqpoll`,
    `namedRing = callerThread`).* The lease's declared kernel-polled work IS
    the bounded SQPOLL ring poller, which the scheduler keeps progressing
    through `cap_enter`/producer-wake even while the periodic tick is masked.
    The preflight admits it only when the bound ring is in SQPOLL
    running/sleeping mode with a non-stale `Sqpoll` owner; the one-SQ-consumer
    label is then `blocked-sqpoll-owner` (the worker owns the ring). The
    preflight ring-state read is a **best-effort hint** -- it never takes the
    per-ring lock inside the scheduler lock (it uses `try_lock`, and a
    contended snapshot does not admit activation). The decisive disqualifier
    is the IPI/timer re-check below.
  - A `namedRing = callerThread` lease that is *not* `kernelSqpoll`
    (compute-with-ring) keeps the conservative refusal until network polling
    and IRQ affinity are routed to a housekeeping CPU, as does any
    device-owning mode. The kernel still services virtio RX/TX and `Interrupt`
    waiters inline from the periodic scheduler path.
- **Activate.** The preflight masks the periodic LAPIC timer on the current
  CPU and arms a one-shot deadline at `min(nearest pending timer wakeup,
  now + max revocation latency)`. The CPU now runs on a bounded one-shot
  deadline instead of the periodic tick. The eligible lease generation is
  registered so revoke/cleanup paths can stale it.
- **Re-check.** On every timer interrupt and on every reschedule IPI the
  handler re-checks the activation window before the scheduler picks the next
  thread. The reschedule-IPI handler also drains any pending remote-CPU
  activation request parked for this CPU (the IPI vector is shared with the
  remote-activation path -- see *Remote-CPU activation* below), and the
  periodic timer handler drains it too as a backstop.
  An unchanged eligible window re-arms the bounded one-shot deadline;
  a reschedule IPI (the prompt signal that another CPU woke runnable work onto
  this CPU) drives an immediate rollback. The re-check runs in interrupt
  context and uses `try_lock` to avoid deadlocking against a held scheduler
  lock. **Armed-timer invariant:** the masked-periodic one-shot does not
  auto-rearm, so a timer-interrupt re-check NEVER returns leaving a tickless
  CPU without an armed timer -- on scheduler-lock contention it arms a bounded
  minimum-delta fallback one-shot (or restores the periodic tick) before
  returning. A lock-free per-CPU `nohz-active` bitmask lets the contention
  path distinguish a tickless CPU (the consumed timer was the nohz one-shot
  and must be replaced) from a normal CPU (the periodic tick auto-rearms). A
  reschedule IPI does not consume the one-shot, so its contention skip is safe
  -- the still-armed one-shot bounds the next re-check.
- **Rollback.** Any disqualifying change rolls the CPU back to the periodic
  LAPIC tick *first*, before any further ordinary work: a stale lease
  generation (explicit revoke, process exit, service replacement, session
  logout), a second runnable entity or stealable sibling work on the target
  CPU, a local deferred-cleanup dependency, a direct-IPC target becoming
  runnable, a target-CPU mismatch, or a one-shot backend that can no longer
  arm a deadline. For a ring-coupled SQPOLL activation the re-check also
  carries a `sqpoll-ring-mode-changed-or-owner-staled` disqualifier (the bound
  ring leaving SQPOLL running/sleeping mode or its owner staling); that
  re-check runs under the scheduler lock and uses `try_lock` on the per-ring
  lock, so a contended ring is treated as disqualifying (fail-closed --
  restore the periodic tick rather than keep a CPU tickless on an unverifiable
  ring). That SQPOLL ring-mode branch is **defense-in-depth, currently
  subsumed by lease-generation staling**: every reachable SQPOLL-stop path
  today (`stop_sqpoll_for_lease` / `stop_sqpoll_if_owned`) is a
  revoke/cleanup-path caller that also stales the lease, and
  `stale-lease-generation` is checked first -- so the lease-generation stale
  is the load-bearing SQPOLL rollback trigger in practice. The SQPOLL
  ring-mode branch becomes independently load-bearing, and would then need its
  own proof, only if a future change introduces a SQPOLL-stop path that keeps
  the lease live. Runtime accounting stays boundary/counter driven and
  monotonic, so suppressing the tick never strands `SchedulingContext` budget
  charging.
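
The deadline rule and the armed-timer invariant for the timer-interrupt path can be sketched as pure functions. Everything named here (`Window`, `Armed`, `one_shot_deadline`, `timer_recheck`, `min_delta_ns`) is a hypothetical model; the sketch covers only the timer-interrupt re-check, not the reschedule-IPI path, and it omits clamping a deadline that is already in the past.

```rust
/// Inputs to a re-check, gathered into one hypothetical struct.
#[derive(Debug, Clone, Copy)]
pub struct Window {
    pub now_ns: u64,
    pub nearest_timer_ns: Option<u64>, // nearest pending timer wakeup, if any
    pub max_revocation_latency_ns: u64,
    pub min_delta_ns: u64, // bounded fallback used under lock contention
}

/// What the CPU's timer looks like when the handler returns.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Armed {
    OneShot(u64),
    Periodic,
}

/// min(nearest pending timer wakeup, now + max revocation latency).
pub fn one_shot_deadline(w: &Window) -> u64 {
    let bound = w.now_ns.saturating_add(w.max_revocation_latency_ns);
    w.nearest_timer_ns.map_or(bound, |t| t.min(bound))
}

/// Timer-interrupt re-check. Every branch returns an armed timer, so a
/// tickless CPU is never left without one.
pub fn timer_recheck(w: &Window, lock_acquired: bool, still_eligible: bool) -> Armed {
    if !lock_acquired {
        // Contended scheduler lock: the consumed one-shot must be replaced.
        return Armed::OneShot(w.now_ns.saturating_add(w.min_delta_ns));
    }
    if still_eligible {
        Armed::OneShot(one_shot_deadline(w))
    } else {
        // Any disqualifying change rolls back to the periodic tick.
        Armed::Periodic
    }
}
```

Returning a value that always names an armed timer makes the invariant visible in the type: there is no variant for "nothing armed".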

##### Remote-CPU activation

Masking the periodic LAPIC tick and arming the one-shot deadline are per-CPU
operations -- only the target CPU can program its own LAPIC timer. When the
preflight runs on CPU A but the lease's single-CPU `allowedCpuMask` targets a
different CPU B, the kernel does **not** refuse: it parks a bounded
remote-activation request in CPU B's per-CPU slot and sends a
reschedule-style IPI to CPU B. CPU B drains the request from its IPI handler
(and from its periodic timer handler as a backstop), re-runs the full
disqualification check **locally** under its own scheduler-lock acquisition,
and only then arms its own one-shot deadline. A remote activation is never
trusted blind -- the preflight's eligibility snapshot was taken on a
different CPU and may be stale by the time the IPI is drained, so the target
CPU re-checks before committing. The relevant invariants:

- **Bounded request slot, no nesting.** The pending-request store is a fixed
  `[Option<_>; SCHEDULER_CPUS]` array -- one single-entry slot per CPU, so it
  can never grow unbounded. If a slot already holds an undrained request, a
  new preflight fails closed (`rejected`) rather than queuing behind it. The
  IPI-context drain never nests the scheduler lock: it takes only the small
  per-CPU slot mutex, then calls the activation in `try_lock` mode.
- **Contention retry.** If the IPI-context drain finds the scheduler lock
  contended, it leaves the request parked and returns; the target CPU's next
  periodic timer tick (still live -- the tick has not been suppressed) retries
  the drain. Progress is bounded by the periodic tick the same way the
  existing local re-check contention path is.
- **Fail-closed IPI ordering.** A remote rollback
  (`rollback_nohz_for_lease`) stales the lease generation *before* clearing
  the activation record. The drain re-checks the generation before arming, so
  a rollback that races the drain fails closed (the request is dropped, the
  periodic tick stays live). If the drain already committed before the
  rollback cleared the record, the target CPU's next `nohz_recheck` sees the
  `nohz-active` bit set with no record and restores its periodic tick. Either
  ordering converges on the periodic tick.
- **Compute-only.** Remote-CPU activation is limited to `namedRing = none`
  compute leases in this slice. A ring-coupled SQPOLL lease whose target
  differs from its ring owner's CPU is not an admitted shape; it fails closed.

Generic full-nohz admission for ordinary budgeted compute threads is available
only through an explicit `SchedulingContext`-targeted compute lease and the same
fail-closed placement gates described above. The SQPOLL nohz state machine now
admits explicitly leased caller-thread rings when the SQPOLL worker is live,
single-consumer, and bounded by producer wake/deadline rollback. Broader
userspace-poller/device-queue admission, automatic CPU-isolation issuance, and
production realtime island admission remain future work; `auto_nohz` stays
disabled.

Timeout-based auto-revoke landed 2026-05-30 15:22 UTC: a `CpuIsolationLease`
created with `leaseLifetimeNs > 0` records an absolute expiry deadline,
auto-revokes through the existing generation-advancing cleanup on first
observation past it (`reason=lease-expired`), and the nohz activation record
carries the lifetime deadline so a tickless CPU rolls back at the next
timer/IPI recheck (`lease-lifetime-expired` disqualifier), bounded by
`maxRevocationLatencyNs`. A `leaseLifetimeNs` of `0` preserves the prior
revoke/cleanup-only lifecycle.

The current SQPOLL-driven activation is the bounded case: tick suppression
for a ring-coupled `kernelSqpoll` lease on the CPU running the preflight,
rolled back through lease-generation staling on revoke/cleanup, with the
SQPOLL ring-state re-check as defense-in-depth for any future SQPOLL-stop
path that does not stale the lease.

Lease revocation and cleanup are generation-aware. Explicit revoke, process
exit, service replacement through process termination, and session logout stale
the matching generation so old caps cannot keep isolation eligibility alive,
and rolling the matching lease's active nohz window back to the periodic tick
is part of the same cleanup path.
`make run-scheduler-cpu-isolation-lease` is the broad QEMU proof for grant,
info, revoke, cleanup, real nohz activation and fail-closed rollback, bounded
SQPOLL start/sleep/stop, rollback labels, generic full-nohz, and SQPOLL nohz.
`make run-scheduler-generic-sqpoll-nohz` is the focused SQPOLL proof for
eligible ring admission, producer wake, SQPOLL service, rollback, and stale
owner rejection.

### Phase E: endpoint donation and return

Synchronous endpoint delivery now carries a bounded internal donation token
when a caller thread with a bound active `SchedulingContext` delivers a CALL
to a receiver thread that has no scheduling context of its own. Donation is
strictly passive-server shaped: receivers that already have a scheduling
context keep their own authority, unbound callers donate nothing, and callers
that receive a donation token are blocked from returning to userspace until
the in-flight endpoint call returns or is canceled.

At delivery, the scheduler charges pre-donation caller runtime before moving
the context ledger to the receiver. While the receiver handles the endpoint
message, normal dispatcher runtime charging decrements the donated context.
When endpoint RETURN commits the caller completion, the scheduler first charges
receiver runtime since dispatch, then returns the remaining budget and
next-replenishment state to the caller's thread metadata and rebinds the
`SchedulingContext` record to the caller. Return preflight failures leave the
in-flight donation in place, while application-exception RETURN,
invalid-result RETURN errors, delivery failure, return cancellation, endpoint
teardown, process/thread exit, and stale-caller cleanup return or clear the
donation before waking the caller and without allocating new emergency-path
storage. Nested donation of an already donated context is rejected; supporting
stacked donation is deferred until it has an explicit return-token stack
design.
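
The ledger movement across CALL and RETURN can be sketched with two functions. All names here (`Ledger`, `ThreadMeta`, `donate_on_call`, `return_on_reply`) are hypothetical, the order of the eligibility checks is an assumption, and the sketch ignores replenishment state, cancellation, and teardown paths.

```rust
/// Hypothetical per-thread budget ledger; `donated` marks a ledger that is
/// currently held on behalf of a blocked caller.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Ledger {
    pub remaining_ns: u64,
    pub donated: bool,
}

#[derive(Debug, Default)]
pub struct ThreadMeta {
    pub ledger: Option<Ledger>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Donate {
    Donated,
    NothingToDonate,  // unbound caller
    ReceiverKeepsOwn, // receiver already has a scheduling context
    NestedRejected,   // caller is itself running on a donated context
}

/// CALL delivery: charge pre-donation caller runtime, then move the ledger.
pub fn donate_on_call(caller: &mut ThreadMeta, receiver: &mut ThreadMeta, caller_ran_ns: u64) -> Donate {
    let mut ledger = match caller.ledger {
        None => return Donate::NothingToDonate,
        Some(l) if l.donated => return Donate::NestedRejected,
        Some(l) => l,
    };
    if receiver.ledger.is_some() {
        return Donate::ReceiverKeepsOwn;
    }
    ledger.remaining_ns = ledger.remaining_ns.saturating_sub(caller_ran_ns);
    ledger.donated = true;
    caller.ledger = None;
    receiver.ledger = Some(ledger);
    Donate::Donated
}

/// RETURN: charge receiver runtime since dispatch, then hand the remaining
/// budget back to the caller. Returns false if no donation was in flight.
pub fn return_on_reply(caller: &mut ThreadMeta, receiver: &mut ThreadMeta, receiver_ran_ns: u64) -> bool {
    match receiver.ledger.take() {
        Some(mut l) if l.donated => {
            l.remaining_ns = l.remaining_ns.saturating_sub(receiver_ran_ns);
            l.donated = false;
            caller.ledger = Some(l);
            true
        }
        other => {
            // Not a donated ledger: leave the receiver's own state untouched.
            receiver.ledger = other;
            false
        }
    }
}
```

In this model the caller ends up with its original budget minus both its own pre-CALL runtime and the receiver's service time, which is the "reduced budget restored" observation the smoke checks.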

`make run-scheduling-context` proves the behavior with a same-process endpoint
round trip. The caller binds a fresh context, burns CPU immediately before
CALL, the passive server burns CPU while servicing the endpoint CALL and again
immediately before RETURN, and after RETURN the caller observes the reduced
budget restored. The same smoke covers application-exception RETURN,
oversized-result RETURN under donation, and deterministic rejection of
A-to-B-to-C nested donation. It also submits a delivered donated CALL and then
uses `cap_enter(0, 0)` while the server delays RETURN, proving the donor cannot
continue outside the donated ledger. A fast-return variant covers the race where
the receiver returns before the caller commits to the donation-blocked scheduler
state. The smoke prints `endpoint_donation=ok`, `endpoint_return=ok`,
`endpoint_exception_return=ok`,
`endpoint_invalid_return=ok`, `endpoint_nested_rejected=ok`,
`endpoint_donor_block=ok`, `endpoint_donor_fast=ok`,
`endpoint_donation_server`, `endpoint_donation_after`,
`endpoint_exception_return_after`, `endpoint_invalid_return_after`,
`endpoint_nested_after`, `endpoint_donor_block_elapsed_ns`,
`endpoint_donor_block_after`, `endpoint_donor_fast_elapsed_ns`, and
`endpoint_donor_fast_after`.

### Phase E: SchedulingContext notifications

Every `SchedulingContext` now owns fixed notification storage allocated at
context creation or bootstrap. The storage has two coalescing slots:
`budgetDepleted` and `deadlineOrTimeout`. Each slot records context
id/generation, a saturating sequence, a saturating coalesced-event count, the
last holder thread, remaining budget, the next replenishment/deadline
timestamp, and whether the holder was using an endpoint-donated context.
Runtime charge records depletion when remaining budget transitions to zero and
records deadline/timeout expiry against the same context generation. Failed
bind attempts do not arm a new budget/deadline window.

`SchedulingContext.drainNotifications()` returns typed observer results:
`ok` drains the matching fixed cells, `revoked` reports the current revoked
generation, and `staleGeneration` reports an old observer generation without
draining the current record. Explicit `revoke()` records an `explicitRevoke`
lifecycle event. These notifications explain already-enforced scheduler state;
they do not donate budget, reorder runnable entities, bypass throttling,
publish result caps, append unbounded queues, allocate on scheduler hard paths,
or imply auto-nohz/SQPOLL/tickless behavior. A pre-armed observer waiter/wakeup
path remains a future extension.
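
A coalescing slot with saturating counters might look like the sketch below. The `Slot` layout, the meaning of the coalesced count (events since the last drain), and the `DrainResult` shape are assumptions for illustration; the `revoked` result and the second slot are left out.

```rust
/// Hypothetical fixed notification cell; one such slot per event kind.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct Slot {
    pub generation: u64,
    pub sequence: u32,  // saturating count of all recorded events
    pub coalesced: u32, // saturating count since the last drain (assumption)
    pub holder_tid: u32,
    pub remaining_ns: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DrainResult {
    Ok(Option<Slot>),
    StaleGeneration,
}

impl Slot {
    /// Record an event by overwriting the fixed cell; nothing is queued or
    /// allocated, and both counters saturate instead of wrapping.
    pub fn record(&mut self, holder_tid: u32, remaining_ns: u64) {
        self.sequence = self.sequence.saturating_add(1);
        self.coalesced = self.coalesced.saturating_add(1);
        self.holder_tid = holder_tid;
        self.remaining_ns = remaining_ns;
    }

    /// Drain for an observer. A stale observer generation reports
    /// `StaleGeneration` and leaves the current record untouched.
    pub fn drain(&mut self, observer_generation: u64) -> DrainResult {
        if observer_generation != self.generation {
            return DrainResult::StaleGeneration;
        }
        if self.coalesced == 0 {
            return DrainResult::Ok(None);
        }
        let snapshot = *self;
        self.coalesced = 0;
        DrainResult::Ok(Some(snapshot))
    }
}
```

Two depletion events followed by one drain yield a single snapshot with a coalesced count of 2 and the last holder recorded, which is the coalescing behavior in miniature.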

`make run-scheduling-context` proves the notification slice by repeatedly
draining a depleted context after coalescing, observing deadline expiry,
recording explicit revoke and stale-observer labels, and confirming that
endpoint-donated runtime records notification state on the donated context. The
smoke prints `notification_coalescing=ok`, `deadline_notification=ok`,
`revoke_notification=explicitRevoke`, `stale_notification=staleGeneration`,
and `endpoint_donated_notification=ok`.

### Phase E: session logout lifecycle hook

`UserSession.logout()` now notifies the scheduler after the session liveness
cell transitions from live to logged out. That covers explicit
`UserSession.logout()` calls, including the remote DTO gateway logout command
and connection-teardown path because those paths already call the same kernel
`UserSession.logout()` method. The hook scans scheduler-owned process/thread
metadata for live processes whose immutable `SessionContext` shares the logged
out liveness cell, removes each non-donated matching thread binding from the
scheduler ledger, and asks the bound `SchedulingContext` record to advance its
generation and mark itself revoked. Old ordinary `SchedulingContext` grants
therefore report stale generation through `info()` with zero visible remaining
budget and `InfoOnlyNoDispatchChange`. The focused session-context smoke also
proves stale `bindCallerThread()` does not rebind, stale `create()` does not
publish a result cap, stale `revoke()` does not mutate the current metadata
generation, and stale notification draining reports a stale observer result.

The hook intentionally does not use session code as a second scheduling-context
ledger: session lifecycle code only flips liveness and notifies the scheduler,
and the scheduler owns the scan and binding removal. The scan takes one binding
at a time under the scheduler lock, drops that lock, then calls the
`SchedulingContextExitCleanup` record hook so it does not invert the existing
`SchedulingContext` record-lock to scheduler-lock order used by
`bindCallerThread()`.
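
The one-binding-at-a-time scan can be sketched as follows. This is a minimal
model under stated assumptions, not the kernel code: `Binding`, `LogoutReport`,
`logout_scan`, and the `revoke_record` callback are hypothetical stand-ins, and
a `std::sync::Mutex` stands in for the scheduler lock. The two counters reuse
the `stale_marked` / `donation_inflight_skipped` labels that the smokes print.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for one scheduler-ledger binding.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Binding {
    thread_id: u32,
    session_cell: u32, // identity of the session liveness cell
    donated: bool,     // true while an endpoint donation is in flight
}

#[derive(Debug, Default, PartialEq)]
struct LogoutReport {
    stale_marked: u32,
    donation_inflight_skipped: u32,
}

/// Remove matching non-donated bindings one at a time. The ledger lock is
/// released before `revoke_record` runs, so the record hook never executes
/// while the scheduler lock is held.
fn logout_scan(
    ledger: &Mutex<Vec<Binding>>,
    logged_out_cell: u32,
    mut revoke_record: impl FnMut(Binding),
) -> LogoutReport {
    let mut report = LogoutReport::default();
    // Donated bindings are only counted; they stay in the ledger.
    report.donation_inflight_skipped = ledger
        .lock()
        .unwrap()
        .iter()
        .filter(|b| b.session_cell == logged_out_cell && b.donated)
        .count() as u32;
    loop {
        // Scope the guard so it drops before the record hook is called.
        let next = {
            let mut guard = ledger.lock().unwrap();
            let pos = guard
                .iter()
                .position(|b| b.session_cell == logged_out_cell && !b.donated);
            let removed = pos.map(|i| guard.remove(i));
            removed
        };
        match next {
            Some(binding) => {
                revoke_record(binding);
                report.stale_marked += 1;
            }
            None => return report,
        }
    }
}
```

Because the guard is scoped to the inner block, the callback may itself take
other locks without inverting the order described above.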

In-flight endpoint donation uses a conservative counted/skipped logout policy.
If the logged-out session owns a receiver thread that currently holds a
donated context, the logout hook records that the donated binding was skipped
rather than returning donor budget while the endpoint call remains in flight.
The focused session-context smoke proves the donor remains blocked in
`cap_enter(0, 0)` until the receiver returns, the hook reports
`donation_inflight_skipped=1`, and endpoint RETURN removes the receiver
binding while restoring only the reduced remaining budget to the donor. This
does not add a new logout-triggered cancellation semantic. Local owner-shell
exit now calls the held `UserSession.logout()` before clean shell process exit,
so the same scheduler hook observes shell logout with
`stale_marked=0 donation_inflight_skipped=0` in the shell smoke. The ordinary
bound-context stale proof remains the focused session-context smoke, because
the normal shell does not hold a bound `SchedulingContext`. Process and thread
exit cleanup already have their own stale-context coverage and are unchanged.

Realtime islands, SQPOLL, auto-nohz, and CPU placement enforcement remain
future Phase F/G work.

### Phase D Task 4: migration fairness invariants

Phase D Task 4 (2026-05-08) made three migration-fairness invariants
explicit:

- **`virtual_runtime_ns` travels with the thread.** It lives on
  `Thread.cpu_accounting`, not on a per-CPU slot, so a migration from
  CPU A to CPU B preserves the thread's accumulated weighted-fair
  share. The accounting field was promoted out of `cfg(measure)` in
  Task 2 and continues to advance through `charge_runtime` regardless
  of which CPU charges the quantum.
- **`virtual_finish_ns` is derived per enqueue, never committed.**
  Every enqueue site -- the initial publish in
  `enqueue_ready_thread_on_slot_locked`, the post-block requeue in
  `enqueue_unblocked_thread_on_slot_locked`, and the steal-insert in
  `steal_from_sibling_queues_locked` -- routes through
  `refresh_virtual_finish_ns_locked`, which reads
  `thread.weight`, `thread.latency_class`, and
  `thread.cpu_accounting.virtual_runtime_ns` fresh and recomputes the
  WFQ ordering tag. The field is never carried as committed state
  across blocking and is never carried with the thread on migration;
  the destination CPU's view of weight, latency class, and quantum
  decides the new tag.
- **Steal recomputes at the destination.** The pop-from-source step in
  `steal_from_sibling_queues_locked` is followed by
  `refresh_virtual_finish_ns_locked` against the destination slot
  before the ordered insert, so a `SchedulingPolicyCap.setWeight` that
  landed between source enqueue and steal takes effect at the steal
  itself.
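
The three invariants can be illustrated with a small model. This is a sketch
under stated assumptions: the document does not give the tag formula, so
"vruntime plus one weight-scaled quantum" is an assumption, as are the
`REFERENCE_WEIGHT` and `QUANTUM_NS` values and the simplified `Thread`,
`enqueue`, and `steal` shapes. Latency class is omitted.

```rust
use std::collections::VecDeque;

const REFERENCE_WEIGHT: u64 = 64; // assumed value for this sketch
const QUANTUM_NS: u64 = 1_000_000; // assumed quantum

#[derive(Clone, Copy, Debug, PartialEq)]
struct Thread {
    id: u32,
    weight: u64,             // assumed validated as nonzero
    virtual_runtime_ns: u64, // travels with the thread
    virtual_finish_ns: u64,  // derived per enqueue, never trusted as state
}

/// Assumed tag formula: vruntime plus one weight-scaled quantum.
fn refresh_virtual_finish_ns(t: &mut Thread) {
    t.virtual_finish_ns = t.virtual_runtime_ns + QUANTUM_NS * REFERENCE_WEIGHT / t.weight;
}

/// Refresh the tag, then insert so the queue stays ascending by tag.
fn enqueue(queue: &mut VecDeque<Thread>, mut t: Thread) {
    refresh_virtual_finish_ns(&mut t);
    let pos = queue.partition_point(|q| q.virtual_finish_ns <= t.virtual_finish_ns);
    queue.insert(pos, t);
}

/// Pop the most-overdue head among sibling queues, then re-tag it at the
/// destination before the ordered insert.
fn steal(queues: &mut [VecDeque<Thread>], dest: usize) -> Option<u32> {
    let src = (0..queues.len())
        .filter(|&i| i != dest)
        .filter_map(|i| queues[i].front().map(|t| (t.virtual_finish_ns, i)))
        .min()?
        .1;
    let t = queues[src].pop_front()?;
    let id = t.id;
    enqueue(&mut queues[dest], t);
    Some(id)
}
```

In this model a weight change that lands while a thread sits in a source queue
leaves the stored tag untouched; the steal recomputes it from the new weight.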

#### Migrations counter shape

`ThreadCpuAccounting.migrations` is `cfg(feature = "measure")`-gated
and remains a benchmark-only operator-observability counter; it is
not load-bearing for ordering and is not exposed through
`SchedulingPolicyCap.snapshot`. Phase D Task 4 moved the increment
from the dispatch-time `scheduled_measure` path to two enqueue-time
arms in `kernel/src/sched.rs`:

- **Placement-time spread** (`record_placement_spread_migration_locked`)
  fires from `push_reserved_run_queue_locked` when the enqueue target
  slot differs from the thread's previously dispatched CPU
  (`ThreadCpuAccounting.last_cpu`). A thread that has never been
  dispatched (`last_cpu == None`) does not register a migration on
  first publish; otherwise placement spread is counted exactly once
  per enqueue.
- **Steal** (`record_steal_migration_locked`) fires from
  `steal_from_sibling_queues_locked` after the source-queue removal
  and before the destination-queue insert. The steal scan skips the
  destination slot, so the counter increments unconditionally each
  time the steal arm is reached.

`scheduled_measure` still maintains `last_cpu` so the placement-spread
check has the previous CPU available; only the `migrations` increment moved.
The pre-collapse counter shape is preserved in steady state -- a
thread that runs on a different CPU than its previous run still
records exactly one migration -- but the increment is now attributed
to the enqueue decision (placement spread or steal) rather than the
dispatch that follows it.
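
A minimal sketch of the two enqueue-time arms, assuming a reduced
`ThreadCpuAccounting` with only the two fields involved. The method names are
shortened, hypothetical versions of the helpers named above, and the real
record is `measure`-gated.

```rust
#[derive(Debug, Default)]
struct ThreadCpuAccounting {
    last_cpu: Option<usize>,
    migrations: u64,
}

impl ThreadCpuAccounting {
    /// Placement-spread arm: count when the enqueue target differs from the
    /// previously dispatched CPU. A never-dispatched thread does not count.
    fn record_placement_spread(&mut self, target_cpu: usize) {
        if matches!(self.last_cpu, Some(prev) if prev != target_cpu) {
            self.migrations += 1;
        }
    }

    /// Steal arm: the scan never selects the destination slot, so the
    /// increment is unconditional.
    fn record_steal(&mut self) {
        self.migrations += 1;
    }

    /// Dispatch still maintains `last_cpu`; it no longer touches `migrations`.
    fn scheduled(&mut self, cpu: usize) {
        self.last_cpu = Some(cpu);
    }
}
```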

The aggregate process-wide `thread_placement` counter family in
`kernel/src/measure.rs` (`migrations`, `migration_to_cpu0..3`,
consumed by `tools/qemu-thread-scale-harness.sh`) is a separate
measurement device. It is incremented from
`account_thread_selected_locked` at dispatch time and continues to
observe "thread ran on a different CPU than its previously
dispatched CPU" rather than the per-thread Task 4 enqueue-time
shape, so the thread-scale harness regex does not need to change.
The per-thread `ThreadCpuAccounting.migrations` field and the
aggregate `thread_placement` counter intentionally measure different
events at different points in the scheduling pipeline; both stay
behind `cfg(feature = "measure")`.

### Phase H: per-thread saturation status surface

The Phase H AutoNoHz placement heuristic (a future policy-service
feature) needs to read per-thread saturation observation in the normal
dispatch build, not only under `cfg(feature = "measure")`. The
non-`measure` per-thread saturation status surface (2026-05-30)
promoted the inputs it consumes into ordinary `ThreadCpuAccounting`
state and exports them through `SchedulingPolicyCap.snapshot @2`:

- **`voluntary_blocks`** and **`preemptions`** moved out of
  `cfg(feature = "measure")`. They are charged at the same sites as
  before -- `voluntary_blocks` when a thread blocks itself (`cap_enter`
  wait, park, endpoint scheduling-context donation) and `preemptions`
  when the timer requeues a still-runnable running thread -- so the
  `measure` build's counts are unchanged; only the `cfg` gate was
  removed. A low `voluntary_blocks` count distinguishes a CPU-saturating
  thread from an IPC/IO-bound one.
- **`runnable_accumulated_ns`** is a new always-built cumulative counter
  of runnable-but-not-running time. It is charged at the
  scheduler-lock-held enqueue/select boundary: `push_reserved_run_queue_locked`
  stamps a monotonic `runnable_since_ns` when a thread is published to a
  per-CPU run queue without being selected (idempotent across re-publish,
  so the whole runnable span is counted once), and
  `account_thread_scheduled` accumulates the monotonic delta and clears
  the stamp when the thread is next selected. The stamp/accumulate pair
  nets to zero for a thread selected at the same monotonic instant it
  becomes runnable. The clock is `monotonic_ns()` only (no wall-clock,
  no rewind), matching `charge_runtime`'s discipline, and the stamp
  respects the runnable-ownership rules above (a thread holds a live
  stamp only between enqueue and selection).
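
The stamp/accumulate pair can be sketched as below. `RunnableAccounting`,
`on_enqueued`, and `on_selected` are hypothetical names for this illustration;
the caller is assumed to pass a monotonic timestamp.

```rust
#[derive(Debug, Default)]
struct RunnableAccounting {
    runnable_since_ns: Option<u64>,
    runnable_accumulated_ns: u64,
}

impl RunnableAccounting {
    /// Enqueue side: stamp only if no stamp is live, so a re-publish does
    /// not restart the runnable span.
    fn on_enqueued(&mut self, now_ns: u64) {
        if self.runnable_since_ns.is_none() {
            self.runnable_since_ns = Some(now_ns);
        }
    }

    /// Select side: accumulate the monotonic delta and clear the stamp.
    /// Without a live stamp this is a no-op.
    fn on_selected(&mut self, now_ns: u64) {
        if let Some(since) = self.runnable_since_ns.take() {
            self.runnable_accumulated_ns += now_ns.saturating_sub(since);
        }
    }
}
```

A thread stamped at 100 ns, re-published at 150 ns, and selected at 400 ns
accumulates 300 ns in this model; a thread selected at the instant it becomes
runnable accumulates nothing.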

`migrations` stays `measure`-gated; it is a placement diagnostic, not a
saturation input. The surface exports raw cumulative counters only --
windowing, smoothing, and the saturation decision are policy-service
choices, never kernel state (see
`docs/proposals/tickless-realtime-scheduling-proposal.md`). Proof:
`make run-thread-fairness` reads the extended snapshot on the weighted
workers and asserts the CPU-bound hog reports high `runtime_ns` with
`voluntary_blocks` at or near zero while at least one preempted
lower-weight worker reports nonzero `preemptions` and
`runnable_accumulated_ns`.

#### Weight-change-while-enqueued contract

`SchedulingPolicyCap.setWeight` writes the validated weight directly
to `Thread.weight` through `Process::set_thread_weight` and does not
clear `Thread.virtual_finish_ns`. A weight change observed while the
thread is blocked, running, or already queued takes effect on the
**next dequeue and re-enqueue** because every enqueue site refreshes
`virtual_finish_ns` from current `weight`/`latency_class`/
`virtual_runtime_ns`. The kernel proves the contract two ways:

- **By construction.** `Process::refresh_thread_virtual_finish_ns`
  reads each input field fresh on every call; there is no cached
  derivation between enqueues. The function bears a doc-comment
  asserting the contract.
- **By `debug_assert!`.** Inside the same function, a debug assertion
  verifies that the recomputed `virtual_finish_ns` is at or beyond the
  current `virtual_runtime_ns` -- a future deadline, never a past
  one. The assertion catches any future regression where the formula
  could underflow or where a stale cache could drift below the
  current vruntime.

The focused QEMU smoke that drives `setWeight` and verifies the
post-block dispatch picks up the new weight landed under Phase D
Task 5: `make run-thread-fairness-weight-change` (manifest
`system-thread-fairness-weight-change.cue`, demo
`demos/thread-fairness/`). Two competing child threads run a
fixed wall-clock window: a baseline worker stays at
`DEFAULT_WEIGHT`, while a heavy worker self-calls
`SchedulingPolicyCap.setWeight(weight=128)` and then blocks on
`Timer.sleep` so it leaves the run queue before the contention
window opens. Each worker snapshots its scheduler state at wake
and at window end via `SchedulingPolicyCap.snapshot`, and the
parent verifies three independent properties: (1) the heavy
snapshot reads `weight == 128` and the baseline snapshot reads
`weight == DEFAULT_WEIGHT`; (2) the observed `runtime_ns` ratio
matches the weight ratio inside a configured tolerance; (3) the
heavy worker's `virtual_runtime_ns` advances at roughly half
the rate of its `runtime_ns` (vruntime/runtime ~= 0.5 for
weight=128, ~= 1.0 for DEFAULT_WEIGHT). A scheduler that
re-enqueued or dispatched the heavy worker using a stale
`virtual_finish_ns` derived from `DEFAULT_WEIGHT` would not
show the weight-proportional CPU share, and a scheduler that
held a stale weight inside `charge_runtime` would yield heavy
vruntime/runtime ~= 1.0 instead of ~= 0.5; the smoke trips on
either regression. The capability is bound to
`CapCallContext::caller_thread` (Phase D Task 2 decision), so
same-thread self-mutation is the only authorized shape for this
proof; cross-thread weight authority remains a Phase H
privileged scheduler-policy service concern.

The thread-scale benchmark was repaired before accepting the milestone. The old
1 MiB/spinning-parent shape was not a valid four-core reference because the
matching Linux pthread baseline also failed at four workers. The accepted
benchmark shape uses a blocking parent join, 262,144 blocks (16 MiB), and
`work_rounds=64`. The formal accepted-evidence pair is the `capos-bench`
2026-05-02 21:38 UTC 5-run pair pinned to physical-core logical CPUs
`0,1,2,3` against `main` commit `374f8556`: capOS 1-to-2 work `1.883x` and
total `1.787x` clear the configured `1.6x` gates, while the matching Linux
pthread baseline records `1.988x`/`1.987x`. Its 1-to-4 row became the
diagnostic that justified Phase D's fair-share enqueue policy: capOS
`1.566x`/`1.538x` versus Linux `3.963x`/`3.858x`, a clear bottleneck
in the then-current single-global-queue scheduler. Phase D's WFQ evidence on
`2026-05-10` manually accepted the recorded 1-to-4 diagnostic with capOS
`3.088x`/`2.700x` and matching Linux `3.974x`/`3.850x` on the same host/CPU
pin set. The harness still enforced only the configured 1-to-2 work/total
speedup gates. Historical pre-collapse 1-to-2
(`1.828x`/`1.687x`) and the post-collapse 3-run diagnostic on
`capos-bench` 2026-05-02 10:42 UTC (`1.890x`/`1.792x`,
`1.504x`/`1.436x`) remain in `docs/benchmarks.md` for reference.
Four-worker capOS scaling was a follow-up rather than a completed claim
under the pre-collapse model: the unsuppressed diagnostic recorded 1-to-4
work/total speedups `3.029x`/`2.386x`, while the run with scheduler switch
logs suppressed recorded `3.272x`/`2.303x`; remaining guest-measure evidence pointed at global
`Scheduler` lock contention plus exit/join/block/schedule overhead, and normal
scheduler-owned execution is still capped at temporary CPU slots 0-3.

Each process currently owns one or more `Thread` records; each thread owns its
saved CPU context, kernel stack, FS base, block state, and -- since Phase D
Task 2 -- the WFQ ordering inputs `weight: u16`, `latency_class: LatencyClass`,
and `virtual_finish_ns: u64`. The Phase D constants in
`capos-abi/src/scheduler.rs` set the defaults `weight = DEFAULT_WEIGHT` and
`latency_class = LatencyClass::Normal`, so unmodified workloads observe no
behavior change versus the pre-Phase-D scheduler. `virtual_finish_ns` is
recomputed on every enqueue (Task 2 shipped the derivation; Task 3 consumes
it for ordered insertion) and is not meaningful while the thread is blocked.

Phase D Task 2 split the per-thread CPU accounting record so the WFQ-load-
bearing fields are available in the normal `qemu` build:
`runtime_ns`, `virtual_runtime_ns`, and `last_started_ns` are unconditional;
`context_switches`, `preemptions`, `voluntary_blocks`, `migrations`,
`last_cpu`, and the `*_runtime_stable_observed` and `blocked`/`exited`
bookkeeping stayed behind the `measure` feature at Task 2 because they were
pure operator-observability counters that did not participate in dispatch
ordering and needed a separate operator snapshot path (the Phase H saturation
status surface later promoted `preemptions` and `voluntary_blocks`).
`runtime_ns` advances 1:1 with
elapsed CPU time, while `virtual_runtime_ns` advances by
`elapsed_ns * REFERENCE_WEIGHT / weight` so per-thread weight changes the
cumulative WFQ share rather than only the enqueue tag. The runtime-charge
path is invoked when a current thread stops running through timer preemption,
blocking `cap_enter` or park, thread/process exit, or direct switch/handoff
paths that select another current thread; the wrapping helpers in
`kernel/src/sched.rs` route through `Process::charge_thread_runtime` /
`Process::account_thread_scheduled` unconditionally now.
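
The charge rule can be sketched as below. `CpuAccounting` is a reduced,
hypothetical record, and `REFERENCE_WEIGHT = 64` is an assumption inferred from
the weight-change smoke (weight 128 yields a vruntime/runtime ratio near 0.5),
not a value read from `capos-abi`.

```rust
const REFERENCE_WEIGHT: u64 = 64; // assumption: equal to DEFAULT_WEIGHT

#[derive(Debug, Default)]
struct CpuAccounting {
    runtime_ns: u64,
    virtual_runtime_ns: u64,
}

/// Charge one elapsed span. `weight` is assumed validated as nonzero.
fn charge_runtime(acct: &mut CpuAccounting, elapsed_ns: u64, weight: u16) {
    // Real runtime advances 1:1 with elapsed CPU time.
    acct.runtime_ns += elapsed_ns;
    // Widen to u128 so the multiply cannot overflow before the divide.
    let scaled =
        u128::from(elapsed_ns) * u128::from(REFERENCE_WEIGHT) / u128::from(weight);
    acct.virtual_runtime_ns += scaled as u64;
}
```

Under that assumption, charging 1,000,000 ns at weight 128 advances
`virtual_runtime_ns` by 500,000 ns, while the same charge at weight 64 advances
it by the full 1,000,000 ns.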

The `SchedulingPolicyCap` cap surface mutates these per-thread fields through
the **caller-thread fallback** binding selected in Phase D Task 2: every
method (`setWeight`, `setLatencyClass`, `snapshot`) routes to
`CapCallContext::caller_thread`, so a holder can only mutate or observe its
own running thread. Cross-thread or cross-process authority is reserved for
the Phase H privileged scheduler policy service. The Task 2
`SchedulingPolicyCap.snapshot` reply intentionally exposed only the four
fields promoted out of the measure feature gate; `context_switches` and
`migrations` remain benchmark-only (`preemptions` and `voluntary_blocks` were
promoted later by the Phase H saturation status surface), and a future
operator-observability slice may add them
through a separate cap.

The BSP scheduler tick normally arrives through the
local APIC timer on vector 48 with LAPIC EOI after calibrating the LAPIC initial
count against PIT channel 2; if LAPIC setup or calibration is unavailable, the
kernel falls back to the legacy PIT/PIC IRQ0 path on vector 32. On each
user-mode timer tick (kernel-mode ticks bypass the scheduler entirely
through `kernel_timer_interrupt_handler`, as described under Design),
the kernel wakes timed-out or satisfied `cap_enter` and park waiters,
processes the current thread's ring endpoint in timer mode, saves the
current thread context, picks the next ready thread from the current
CPU's run queue (falling back to the bounded steal path when the local
queue has no runnable entry), switches CR3 when needed, updates
the current CPU's kernel-entry stack through the per-CPU hook,
restores FS base, mirrors the next `ThreadRef` into the current
`PerCpu`, and returns to the next user context.

When APs are online and their LAPIC timers start, scheduler CPU slots 0-3 can
temporarily own scheduler/user execution. The earlier AP-owner proof kept the
BSP in kernel idle; the current same-process scaling slice allows sibling
threads with distinct ring endpoints to run on different scheduler CPUs while
processes that hold broad launch/authority caps or live endpoint objects
remain pinned to the legacy single-owner CPU. Additional APs beyond CPU 3 stay
in kernel idle until a later scheduler-owner policy replaces the temporary CPU
mask. The runnable queues are a per-CPU array of `VecDeque<ThreadRef>` shared
by the scheduler-owned CPUs under the global scheduler lock and ordered
ascending by `virtual_finish_ns`; process/thread metadata remains shared under
that lock. A bounded steal path migrates the most overdue sibling
candidate (each sibling queue's first entry that the destination CPU
considers Runnable) when a CPU's local queue has no runnable entry.

Syscall entry initializes kernel GS with `swapgs`, saves the user RSP through
the GS-relative `PerCpu.user_rsp` slot, and switches to the GS-relative
`PerCpu.kernel_rsp` slot. Normal syscall returns swap back before `sysretq`.
Blocking `cap_enter`, process `exit`, and `ThreadControl.exitThread` paths that
leave through scheduler `iretq` restore use `restore_context_after_syscall` so
GS ownership is returned to userspace before the next user context resumes.

`Timer.sleep` records a bounded scheduler waiter keyed by caller `ThreadRef`,
user data, and an absolute monotonic `deadline_ns`. Due sleeps validate the
thread generation, post an empty completion directly to the caller's CQ, and
then flow through the same blocked `cap_enter` wake scan as other completions.
Each process has a separate sleep waiter quota, so one Timer holder cannot fill
the global sleep queue by itself.
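
A minimal sketch of the bounded queue with a per-process quota. The capacities,
the `SleepQueue` type, and its methods are assumptions for illustration; the
sketch uses heap-backed `Vec`/`HashMap` for brevity, which the kernel's bounded
storage would not, and it omits the thread-generation check.

```rust
use std::collections::HashMap;

const GLOBAL_SLEEP_WAITERS: usize = 4; // assumed capacity
const PER_PROCESS_SLEEP_WAITERS: usize = 2; // assumed quota

#[derive(Debug, PartialEq)]
struct SleepWaiter {
    pid: u32,
    user_data: u64,
    deadline_ns: u64, // absolute monotonic deadline
}

#[derive(Default)]
struct SleepQueue {
    waiters: Vec<SleepWaiter>,
    per_process: HashMap<u32, usize>,
}

impl SleepQueue {
    /// Admit a waiter only if both the global bound and the caller's
    /// per-process quota have room.
    fn register(&mut self, w: SleepWaiter) -> Result<(), &'static str> {
        if self.waiters.len() >= GLOBAL_SLEEP_WAITERS {
            return Err("global sleep queue full");
        }
        let used = self.per_process.entry(w.pid).or_insert(0);
        if *used >= PER_PROCESS_SLEEP_WAITERS {
            return Err("process sleep quota exhausted");
        }
        *used += 1;
        self.waiters.push(w);
        Ok(())
    }

    /// Remove and return every waiter whose deadline has passed, releasing
    /// its quota slot.
    fn take_due(&mut self, now_ns: u64) -> Vec<SleepWaiter> {
        let (due, pending): (Vec<_>, Vec<_>) = std::mem::take(&mut self.waiters)
            .into_iter()
            .partition(|w| w.deadline_ns <= now_ns);
        self.waiters = pending;
        for w in &due {
            if let Some(used) = self.per_process.get_mut(&w.pid) {
                *used -= 1;
            }
        }
        due
    }
}
```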

`ThreadControl.setFsBase` validates runtime-provided FS bases as user-canonical
addresses, updates the caller thread's saved FS base, and writes the CPU FS
base immediately when the caller is the running thread. There is no
process-global FS base; context switch treats FS base as per-thread state.
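
The validate-then-store order can be sketched as follows. The bound assumes
48-bit virtual addressing with user space in the lower canonical half; the
`Thread` struct, the `is_running` flag, and the `write_cpu_fs_base` callback
are hypothetical stand-ins for the kernel's own state and register write.

```rust
/// Highest lower-half canonical address under 48-bit virtual addressing
/// (assumption: 4-level paging, user space in the lower half).
const USER_CANONICAL_MAX: u64 = 0x0000_7FFF_FFFF_FFFF;

struct Thread {
    saved_fs_base: u64,
}

/// Reject non-user-canonical values before any state changes, then update
/// the saved per-thread value and, for a running thread, the CPU register.
fn set_fs_base(
    thread: &mut Thread,
    is_running: bool,
    fs_base: u64,
    write_cpu_fs_base: impl FnOnce(u64),
) -> Result<(), &'static str> {
    if fs_base > USER_CANONICAL_MAX {
        return Err("fs base is not a user-canonical address");
    }
    thread.saved_fs_base = fs_base;
    if is_running {
        write_cpu_fs_base(fs_base);
    }
    Ok(())
}
```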

The initial thread still uses the compatibility ring at `RING_VADDR`, while
each spawned child thread receives a kernel-chosen ring mapping in the process
ring arena. Run queues, per-CPU `current`, direct IPC handoff, Timer sleep
waiters, process/terminal waiters, endpoint caller/receiver records, and
deferred cancellation CQEs store generation-checked `ThreadRef` values and
route completions to the target thread's ring endpoint. Process-owned thread
and kernel-stack ledger limits are enforced by `ThreadSpawner.create` before
additional thread records become runnable. The frozen contract is in
[In-Process Threading](threading.md). Park wait uses a separate
`Blocked(Park { ... })` reason and park timeout/wake completions use reserved
CQE credits before marking generation-checked waiter threads runnable. The
authority and ABI contract is in [Park Authority](park.md).

`cap_enter(min_complete, timeout_ns)` processes pending SQEs immediately. If
the requested completion count is not available and the timeout permits
blocking, the current thread enters `Blocked(CapEnter { ... })` and the syscall
entry path switches to another runnable thread.

The LAPIC user-timer path enters `sched::schedule()` unconditionally on
every tick. An earlier slice carried a bounded user-mode continuation
fast path with a per-CPU one-skip budget and a release/acquire
slow-path-required summary; that path has been retired (see
`docs/backlog/scheduler-evolution.md` "Cleanup: Retire Benchmark-Driven
Scaffolding Before Phase D"). The fast path saved at most one scheduler
entry every other tick on an uncontended single-CPU-effective scheduler
while paying for shadow-state publication on every slow-path exit, so
the simpler always-schedule shape is preferred until a future Phase D
or Phase F slice ships an evidence pair where the fast path measurably
reduces scheduler-lock hold time on a contended SMP run.

When endpoint delivery satisfies a blocked server RECV, the scheduler can set a
direct IPC target. The next scheduling decision runs that server before ordinary
queued work when it is ready and its `ThreadRef` generation still matches
the captured direct target. When the direct slot is unavailable, endpoint
completions fall back to the queued path with `WakePolicy::QueueCpu(slot)`
targeting the current CPU's per-CPU queue, so the wake scan probes the placed
CPU first.
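
A minimal sketch of the generation-checked preference, assuming a reduced
`ThreadRef` and a hypothetical `ready_generation` lookup that returns the
current generation only for a live, ready thread. Consuming the slot on every
decision is a choice of this sketch, not a documented kernel behavior.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct ThreadRef {
    id: u32,
    generation: u32,
}

#[derive(Debug, PartialEq)]
enum Pick {
    Direct(ThreadRef),
    Queued(ThreadRef),
    Idle,
}

/// Prefer the direct IPC target when it is ready and its generation still
/// matches the captured reference; otherwise fall back to the queue head.
fn choose_next(
    direct_target: &mut Option<ThreadRef>,
    queue_head: Option<ThreadRef>,
    ready_generation: impl Fn(u32) -> Option<u32>,
) -> Pick {
    // Sketch choice: the preference slot is consumed on every decision.
    if let Some(target) = direct_target.take() {
        if ready_generation(target.id) == Some(target.generation) {
            return Pick::Direct(target);
        }
    }
    match queue_head {
        Some(t) => Pick::Queued(t),
        None => Pick::Idle,
    }
}
```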

## Design

The implementation keeps ring dispatch outside the global scheduler lock.
Timer dispatch extracts ring/cap/scratch handles, releases the scheduler lock,
processes bounded SQEs, then reacquires the scheduler lock to choose the next
thread. This prevents Cap'n Proto decode, serial output, and capability method
bodies from running under the global scheduler lock.

There is no longer a slow-path-required summary or a per-CPU skip
budget for the user-mode timer path. Every user-mode LAPIC timer tick
enters `sched::schedule()`, which services run-queue entries, direct
IPC targets, deferred process termination/drop and thread-stack
cleanup, Timer sleep waiters, and blocked threads with timer-backed
`cap_enter` or Park timeouts under the scheduler lock. Those timeout paths
compare absolute monotonic deadlines, but periodic ticks still decide when the
checks run. Ring SQEs and ordinary cap waiters run on the same per-tick
cadence. Kernel-mode timer ticks (e.g., on AP cores parked in the kernel idle
loop) still go through `kernel_timer_interrupt_handler`, which sends EOI
without entering the scheduler. The shared `advance_bsp_tick` helper still
increments the compatibility `TICK_COUNT` only on CPU 0; normal runtime
accounting and timeout comparisons use `monotonic_ns()` instead. Future per-CPU
fair-share slices may reintroduce a continuation path under explicit Phase D
or Phase F authority; until then the always-schedule shape keeps the
scheduler's authority over thread metadata and runnable ownership
single-source.

The runnable queues keep a single-owner contract behind the global
scheduler lock. A live generation-checked `ThreadRef` may have at most
one runnable dispatch owner across per-CPU `current`/`handoff_current`
slots, the per-CPU run queues, and the single `direct_ipc_target`
preference slot. Blocked waiters, sleep waiters, park waiters, endpoint
state, process waiters, and join waiters are not runnable owners; they
may make a thread ready only after liveness and generation checks
succeed.
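
The single-owner contract can be expressed as a counting check over the
owner-bearing slots. The `Dispatch` struct below is a reduced, hypothetical
model of `SchedulerDispatch` that keeps only those slots; `owner_count` is an
illustrative helper, not a kernel function.

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ThreadRef {
    id: u32,
    generation: u32,
}

struct Dispatch {
    current: Vec<Option<ThreadRef>>,
    handoff_current: Vec<Option<ThreadRef>>,
    run_queues: Vec<VecDeque<ThreadRef>>,
    direct_ipc_target: Option<ThreadRef>,
}

impl Dispatch {
    /// Count every runnable dispatch owner that names `t`. The contract
    /// holds when this never exceeds one for a live thread.
    fn owner_count(&self, t: ThreadRef) -> usize {
        let slots = self
            .current
            .iter()
            .chain(&self.handoff_current)
            .chain(std::iter::once(&self.direct_ipc_target));
        let slot_owners = slots.filter(|s| **s == Some(t)).count();
        let queued = self
            .run_queues
            .iter()
            .flatten()
            .filter(|q| **q == t)
            .count();
        slot_owners + queued
    }
}
```

A check of this shape could back a debug assertion after each ownership
transfer; a reference with a stale generation matches no owner.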

Migration between per-CPU queues is represented as a scheduler-lock-
contained transfer, not as a second published owner. The source owner
is removed or popped first and the `ThreadRef` is then inserted in the
destination queue at the position determined by a freshly recomputed
`virtual_finish_ns`, or selected as the next running thread.
`virtual_runtime_ns` travels with the thread; `virtual_finish_ns` is
recomputed at every enqueue and never carried as committed state, so
weight or class mutations applied while the thread was blocked take
effect on the next dequeue and re-enqueue. Retry paths requeue the
candidate after dropping duplicate queued copies. Direct IPC keeps its
preference slot only while the target remains live and runnable; if
the direct target cannot run immediately, it falls back through the
normal queued-owner path on the current CPU's per-CPU queue.

Idle-to-runnable wake targeting reuses the same ownership boundary. A
thread that becomes ready through endpoint completion, timer sleep,
park wake, process wait, or thread join is pushed to the placement
target's per-CPU run queue, and `wake_idle_scheduler_cpus_locked` first
probes the placement target when the policy is `QueueCpu`, then walks
eligible idle scheduler CPUs to wake the first that accepts a fresh
reschedule IPI; CPUs that already have a pending IPI (or that fail
LAPIC delivery) are skipped without breaking the scan, so a burst of
ready work cross-wakes more than one neighbor instead of stranding the
rest behind one already-targeted CPU. Direct IPC uses the same path.
Measurement builds expose aggregate and per-phase counters for wake
scans, eligible idle CPUs, targeted CPUs, IPIs sent, already-pending
IPI skips, not-ready target skips, missing LAPIC targets, and send
failures.
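
The skip-without-breaking scan can be sketched as below. `CpuState` and
`wake_idle_cpu` are hypothetical; setting `IdleIpiPending` stands in for a
successful reschedule IPI send, and `placement_target` is assumed to be a valid
slot index.

```rust
#[derive(Clone, Copy, PartialEq)]
enum CpuState {
    Busy,
    Idle,
    IdleIpiPending,
}

/// Returns the CPU that accepted a fresh reschedule IPI, if any.
fn wake_idle_cpu(cpus: &mut [CpuState], placement_target: Option<usize>) -> Option<usize> {
    // Probe the placement target first, then walk every slot in order.
    let order = placement_target.into_iter().chain(0..cpus.len());
    for cpu in order {
        if cpus[cpu] == CpuState::Idle {
            cpus[cpu] = CpuState::IdleIpiPending; // stands in for the IPI send
            return Some(cpu);
        }
        // Busy or already-pending CPUs are skipped without ending the scan.
    }
    None
}
```

Calling it once per newly ready thread wakes a different idle neighbor each
time, because a CPU that already has a pending IPI no longer accepts one.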

Each per-CPU run queue is reserved up to the live runnable-capable
thread count before publication; the shared live reservation count is
released on process/thread exit or pre-publication rollback. Reserving
each queue to the full live-thread count is required because the
bounded steal path may migrate every live thread into a single sibling
queue between two scheduler passes. Timer preemption, unblock, direct-
IPC fallback, requeue, and steal-requeue paths therefore must not
allocate while the thread is already live.
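
A minimal sketch of reserve-before-publish, using `VecDeque::try_reserve` so
that allocation failure surfaces at publication time. `RunQueues` and its
methods are hypothetical, and thread identities are reduced to `u32`.

```rust
use std::collections::{TryReserveError, VecDeque};

struct RunQueues {
    queues: Vec<VecDeque<u32>>,
    live_reserved: usize,
}

impl RunQueues {
    fn new(cpus: usize) -> Self {
        Self {
            queues: (0..cpus).map(|_| VecDeque::new()).collect(),
            live_reserved: 0,
        }
    }

    /// Grow every queue to hold one more live thread before it is published.
    fn reserve_for_new_thread(&mut self) -> Result<(), TryReserveError> {
        let wanted = self.live_reserved + 1;
        for q in &mut self.queues {
            // `try_reserve` takes capacity beyond the current length.
            q.try_reserve(wanted.saturating_sub(q.len()))?;
        }
        self.live_reserved = wanted; // commit only after every queue succeeded
        Ok(())
    }

    /// Hot-path push; the earlier reservation means no reallocation here.
    fn push_reserved(&mut self, cpu: usize, thread: u32) {
        let q = &mut self.queues[cpu];
        debug_assert!(q.len() < q.capacity(), "run queue was not reserved");
        q.push_back(thread);
    }
}
```

Every queue is sized to the full live count, so all live threads can land in
one queue without growing it.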

Process and thread exit cleanup proves the removal side of that
ownership contract at the cleanup site. After removing queued owners
and clearing a matching direct IPC target, the scheduler lock remains
held while the kernel scans every per-CPU runnable queue and the direct
target slot; any stale exiting process or thread reference is a kernel
assertion failure. The focused spawn smoke asserts the corresponding
serial proof markers on exercised process and thread exit paths.

The Phase C migration order is constrained by hardware state, not only by
scheduler data structures. The first gate moved syscall entry/exit off
BSP-symbol-relative `PerCpu` fields and onto `KernelGsBase`/`swapgs` on user
syscall paths, including blocking `cap_enter`, `exit`, and
`ThreadControl.exitThread` paths that leave through `iretq` rather than the
normal `sysretq` epilogue. The second gate added xAPIC initialization, a
PIT-calibrated BSP LAPIC timer tick, LAPIC EOI routing, AP LAPIC
initialization, a LAPIC spurious-vector handler, and an IPI vector plus bounded
vector-49-only fixed IPI send primitive. The third gate added address-space
resident CPU masks, per-CPU pending full-TLB flush generations, completion
waits, and a vector-49 TLB shootdown handler for user page-table `map`,
`unmap`, and `protect`. The fourth gate split current-thread tracking into
per-CPU slots, registered AP `PerCpu` records for current-thread and syscall
stack mirrors, added AP TSS.RSP0 updates on context switches, and handed the
single scheduler-owner role to AP cpu=1 when it is online with a programmed
LAPIC timer.

The LAPIC slice replaces the BSP-oriented PIT/PIC scheduler tick on supported
QEMU and hardware paths. `kernel/src/arch/x86_64/idt.rs` keeps vector 32 for the
PIT/PIC fallback, reserves vector 48 for LAPIC timer delivery plus vector 49 for
cross-CPU requests, and installs vector 255 for LAPIC spurious interrupts.
`pic.rs` can remap and mask all legacy IRQs once LAPIC ticks are active, and
`context.rs` sends LAPIC EOI or PIC EOI according to the active timer source.
The IPI vector now handles TLB shootdown requests and bounded reschedule
requests for AP idle-to-runnable handoff.

The TLB slice wraps user page-table mutations that can affect an address space
resident on another CPU. `AddressSpace::map`, `AddressSpace::unmap`, and
`AddressSpace::protect` still perform the local `x86_64` mapper flush, then
call the architecture shootdown helper with the address space's resident CPU
mask. The helper records pending full-TLB flush generations for online resident
CPUs other than the caller, sends vector-49 IPIs, and returns a completion token.
Capability handlers drop the address-space guard and enqueue completion work;
`cap_enter` and timer polling drain that queue after ring dispatch releases the
cap-table and scratch locks. This keeps a remote syscall that is contending on
the same process locks from blocking maskable IPI delivery forever. Capability
handlers reserve fixed-size deferred queue slots before page-table mutation, so
full queues fail closed as capability overload errors instead of surfacing after
rollback, unmap, or protect has already changed state. Drains flush the current
CPU before waiting so a CPU that is itself in the target mask cannot wait on its
own pending generation. Target CPUs drain the generation in the IPI handler, at
syscall entry, or before returning to userspace from syscall, timer, and
scheduler restore paths.
Generation counters avoid losing overlapping shootdowns while a target CPU is
already draining a prior request. This relies on kernel user-buffer access
continuing through address-space-locked HHDM copy/read helpers rather than raw
user virtual addresses while a delayed flush generation exists. Callers include
`VirtualMemoryCap` dispatch through `parse_map`, `parse_unmap`, and
`parse_protect`, plus `MemoryObjectCap::{map,unmap,protect}` in
`kernel/src/cap/frame_alloc.rs`. Scheduler CR3 handoff now marks the selected
address space resident on the current CPU, including AP cpu=1 during the AP
scheduler-owner proof.
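
The generation-counter idea can be sketched with two atomics per target CPU.
This is an illustrative model, not the kernel's layout: `CpuFlushState`,
`request_flush`, `drain`, and `is_complete` are hypothetical names, and the
`flush_tlb` callback stands in for the actual full-TLB flush.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Default)]
struct CpuFlushState {
    requested: AtomicU64, // latest generation asked of this CPU
    completed: AtomicU64, // latest generation this CPU has flushed
}

/// Requester side: publish a new generation and return it as the
/// completion token.
fn request_flush(cpu: &CpuFlushState) -> u64 {
    cpu.requested.fetch_add(1, Ordering::AcqRel) + 1
}

/// Target side: flush once, then mark every generation requested before
/// the flush as done. Returns whether a flush was performed.
fn drain(cpu: &CpuFlushState, flush_tlb: impl FnOnce()) -> bool {
    let target = cpu.requested.load(Ordering::Acquire);
    if cpu.completed.load(Ordering::Acquire) >= target {
        return false;
    }
    flush_tlb();
    cpu.completed.fetch_max(target, Ordering::AcqRel);
    true
}

/// Requester side: a token is complete once the target has flushed at or
/// past its generation.
fn is_complete(cpu: &CpuFlushState, token: u64) -> bool {
    cpu.completed.load(Ordering::Acquire) >= token
}
```

Two overlapping requests are satisfied by one flush in this model, and a
request that arrives after the target sampled `requested` stays incomplete
until the next drain, so it is not lost.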

### Idle paths

There are two distinct idle paths, and both run genuine **CPL0 (kernel-mode)**
idle. There is no user-mode idle process: when no real work is runnable a CPU
runs the kernel idle code at CPL0 on the kernel PML4. The two paths differ only
in how the CPU got there.

The **cooperative CPL0 kernel-mode idle path** is the boot/AP path. `start`
(BSP), `start_ap` (APs), and the `start_current_cpu` loop call
`next_start_context`; when that returns no real runnable work they fall into
`idle_current_cpu_once`, which `hlt`s at CPL0 on the per-CPU kernel stack with
interrupts enabled (no `CpuContext`, no `restore_context` — the same way
`start_current_cpu` itself runs). A kernel-CPL timer tick or reschedule IPI
taken during that `hlt` runs the kernel-mode handler
(`kernel_timer_interrupt_handler` / `handle_reschedule_ipi`, both of which call
`nohz_recheck`), so the nohz one-shot deadline is preserved and re-armed across
the `hlt`; control then returns to the loop, which re-checks for work.
`idle_current_cpu_once` increments the `KERNEL_IDLE_HLT_ENTRIES` counter and
emits a bounded
`cpu-isolation: kernel-idle hlt cpu=… idle_path=cooperative-cpl0 … nohz_active=… timer_source=…`
log line so this path is observable from the kernel log; the
`run-scheduler-cpu-isolation-lease` smoke asserts it is reached. Once any
dispatch path `restore_context`s into a real thread, the `start_current_cpu`
frame is abandoned.

The **steady-state CPL0 idle-thread path** is reached from the four
interrupt/syscall-return dispatch call sites — `schedule()` (timer),
`capos_block_current_syscall`, `exit_current`, and `exit_current_thread`. When
`choose_next_locked` falls through to this CPU's idle thread, each site builds
the dispatch tuple from the per-CPU CPL0 idle-thread context. The dispatch
call sites hand a `CpuContext` to assembly that `restore_context`s (or, for the
timer path, return a context pointer plus a CR3 the timer handler loads), so
they need a schedulable context when no real work is runnable; the CPL0 idle
context is that context.

**CPL0 idle-thread context infrastructure.** `arch::smp::init_idle_kernel_stacks`
allocates one dedicated CPL0 idle kernel stack per scheduler CPU slot from
fresh contiguous frame ranges, so they do not overlap the boot kernel stacks,
the per-thread kernel stacks, or the IST slots. `CpuContext::new_cpl0_idle`
builds a kernel-shaped context (kernel-code/kernel-data selectors,
`rip = kernel_idle_entry`, `rsp` into the idle kernel stack). `sched::sched_init`,
called from `kmain`, constructs and stores one `CpuContext` per CPU slot in
`CPL0_IDLE_CONTEXTS` and then calls `register_idle_process_locked` to seed the
**slot-0 synthetic idle `Process` record** before the scheduler runs (this
keeps the BSP idle process's low PID and the init-process PID ordering stable);
the remaining per-CPU slots are registered lazily by
`current_cpu_idle_thread_locked` the first time their CPU reaches idle.
`sched_init` panics on OOM, as does the lazy path: the CPL0 idle contexts and
the synthetic idle records are scheduler idle infrastructure and there is no
fallback idle path, so a failure to build them is unrecoverable. The idle
kernel stack is sized as a **full per-thread kernel stack**
(`PROCESS_THREAD_KERNEL_STACK_PAGES`), not an IST slot, because
`kernel_idle_entry` runs the deep `service_periodic_work()` call chain on it
(see periodic-service parity below).
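
As a rough illustration of what a kernel-shaped idle context carries, the following sketch builds only the `iretq` tail of such a context. The struct, the constructor name `new_cpl0_idle_frame`, and the constants are hypothetical stand-ins, not the kernel's `CpuContext::new_cpl0_idle`; the selector values are the ones quoted elsewhere in this document, and general-purpose registers are omitted.

```rust
/// Selector values quoted in this document for the kernel code/data segments.
const KERNEL_CODE_SELECTOR: u64 = 0x08;
const KERNEL_DATA_SELECTOR: u64 = 0x10;
/// RFLAGS with the always-one reserved bit 1 and IF (bit 9) set, so a timer
/// tick can interrupt the idle `hlt`.
const RFLAGS_IDLE: u64 = (1 << 1) | (1 << 9);

/// Only the five-element `iretq` tail of a context (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IdleFrame {
    pub rip: u64,
    pub cs: u64,
    pub rflags: u64,
    pub rsp: u64,
    pub ss: u64,
}

/// Build a kernel-shaped idle frame. `entry` stands in for the address of
/// `kernel_idle_entry`; `stack_top` is the top of the dedicated idle stack.
pub fn new_cpl0_idle_frame(entry: u64, stack_top: u64) -> IdleFrame {
    IdleFrame {
        rip: entry,
        cs: KERNEL_CODE_SELECTOR,
        rflags: RFLAGS_IDLE,
        rsp: stack_top,
        ss: KERNEL_DATA_SELECTOR,
    }
}

/// Requested privilege level: the low two bits of a segment selector.
pub fn rpl(selector: u64) -> u64 {
    selector & 0b11
}
```

Both selectors have RPL 0 and `rsp` points into the idle stack, which is the shape the dispatch sites expect when they route a CPU onto the idle thread.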

**Synthetic idle process records.** The idle thread is never a runnable
user-mode process. The synthetic idle `Process` (`Process::new_idle`) maps no
user code, no user stack, and no cap ring, and carries an empty cap table. It
exists only so the idle `ThreadRef` resolves through `sched.processes` and the
scheduler's `ThreadRef`-centric bookkeeping — `set_thread_state`,
`account_thread_selected_locked`, current-thread tracking, and the
`is_idle_thread` guard predicate used pervasively across the scheduler — keeps
working unchanged. Its `address_space` is a bare page-table root with nothing
user-mapped; it is required by the `Process` struct but is **never loaded as
CR3**. Every idle dispatch site routes the CPU onto the kernel PML4 via the
CPL0 idle context, so the synthetic idle `AddressSpace` is never made resident
and never participates in `resident_cpu_mask` or TLB-shootdown idle-residency
handling.

**Dispatch-tuple rewire.** After `choose_next_locked` returns, when the chosen
thread is `idle_threads[current_cpu_slot()]`, each dispatch site builds the
dispatch tuple from the CPL0 context pointer, the dedicated idle kernel stack
top, the kernel PML4 CR3, and the current FS base (no FS-base change).
`sched_init` builds one CPL0 idle context per scheduler CPU slot or panics, so
`cpl0_idle_context(slot)` is infallible at every dispatch site. The
`schedule()` timer path does **not** route through a dedicated CR3-loading
restore helper: the existing `timer_interrupt_handler` already loads the
tuple's CR3 with `write_cr3` before the privilege-agnostic five-element
`iretq`. The three syscall-path sites (`capos_block_current_syscall`,
`exit_current`, `exit_current_thread`) keep their
`restore_context_after_syscall` restore tail: they are entered via
`syscall_entry` (which already executed `swapgs`), so the exit `swapgs` is
required to leave the CPL0 idle thread running with the *user* GS base — the
same GS-base state the timer path's CPL0 idle thread runs with. Each site emits
a distinct marker: `sched: dispatch idle cpu=N idle_path=cpl0-dispatch-timer`
(timer), `…cpl0-dispatch-block` (blocking syscall), and `…cpl0-dispatch-exit`
(both `exit_current` and `exit_current_thread`). `debug_assert!`s guard the
CPL0 dispatch tuple: context `cs`/`ss` are the kernel selectors and their RPL
bits are 0.
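
The tuple selection described above can be sketched as a single branch on whether the chosen thread is this CPU's idle thread. All types and field names here (`DispatchTuple`, `IdleSlot`, `RealThread`, `build_dispatch_tuple`) are hypothetical simplifications, not the kernel's definitions.

```rust
/// Illustrative thread identity; the real `ThreadRef` layout is not shown here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ThreadRef {
    pub pid: u32,
    pub tid: u32,
    pub generation: u32,
}

/// What a dispatch site hands to the restore path (hypothetical shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DispatchTuple {
    pub context_ptr: usize,
    pub kernel_stack_top: usize,
    pub cr3: u64,
    pub fs_base: u64,
}

/// Per-CPU idle resources built once at scheduler init.
pub struct IdleSlot {
    pub thread: ThreadRef,
    pub context_ptr: usize,
    pub stack_top: usize,
}

/// Dispatch state of a real (non-idle) thread.
pub struct RealThread {
    pub context_ptr: usize,
    pub kernel_stack_top: usize,
    pub address_space_cr3: u64,
    pub fs_base: u64,
}

pub fn build_dispatch_tuple(
    chosen: ThreadRef,
    idle: &IdleSlot,
    real: Option<&RealThread>,
    kernel_pml4: u64,
    current_fs_base: u64,
) -> DispatchTuple {
    if chosen == idle.thread {
        // Idle: CPL0 context, dedicated idle stack, kernel PML4, FS base unchanged.
        DispatchTuple {
            context_ptr: idle.context_ptr,
            kernel_stack_top: idle.stack_top,
            cr3: kernel_pml4,
            fs_base: current_fs_base,
        }
    } else {
        let t = real.expect("non-idle thread must have dispatch state");
        DispatchTuple {
            context_ptr: t.context_ptr,
            kernel_stack_top: t.kernel_stack_top,
            cr3: t.address_space_cr3,
            fs_base: t.fs_base,
        }
    }
}
```

The idle branch never touches the synthetic idle process's address space, which matches the rule that it is never loaded as CR3.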

**CPL0 idle periodic-service parity.** `schedule()`'s timer Phase 2 runs
periodic service work on every tick — deferred process drops, pending
terminations, `wake_cap_waiters`, `service_sqpoll_workers()`,
`drain_pending_endpoint_cancellations()`, `terminal_session::poll_input()`,
`virtio::poll_scheduler()`, and the network / pipe / interrupt `poll_waiters()`
calls. A CPL0 idle thread's timer ticks are kernel-mode and go through
`kernel_timer_interrupt_handler`, which never enters `schedule()` — so without
explicit parity handling that servicing would be stranded whenever a CPU is
parked on the CPL0 idle thread. That work is factored into a single
`service_periodic_work()` function with one lock discipline: the scheduler lock
is taken only for the bounded deferred-drop / thread-stack-release /
`wake_cap_waiters` / pending-termination extraction, then **dropped** before
`drop_pending_process` / `finish_terminated_process` and the lock-free poll
block. `schedule()` calls it after ring dispatch; `kernel_idle_entry` is its
own cooperative loop that, each iteration, runs `service_periodic_work()`, then
`next_start_context(false)` to re-dispatch a real runnable thread the moment
one appears (`allow_idle = false` so it never re-selects the idle thread), then
`idle_current_cpu_once()` to `hlt`. The re-dispatch is required: without it a
kernel-mode timer tick taken during the idle `hlt` returns through
`kernel_timer_interrupt_handler`, which does not re-enter `schedule()`, so the
CPU would be stranded. `service_periodic_work()` and `next_start_context()` run
with **interrupts disabled** in that loop. The CPL0 idle context is built with
`IF=1` so that the periodic tick can preempt the `hlt`, which means the loop
must `cli` before the deep service call. Otherwise a CPL0 timer tick taken
*during* `service_periodic_work()` would nest a
`kernel_timer_interrupt_handler` frame onto the idle kernel stack, because
same-privilege interrupts do not switch stacks.
`idle_current_cpu_once` re-enables interrupts only across its `enable_and_hlt`
and disables them again before returning. There is no double-service: a CPU
running a real thread gets the service block via `schedule()`, a CPU on the
CPL0 idle thread gets it via the `kernel_idle_entry` loop, and a given tick on
a given CPU is CPL3 (`schedule()`) xor CPL0-idle (the loop). nohz cadence stays
honest because the loop iterates at the timer/IPI cadence — when the periodic
tick is suppressed the re-armed one-shot still wakes the `hlt`, so
`service_periodic_work()` still runs.
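
The ordering inside the idle loop can be modelled in plain Rust with a mock CPU. This is a sketch only: the `IdleOps` trait, `MockCpu`, and the idea that the loop returns a context handle (instead of dispatching and never returning) are assumptions made so the control flow can be exercised off-target.

```rust
/// Operations the idle loop needs; names mirror the prose, signatures are assumed.
pub trait IdleOps {
    fn disable_interrupts(&mut self);
    fn service_periodic_work(&mut self);
    /// Returns a context handle when a real thread is runnable.
    fn next_start_context(&mut self, allow_idle: bool) -> Option<usize>;
    /// Enables interrupts, halts, and disables interrupts again before returning.
    fn idle_current_cpu_once(&mut self);
}

/// One pass: service with interrupts off, try to re-dispatch, otherwise halt.
pub fn idle_iteration<O: IdleOps>(ops: &mut O) -> Option<usize> {
    ops.disable_interrupts();
    ops.service_periodic_work();
    if let Some(ctx) = ops.next_start_context(false) {
        return Some(ctx);
    }
    ops.idle_current_cpu_once();
    None
}

/// Loop until a real thread appears (the kernel would dispatch it instead).
pub fn idle_loop<O: IdleOps>(ops: &mut O) -> usize {
    loop {
        if let Some(ctx) = idle_iteration(ops) {
            return ctx;
        }
    }
}

/// Mock that records call order; a thread becomes runnable after `wake_after` halts.
#[derive(Default)]
pub struct MockCpu {
    pub log: Vec<&'static str>,
    pub halts: u32,
    pub wake_after: u32,
    pub interrupts_enabled: bool,
}

impl IdleOps for MockCpu {
    fn disable_interrupts(&mut self) {
        self.interrupts_enabled = false;
        self.log.push("cli");
    }
    fn service_periodic_work(&mut self) {
        assert!(!self.interrupts_enabled, "service must run with interrupts off");
        self.log.push("service");
    }
    fn next_start_context(&mut self, allow_idle: bool) -> Option<usize> {
        assert!(!allow_idle, "the idle loop must not re-select the idle thread");
        self.log.push("next");
        if self.halts >= self.wake_after { Some(0x2000) } else { None }
    }
    fn idle_current_cpu_once(&mut self) {
        self.interrupts_enabled = true; // window in which the tick can wake `hlt`
        self.halts += 1;
        self.interrupts_enabled = false; // disabled again before returning
        self.log.push("hlt");
    }
}
```

With `wake_after = 2` the mock halts twice and then returns the runnable context on the third pass, having serviced periodic work on every pass.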

#### `iretq` CPL0 restoration invariant and CPL0 idle-thread prerequisites

This subsection records the load-bearing x86-64 architectural invariant that
the CPL0 idle-thread context depends on, along with the prerequisites the
implementation meets.

**Authoritative reference:** Intel 64 and IA-32 Architectures Software
Developer's Manual (SDM), Volume 2A, `IRET`/`IRETQ` instruction reference,
"Operation" pseudocode (the `IF OperandSize = 64` / 64-bit-mode path), and
Volume 3A, Section 6.14.3 "Returning from an Exception or Interrupt
Procedure." The description below applies to `IRETQ` in 64-bit long mode;
the legacy 32-bit `IRET` paths behave differently and are called out
explicitly where it matters.

**`iretq` frame layout and the 64-bit unconditional five-element pop.**
`iretq` in 64-bit long mode **unconditionally** pops five 64-bit (8-byte)
values from the top of the current kernel stack, in order: `RIP`, `CS`,
`RFLAGS`, `RSP`, `SS`. This is true **regardless of whether the privilege
level changes** — both a CPL0→CPL3 return and a CPL0→CPL0 return consume the
same five-element frame and load `RSP`:`SS` from it. AMD deliberately removed
the legacy conditional stack switch for long mode: the "skip `SS`:`ESP` on a
same-privilege return" behavior exists **only** in the legacy 32-bit `IRET`
operand-size paths, never in `IRETQ`.

- **CPL0 → CPL3 (privilege change, ring exit):** The target `CS` has RPL=3,
  which differs from the current CPL=0. The CPU installs `RIP`, `CS`, and
  `RFLAGS` from the frame, then loads `RSP` and `SS` from the same frame and
  transfers to the user-space instruction at `RIP` on the user stack.
- **CPL0 → CPL0 (same-privilege, no ring change):** The target `CS` has
  RPL=0, matching the current CPL=0. `iretq` **still pops all five elements**:
  it installs `RIP`, `CS`, and `RFLAGS`, and **also loads `RSP` and `SS`**
  from the frame, exactly as in the CPL3 case. There is no same-privilege
  short-circuit in 64-bit mode. The practical consequence for a CPL0 restore
  is the opposite of the legacy intuition: the frame's `rsp` and `ss` fields
  are load-bearing and **must** carry a valid kernel stack pointer and a valid
  RPL=0 stack selector, because the CPU will load them.
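
The unconditional five-element pop can be illustrated with a small software model. This is not how the CPU is implemented and not kernel code; `IretqFrame`, `model_iretq_pop`, and `valid_cpl0_target` are hypothetical names, and the validity check covers only the selector RPL bits and a non-null stack pointer.

```rust
/// The five 64-bit slots `iretq` consumes in 64-bit mode, lowest address first.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IretqFrame {
    pub rip: u64,
    pub cs: u64,
    pub rflags: u64,
    pub rsp: u64,
    pub ss: u64,
}

/// Model of the 64-bit pop: always reads five slots, whatever the target CPL.
/// Returns the frame and the remaining stack, or `None` if fewer than five
/// slots are available.
pub fn model_iretq_pop(stack: &[u64]) -> Option<(IretqFrame, &[u64])> {
    if stack.len() < 5 {
        return None;
    }
    let (frame, rest) = stack.split_at(5);
    Some((
        IretqFrame {
            rip: frame[0],
            cs: frame[1],
            rflags: frame[2],
            rsp: frame[3],
            ss: frame[4],
        },
        rest,
    ))
}

/// Minimal shape check for a CPL0 target: RPL=0 on both selectors and a
/// non-null stack pointer, because `rsp` and `ss` are loaded from the frame.
pub fn valid_cpl0_target(f: &IretqFrame) -> bool {
    f.cs & 0b11 == 0 && f.ss & 0b11 == 0 && f.rsp != 0
}
```

The model has no same-privilege branch: a kernel-selector frame and a user-selector frame both consume exactly five slots.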

**Current code.** `restore_context` (`kernel/src/arch/x86_64/context.rs`
lines 311–328) sets `RSP` to the supplied `CpuContext` pointer, pops all
fifteen caller-saved and callee-saved GPRs (lines 315–327), and executes
`iretq` (line 328). The `CpuContext` struct (`context.rs` lines 133–155)
places `rip`, `cs`, `rflags`, `rsp`, and `ss` at the high end of the struct
(lines 150–154), matching the hardware interrupt-frame layout that the CPU
pushes when it enters the timer interrupt handler. The comment at line 149
("Pushed by CPU on interrupt from Ring 3") reflects how every `CpuContext` is
populated today, but the five-element `iretq` frame itself is not
CPL3-specific — `iretq` consumes the same five elements for any target CPL.

**User-thread contexts.** Every *user-thread* `CpuContext` is built by
`Thread::new_user` (`kernel/src/process.rs`), which sets
`cs = sel.user_code.0 as u64` (RPL=3, value `0x23`) and
`ss = sel.user_data.0 as u64` (RPL=3, value `0x1B`). Every `iretq` issued by
`restore_context` or `restore_context_after_syscall` into a user thread is
therefore a CPL0→CPL3 privilege change into a fully user-shaped context.

**CPL0 idle contexts coexist with user contexts.** Supporting a CPL0 target
does *not* hinge on `iretq` frame arithmetic: `iretq` pops the same five
elements for a CPL0 target as for a CPL3 target, so a frame carrying kernel
selectors and a valid kernel `rsp` `iretq`s correctly. The real requirements
are in the surrounding dispatch plumbing, all of which the CPL0 idle path
satisfies:

- **CR3.** The dispatch call sites set `CR3` to the kernel PML4 for the CPL0
  idle path, not to any user `AddressSpace` page table. The synthetic idle
  `Process`'s `AddressSpace` is never loaded as CR3.
- **`swapgs` / GS-base.** A CPL0 idle context was never entered through the
  `syscall` path. The `schedule()` timer path reaches it through the timer
  handler's own CR3 load and the privilege-agnostic `iretq` tail (no `swapgs`
  in that path at all). The three syscall-path sites
  (`capos_block_current_syscall`, `exit_current`, `exit_current_thread`) keep
  their `restore_context_after_syscall` tail: those sites *were* entered via
  `syscall_entry` (which already `swapgs`ed), so the exit `swapgs` is required
  to undo it — leaving the CPL0 idle thread running with the *user* GS base,
  the same state the timer path produces.
- **Kernel-code and kernel-data selectors.** A CPL0 `CpuContext` uses
  `cs = sel.kernel_code.0 as u64` (RPL=0, value `0x08`) and
  `ss = sel.kernel_data.0 as u64` (RPL=0, value `0x10`). Because `iretq` loads
  `ss` unconditionally in 64-bit mode, `ss` must be a valid RPL=0 stack
  selector; the GDT data-selector privilege checks require an RPL=0 `ss` to be
  paired with an RPL=0 `cs`, so the whole context (`cs`, `ss`, `rsp`, CR3, GS
  base) is kernel-shaped together.
- **Idle kernel stack.** Each CPL0 idle thread has its own dedicated kernel
  stack (`arch::smp::init_idle_kernel_stacks`) that does not overlap any IST
  slot, any per-thread kernel stack, or the BSP/AP boot stacks. Because `iretq`
  loads `rsp` from the frame, the context's `rsp` points into this dedicated
  stack. It is sized as a full per-thread kernel stack because
  `kernel_idle_entry` runs the deep `service_periodic_work()` call chain on it.
- **No user `AddressSpace` residency.** The synthetic idle `Process`'s
  `AddressSpace` is never made resident and never participates in
  `resident_cpu_mask`, so TLB shootdown never stalls waiting for an idle CPU.
- **No blocking, no exit.** The idle thread never calls `cap_enter`, parks,
  blocks on any waiter, or exits. The `Invariants` section entry "The idle
  thread must never block in `cap_enter` or exit" carries forward unchanged.

`CpuContext::new_cpl0_idle` builds the kernel-shaped context,
`sched::kernel_idle_entry` is the entry point, and `sched::sched_init` wires
the per-CPU CPL0 idle contexts and seeds the slot-0 synthetic idle process
record (the remaining slots' records are registered lazily by
`current_cpu_idle_thread_locked`). All four dispatch call sites — `schedule()`,
`capos_block_current_syscall`, `exit_current`, `exit_current_thread` — route
idle dispatch onto the CPL0 idle context: the timer path returns the CPL0
context pointer plus the kernel PML4 CR3 in its dispatch tuple and relies on
the existing `timer_interrupt_handler` CR3-load; the three syscall-path sites
keep their `restore_context_after_syscall` tail so the syscall-entry `swapgs`
is undone. The CPL0 contexts are kernel-shaped across `cs`, `ss`, `rsp`, and
CR3 together.

## Measurement Policy

Design grounding for this policy: this document's scheduler invariants,
`docs/backlog/scheduler-evolution.md`,
`docs/proposals/scheduler-evolution-proposal.md`,
`docs/research/future-scheduler-architecture.md`,
`docs/research/out-of-kernel-scheduling.md`,
`docs/research/nohz-sqpoll-realtime.md`, and
`docs/research/completion-ring-threading.md`. In particular,
`docs/research/future-scheduler-architecture.md` keeps the always-on versus
benchmark-only scheduler telemetry split as an open scheduler question, and the
current answer is intentionally conservative.

The current `kernel/src/measure.rs` counters are benchmark instrumentation, not
normal operator observability. They stay behind the `measure` feature and
`CAPOS_THREAD_SCALE_GUEST_MEASURE=1` because they add atomics, cycle-counter
reads, phase bookkeeping, and in some cases sampled user RIP values to hot
scheduler, timer, TLB, ring, and serial paths. Normal QEMU and dispatch builds
must not depend on those counters being present.

The per-thread runtime-accounting ledger is split. The WFQ load-bearing core
fields, `runtime_ns`, `virtual_runtime_ns`, and `last_started_ns`, are
unconditional normal-build state on `ThreadCpuAccounting`: WFQ ordering,
`SchedulingPolicyCap.snapshot`, and `SchedulingContext` budget charging depend
on them outside `cfg(feature = "measure")`. The diagnostic fields
(`context_switches`, `preemptions`, `voluntary_blocks`, `migrations`,
`last_cpu`, blocked/exited stability probes, placement buckets, and per-phase
attribution counters) stay behind the `measure` feature. Permanent operator
observability is still separate work: it should expose low-rate, non-symbolic
snapshots derived from the unconditional ledger plus event counters such as
runnable queue depths or high-water marks, reschedule IPI sent/failed/pending
counts, TLB shootdown request/failure counts, and scheduler policy admission
or denial counts. Those counters must not allocate, log, read raw user PCs, or
perform cycle-timing in timer, unblock, direct-IPC fallback, requeue, or
steal-requeue paths.
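
A minimal sketch of the split ledger follows. It is not the kernel's `ThreadCpuAccounting`: the field types, the `on_selected`/`on_descheduled` method names, `BASE_WEIGHT`, and the weight-scaled virtual-runtime formula (a common WFQ convention) are all assumptions for illustration. Only two of the diagnostic fields are shown.

```rust
/// Hypothetical base weight; the kernel's weight scale is not stated here.
pub const BASE_WEIGHT: u64 = 1024;

#[derive(Debug, Default)]
pub struct ThreadCpuAccounting {
    // Unconditional WFQ core: present in every build.
    pub runtime_ns: u64,
    pub virtual_runtime_ns: u64,
    pub last_started_ns: Option<u64>,
    // Diagnostic counters: compiled only into `measure` builds.
    #[cfg(feature = "measure")]
    pub context_switches: u64,
    #[cfg(feature = "measure")]
    pub preemptions: u64,
}

impl ThreadCpuAccounting {
    /// Record the start of a running interval.
    pub fn on_selected(&mut self, now_ns: u64) {
        self.last_started_ns = Some(now_ns);
        #[cfg(feature = "measure")]
        {
            self.context_switches += 1;
        }
    }

    /// Charge only the running interval; a thread that was not running
    /// (no recorded start) is charged nothing.
    pub fn on_descheduled(&mut self, now_ns: u64, weight: u64) {
        if let Some(start) = self.last_started_ns.take() {
            let delta = now_ns.saturating_sub(start);
            self.runtime_ns = self.runtime_ns.saturating_add(delta);
            // Assumed convention: heavier weights accrue virtual time more slowly.
            let scaled = delta.saturating_mul(BASE_WEIGHT) / weight.max(1);
            self.virtual_runtime_ns = self.virtual_runtime_ns.saturating_add(scaled);
        }
    }
}
```

The charging path touches only the unconditional fields and performs no allocation, logging, or cycle-counter reads, so it behaves the same with or without the `measure` feature.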

Benchmark-only attribution stays in `measure`: per-phase thread-scale
checkpoints, guest cycle timings for ring/capnp/method/scheduler segments,
scheduler-lock wait and hold cycles, scheduler-lock site attribution, serial
byte attribution, timer-mode breakdown, CR3/TLB event totals, thread-placement
selection/migration buckets, raw user-PC samples,
logging-suppression A/B evidence, and workload/cacheline diagnostics. The
publish-placement publish/caller-aware buckets were retired with the per-CPU
run-queue collapse and remain retired: Phase D Task 3 brought back per-CPU
placement semantics and Phase D shipped the fair-share enqueue policy, but
neither reintroduced those placement counters.
A future branch may promote a specific event count only by adding the
normal-build storage/API and proving the same emergency-path constraints; the
retired buckets may be reinstated only through a separate
operator-observability slice that proves those constraints. Neither should
simply remove the current `cfg(feature = "measure")` boundaries from the
benchmark module or the historical buckets.

Tickless idle is enabled only for true idle. A scheduler-owned CPU may mask the
periodic LAPIC tick when it is running the CPL0 idle context, has no runnable
non-idle work, has no active `CpuIsolationLease` nohz record, has no local
deferred cleanup, has no cap-enter polling dependency, and the one-shot
clockevent plus non-tick-derived monotonic clocksource are available. The
replacement one-shot is bounded by the nearest `Timer`/`ParkSpace` deadline or
a 100 ms idle housekeeping floor, and the scheduler restores
periodic mode before non-idle dispatch, reschedule-IPI wake, or rollback.
Cap-enter polling waiters, including the current terminal shell path, and
ready threads paused in a `SchedulingContext` retry window keep the periodic
tick until those dependencies move behind explicit deadlines or housekeeping
placement.
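
The true-idle conditions and the one-shot bound can be written as two small pure functions. This is a sketch under assumptions: `IdleTickState`, its field names, and both function names are hypothetical, and the bound is read as "the earlier of the nearest deadline and now plus 100 ms".

```rust
/// 100 ms idle housekeeping floor, in nanoseconds.
pub const IDLE_HOUSEKEEPING_FLOOR_NS: u64 = 100_000_000;

/// Snapshot of the conditions listed above (hypothetical shape).
#[derive(Debug, Clone, Copy)]
pub struct IdleTickState {
    pub running_cpl0_idle: bool,
    pub runnable_non_idle: usize,
    pub active_isolation_nohz_record: bool,
    pub local_deferred_cleanup: bool,
    pub cap_enter_polling_dependency: bool,
    pub retry_window_ready_threads: bool,
    pub oneshot_clockevent_available: bool,
    pub monotonic_clocksource_available: bool,
}

/// True only when every true-idle condition holds.
pub fn may_mask_periodic_tick(s: &IdleTickState) -> bool {
    s.running_cpl0_idle
        && s.runnable_non_idle == 0
        && !s.active_isolation_nohz_record
        && !s.local_deferred_cleanup
        && !s.cap_enter_polling_dependency
        && !s.retry_window_ready_threads
        && s.oneshot_clockevent_available
        && s.monotonic_clocksource_available
}

/// Deadline for the replacement one-shot: the nearest waiter deadline, but
/// never earlier than now and never later than now + 100 ms.
pub fn oneshot_deadline_ns(now_ns: u64, nearest_deadline_ns: Option<u64>) -> u64 {
    let floor = now_ns.saturating_add(IDLE_HOUSEKEEPING_FLOOR_NS);
    match nearest_deadline_ns {
        Some(deadline) => deadline.max(now_ns).min(floor),
        None => floor,
    }
}
```

A single unmet condition, such as a cap-enter polling waiter, keeps the periodic tick running.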

Generic full-nohz for ordinary budgeted compute threads carries the
clockevent/deadline substrate into the CPU-isolation state machine and suppresses
ticks only after network polling, IRQ affinity, accounting, deadline, lifetime,
and rollback obligations pass. SQPOLL nohz applies the same substrate to
explicitly leased caller-thread rings once the SQPOLL worker is live and the
single-consumer, owner-lease, wake, and rollback gates pass. Automatic policy
issuance and broader SQPOLL userspace-poller/device-queue admission remain
separate later CPU-isolation features; see
[Tickless and Realtime Scheduling](../proposals/tickless-realtime-scheduling-proposal.md)
and [NO_HZ, SQPOLL, and Realtime Scheduling](../research/nohz-sqpoll-realtime.md).

Exit switches to the kernel PML4 before tearing down the exiting address space,
releases capability authority, completes process waiters, defers final process
teardown until the scheduler is running on another kernel stack, and then
releases remaining thread kernel stacks through the scheduler-owned
`OffStackToken` path before the `Process` value is dropped.

## Invariants

- The idle thread must never block in `cap_enter` or exit.
- Ring dispatch must not hold the scheduler lock.
- Timer dispatch copies current-process user buffers through that process's
  locked `AddressSpace`; it must not rely on a raw current-CR3 validate/use
  window.
- Blocked `cap_enter` waiters wake when enough CQEs are available or their
  finite timeout expires.
- Timer sleep waiters must be bounded per process, tied to the caller
  `ThreadRef` generation, and removed when the caller process exits.
- Runtime-controlled FS bases must stay in user canonical space.
- Direct IPC handoff is a scheduling preference, not a bypass of process
  liveness, generation, or state checks.
- The scheduler must update TSS.RSP0 and the per-CPU syscall kernel RSP
  through `percpu::set_kernel_entry_stack` on each switch.
- Each `PerCpu.current_thread` mirrors that CPU's scheduler current slot; the
  scheduler lock remains the authority for current-thread and queue ownership
  even though dispatch/runnable state is now separate from shared process and
  thread metadata.
- Each live `ThreadRef` may appear in the per-CPU runnable queues at
  most once across all queues, and every per-CPU queue's capacity must
  be reserved up to the live runnable-capable thread count before a new
  process or thread becomes runnable.
- A live generation-checked `ThreadRef` must have at most one runnable
  dispatch owner across per-CPU `current`/`handoff_current` slots, the
  per-CPU runnable queues, and the direct IPC target.
- Queue migration (including the bounded steal path) must be a
  scheduler-lock-contained remove-before-publish transfer; no path may
  publish the same `ThreadRef` twice into any queue or leave a stale
  direct target after exit. Migration must recompute `virtual_finish_ns`
  at the destination and never carry the source's WFQ tag as committed
  state.
- Each per-CPU run queue must remain ordered ascending by
  `virtual_finish_ns` after every enqueue, requeue, or steal-requeue.
  Local selection scans the queue by index for the first
  destination-Runnable entry; RetryLater entries are left in place for
  the next scheduler pass. The bounded steal path scans each sibling
  queue's indices ascending for that queue's first Runnable-for-
  destination entry — because each queue is ordered ascending, the
  first Runnable hit per queue is the lowest `virtual_finish_ns`
  candidate the destination can accept on that source — then picks
  the source queue whose first-Runnable candidate has the lowest
  `virtual_finish_ns` globally, with ties broken by lower CPU id. The
  chosen entry is removed from its actual position on the source
  queue (not necessarily the head).
- Process and thread exit cleanup must assert, before releasing the
  scheduler lock, that the exiting process or thread has no remaining
  entry in any per-CPU runnable queue and no remaining direct IPC
  target slot.
- Timer, unblock, direct-IPC fallback, requeue, and steal-requeue paths
  must use reserved run-queue capacity and avoid allocation.
- Runtime accounting must use the normal monotonic clocksource, not
  benchmark-only cycle counters, and must charge only running intervals.
- FS base is saved and restored across context switches for TLS.
- Thread records remain generation-checked `ThreadRef` identities; exited
  records are retained only while a live handle, pending join, or unjoined
  status can still observe them.
- The final teardown of an exiting process must not release thread kernel
  stacks until another kernel stack is active, and the implicit `Thread::Drop`
  path must not free kernel-stack frames.
- The same generation-checked `ThreadRef` must never run on more than one
  scheduler CPU at once; same-process siblings may run on different scheduler
  CPUs only when their completions route through distinct per-thread ring
  endpoints.
- Park waiters must be keyed by generation-checked `ThreadRef` values, reserve
  one waiter CQE credit, and must not allocate in wait, wake, timeout, or
  process-exit cleanup paths.
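
The run-queue ordering and bounded-steal invariants above can be sketched with `std::collections::VecDeque`. The `Entry` type, its `runnable_on_dest` flag, and both function names are hypothetical; the real scheduler works on `ThreadRef` values under the scheduler lock and relies on reserved capacity, whereas this sketch lets `VecDeque` allocate.

```rust
use std::collections::VecDeque;

/// Illustrative queue entry; `runnable_on_dest` stands in for the
/// destination-specific Runnable / RetryLater decision.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Entry {
    pub thread_id: u32,
    pub virtual_finish_ns: u64,
    pub runnable_on_dest: bool,
}

/// Insert while keeping the queue ascending by `virtual_finish_ns`.
/// Entries with equal tags keep arrival order.
pub fn enqueue_ordered(queue: &mut VecDeque<Entry>, entry: Entry) {
    let idx = queue.partition_point(|e| e.virtual_finish_ns <= entry.virtual_finish_ns);
    queue.insert(idx, entry);
}

/// Pick the sibling queue whose first runnable entry has the lowest tag,
/// breaking ties by lower CPU id, and remove that entry from its position.
pub fn steal(queues: &mut [VecDeque<Entry>], dest_cpu: usize) -> Option<Entry> {
    let mut best: Option<(usize, usize, u64)> = None; // (cpu, index, tag)
    for (cpu, queue) in queues.iter().enumerate() {
        if cpu == dest_cpu {
            continue;
        }
        // The queue is ascending, so the first runnable hit is this queue's
        // lowest-tag candidate for the destination.
        if let Some((idx, e)) = queue.iter().enumerate().find(|(_, e)| e.runnable_on_dest) {
            // Strict `<` keeps the lower CPU id on ties: CPUs are visited ascending.
            if best.map_or(true, |(_, _, tag)| e.virtual_finish_ns < tag) {
                best = Some((cpu, idx, e.virtual_finish_ns));
            }
        }
    }
    let (cpu, idx, _) = best?;
    // Remove before publish: the caller recomputes the tag and enqueues the
    // entry at the destination afterwards.
    queues[cpu].remove(idx)
}
```

Entries that are not runnable for the destination stay in place on the source queue, and the stolen entry may come from the middle of a queue, not only its head.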

## Code Map

- `kernel/src/sched.rs` - shared process table plus `SchedulerDispatch`
  ownership of the per-CPU runnable queues (ordered ascending by
  `virtual_finish_ns`), per-CPU current/handoff slots, idle-thread
  slots, direct IPC target, run-queue reservation accounting, pending
  drops, and pending stack releases; also blocking, wakeups, Timer
  sleep waiters, the bounded steal path, and exit.
- `kernel/src/arch/x86_64/context.rs` - CPU context layout, timer entry/restore,
  tick counter.
- `kernel/src/arch/x86_64/idt.rs` - timer and IPI interrupt handler wiring.
- `kernel/src/arch/x86_64/lapic.rs` - xAPIC MMIO setup, PIT-calibrated LAPIC
  timer, LAPIC EOI, spurious-vector handling, and fixed-IPI send primitive.
- `kernel/src/arch/x86_64/tlb.rs` - serialized vector-49 TLB shootdown request,
  pending flush generations, completion token, and interrupt/user-return drain
  path.
- `kernel/src/arch/x86_64/pic.rs` and `kernel/src/arch/x86_64/pit.rs` - legacy
  PIC remap and PIT fallback setup.
- `kernel/src/arch/x86_64/gdt.rs` - BSP/AP TSS and kernel stack storage.
- `kernel/src/arch/x86_64/syscall.rs` - blocking syscall transition for
  `cap_enter`.
- `kernel/src/arch/x86_64/percpu.rs` - per-CPU syscall stack registry,
  TSS.RSP0 update hook, and current thread storage.
- `kernel/src/arch/x86_64/tls.rs` - FS base save/restore.
- `kernel/src/process.rs` - process state, kernel stacks, the synthetic idle
  process record, and per-thread CPU accounting storage/accessors.

## Validation

- `make run-smoke` validates timer preemption, ring fairness, direct IPC handoff,
  blocked `cap_enter` wakeups, process exit, and clean halt.
- `make run-spawn` validates process wait blocking and child exit completion
  through `ProcessHandle.wait`, Timer monotonic now/sleep completion through
  `timer-smoke`, per-process sleep quota isolation through `timer-flood`, and
  thread/park lifecycle behavior through `thread-lifecycle`.
- `make run-measure` validates the post-thread park blocked/resume timing path
  and process exit while a park waiter is parked.
- `cargo build --features qemu` verifies QEMU-only scheduler and halt paths.
- QEMU smoke output for IPC includes direct handoff diagnostics when the server
  is woken from a blocked RECV.

## Open Work

- Prove SQPOLL/poller progress that does not depend on periodic scheduler
  ticks before automatic nohz activation. Then implement tickless idle only for
  no-runnable-work CPU idle. Keep runnable contention on periodic preemption
  until the activation proof closes the remaining network polling, IRQ
  affinity, and housekeeping dependencies.
- Keep SMP gated behind per-CPU scheduler state and a review of any path that
  needs page pinning beyond the `AddressSpace`-locked copy/read contract.
- Implement the remaining SMP Phase C slices: split shared scheduler metadata,
  replace the temporary scheduler-owner mask, and collect accepted benchmark
  evidence.
- Add priority or policy scheduling only once the current authority and IPC
  semantics have stabilized.
- Add service restart policy outside the static boot graph.
