Scheduling
Scheduling decides which thread runs, preserves CPU state across preemption and blocking, and integrates capability-ring progress with process-owned execution resources.
Current Behavior
The scheduler stores shared process/thread metadata in
Scheduler::processes: BTreeMap<Pid, Process>. Dispatch-owned runnable state
lives in SchedulerDispatch: a per-CPU run_queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS] array ordered ascending by Thread.virtual_finish_ns,
per-CPU current and handoff_current slots, idle-thread slots, the
direct-IPC target preference, run-queue reservation accounting, and
deferred drop/stack release slots.
Each live thread has at most one queued owner across all per-CPU queues
combined, and every per-CPU queue reserves capacity up to the live
runnable-capable thread count before a new thread is published as
runnable, so later timer, unblock, requeue, and steal-requeue paths do
not allocate. The shared live-reservation count is released when
processes or threads exit or when pre-publication reservation is rolled
back. Reserving each queue to the full live-thread count is required
because the bounded steal path may migrate every live thread into a
single sibling queue between two scheduler passes.
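The reservation rule above can be sketched as follows. This is a minimal illustration, not the kernel's code: `Dispatch`, `reserve_for_new_thread`, `release_reservation`, and `OutOfMemory` are hypothetical names, and `ThreadRef` is a stand-in for the real type. The point is that the only allocating step happens before publication, so later enqueues on any CPU fit inside already-reserved capacity.

```rust
use std::collections::VecDeque;

const SCHEDULER_CPUS: usize = 4;

/// Hypothetical stand-in for the document's ThreadRef.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef(pub u32);

/// Hypothetical error for a failed reservation.
#[derive(Debug)]
pub struct OutOfMemory;

pub struct Dispatch {
    pub run_queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS],
    /// Shared count of live runnable-capable threads holding a reservation.
    pub live_reserved: usize,
}

impl Dispatch {
    pub fn new() -> Self {
        Self {
            run_queues: std::array::from_fn(|_| VecDeque::new()),
            live_reserved: 0,
        }
    }

    /// Grow every per-CPU queue to hold all live threads plus the new one.
    /// This is the only step that may allocate; it runs before publication.
    pub fn reserve_for_new_thread(&mut self) -> Result<(), OutOfMemory> {
        let wanted = self.live_reserved + 1;
        for q in self.run_queues.iter_mut() {
            // try_reserve takes the number of slots needed beyond len().
            let additional = wanted.saturating_sub(q.len());
            q.try_reserve(additional).map_err(|_| OutOfMemory)?;
        }
        self.live_reserved = wanted;
        Ok(())
    }

    /// Release on thread/process exit or on pre-publication rollback.
    pub fn release_reservation(&mut self) {
        self.live_reserved = self.live_reserved.saturating_sub(1);
    }

    /// Hot-path enqueue: stays within reserved capacity, so it does not allocate.
    pub fn enqueue(&mut self, cpu: usize, t: ThreadRef) {
        let q = &mut self.run_queues[cpu];
        debug_assert!(q.len() < q.capacity());
        q.push_back(t);
    }
}
```

Because every queue is sized to the full live count, the worst case named above (all live threads migrating into one sibling queue) still fits without growing the buffer.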
Phase D accepted its Task 6 diagnostic closeout at commit 77caafc0
(2026-05-10 19:39 UTC, docs(scheduler): record phase d thread-scale gate)
and closed in docs commit 1a08ec23 (2026-05-10 21:47 UTC,
docs(scheduler): close phase d). The accepted
state is the WFQ scheduler described here: per-thread weights and
latency classes are mutated only through SchedulingPolicyCap, each
per-CPU runnable queue is ordered by freshly derived
virtual_finish_ns, migration preserves virtual_runtime_ns, and
bounded stealing selects the most-overdue runnable sibling candidate.
The controlled Task 6 benchmark pair on capos-bench recorded capOS
1-to-4 work/total speedups of 3.088x / 2.700x, versus 1.566x / 1.538x for
the previous single-global-queue baseline; the matching Linux pthread
baseline on the same host, on physical-core logical CPUs 0,1,2,3,
recorded 3.974x / 3.850x. The host harness enforced the
configured 1-to-2 work/total gates; the 1-to-4 row was accepted manually
from the recorded diagnostics. Phase E
SchedulingContext is the next scheduler authority phase; EEVDF is a
follow-on ordering-policy evaluation rather than a Phase D blocker.
Phase D Task 3 (2026-05-07) restored the per-CPU runnable queues that
the 2026-05-02 collapse retired and gave them the WFQ ordering Task 2’s
virtual_finish_ns was prepared for. Newly created processes and
threads publish onto the creating scheduler CPU’s per-CPU queue; the
bounded steal path balances the queues when other CPUs run out of local
work. The publish-time placement is intentionally simple in this slice
— “place locally, let steal balance” — and a more sophisticated
caller-aware spread or least-loaded scan is a milestone-gate follow-up,
not a Task 3 acceptance requirement. Wake policy carries
WakePolicy::QueueCpu(u32) for endpoint, timer, park, process-wait,
thread-join, and process-spawn completions so the wake target matches
the queue placement, and DirectTarget keeps its original direct-IPC
handoff role. The transitional CAPOS_SCHED_DISABLE_WFQ=1 /
WakePolicy::QueueAny fallback was removed ahead of Phase E
SchedulingContext schema work.
wake_idle_scheduler_cpus_locked first probes the placement target
when the policy is QueueCpu, then walks eligible idle scheduler CPUs
and wakes the first that accepts a fresh reschedule IPI, skipping CPUs
that already have a pending IPI so a burst of ready work cross-wakes
more than one neighbor instead of stranding the rest behind one
already-targeted CPU.
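A minimal sketch of that wake order, assuming a simplified per-CPU state with only an idle flag and a pending-IPI flag. `CpuState` and `wake_idle_cpu` are hypothetical names; the real function is wake_idle_scheduler_cpus_locked and its eligibility checks are richer than this.

```rust
/// Hypothetical per-CPU wake state.
#[derive(Clone, Copy, Debug)]
pub struct CpuState {
    pub idle: bool,
    pub ipi_pending: bool,
}

/// Probe the placement target first, then walk the remaining CPUs in id order.
/// Returns the CPU that accepted a fresh reschedule IPI, if any.
pub fn wake_idle_cpu(cpus: &mut [CpuState], placement: Option<usize>) -> Option<usize> {
    let n = cpus.len();
    let order = placement.into_iter().filter(|&c| c < n).chain(0..n);
    for cpu in order {
        let c = &mut cpus[cpu];
        // A CPU with an IPI already in flight is skipped, not re-targeted.
        if c.idle && !c.ipi_pending {
            c.ipi_pending = true;
            return Some(cpu);
        }
    }
    None
}
```

Calling this once per newly ready thread spreads a burst across distinct idle CPUs: the second wake aimed at the same placement target falls through to the next idle neighbor.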
Ring SQ Consumer Ownership
Each ring endpoint has kernel-owned SQ-consumer metadata outside the writable
userspace ring page. cap_enter and the bounded timer-side current-thread ring
service both acquire a syscall-mode owner lease before calling
process_ring(). The lease carries a nonzero generation and owner identity;
process_ring() verifies that generation before flushing deferred ring work or
advancing SQ head, and stale owners return StaleSqConsumer without consuming
the head SQE. Duplicate owners fail closed as a retryable busy cap_enter
status.
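The generation-checked lease can be illustrated with a small model. The type and method names here (`SqConsumer`, `SqLease`, `acquire`, `process_one`) are assumptions for illustration; only StaleSqConsumer and the busy outcome come from the text above.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SqError {
    /// A second owner tried to acquire while a lease is live (retryable).
    Busy,
    /// The presented lease no longer matches the current owner generation.
    StaleSqConsumer,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SqLease {
    owner: u32,
    generation: u64,
}

/// Kernel-owned SQ-consumer metadata, kept outside the userspace ring page.
pub struct SqConsumer {
    next_generation: u64, // starts at 1 so every lease generation is nonzero
    current: Option<SqLease>,
    pub sq_head: u32,
}

impl SqConsumer {
    pub fn new() -> Self {
        Self { next_generation: 1, current: None, sq_head: 0 }
    }

    pub fn acquire(&mut self, owner: u32) -> Result<SqLease, SqError> {
        if self.current.is_some() {
            return Err(SqError::Busy);
        }
        let lease = SqLease { owner, generation: self.next_generation };
        self.next_generation += 1;
        self.current = Some(lease);
        Ok(lease)
    }

    pub fn release(&mut self, lease: SqLease) {
        if self.current == Some(lease) {
            self.current = None;
        }
    }

    /// Consume one SQE only while the lease is still the current owner.
    /// A stale lease returns an error and leaves sq_head untouched.
    pub fn process_one(&mut self, lease: SqLease) -> Result<u32, SqError> {
        if self.current != Some(lease) {
            return Err(SqError::StaleSqConsumer);
        }
        self.sq_head = self.sq_head.wrapping_add(1);
        Ok(self.sq_head)
    }
}
```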
CQ publication remains independent of SQ ownership. Already accepted completions stay visible through CQ head/tail even after the SQ owner releases, and thread/process teardown releases any live SQ owner before ring unmapping or record drop without clearing accepted CQEs.
Bounded SQPOLL ring mode
Phase F adds a bounded SQPOLL mode for the caller thread’s ring through
CpuIsolationLease with allowedMode = kernelSqpoll and namedRing = callerThread. The transition is explicit: syscall-owned dispatch may request
SQPOLL start while it still owns the SQ, then releases its generation-checked
owner; the poller finalizes into SqpollRunning, may publish
NEED_WAKEUP and enter SqpollSleeping, wakes back to running when a producer
publishes a new SQ tail, and stops or rolls back on lease revoke, cap release,
teardown, or failed start. Timer-side syscall-mode ring service fails closed
while SQPOLL owns the same endpoint, so no second SQ consumer can advance the
SQ head.
The Phase F poller runs from the periodic scheduler service path and from a
bounded current-thread syscall service entry used for SQPOLL producer wakes and
explicit syscall kicks. Both entries borrow the SQPOLL owner lease rather than
acquiring syscall SQ ownership. The current default admits two SQEs per
selected SQPOLL worker, and a worker is not reselected within the same
periodic service pass or syscall service entry. Poller elapsed time is charged
to the admitted scheduler ledger or scheduling-context target. The wake/sleep
protocol uses a shared ring flag: the poller
publishes NEED_WAKEUP, performs a full ordering barrier, and rechecks SQ
tail before sleeping; producers publish initialized SQEs, store SQ tail with a
barrier, and enter the kernel if NEED_WAKEUP is visible. A cap_enter
producer wake that finds SQPOLL already owns SQ head can run one bounded SQPOLL
batch, return visible CQ availability when the requested threshold is
satisfied, preserve ordinary blocked-current-thread and thread-owned-head
results, and otherwise fail closed as a retryable busy result. Stale owner
generations fail before deferred ring work or SQE start. If teardown requests
stop after a live owner has already accepted a SQE, the poller still publishes
SQ head for that accepted SQE before releasing ownership, preserving accepted
CQEs without leaving work replayable by syscall mode. The focused
make run-scheduler-generic-sqpoll-nohz proof admits this explicit
ring-coupled shape into SQPOLL nohz, drives producer wake and bounded service
progress without depending on a periodic tick, then rolls back on stale
owner/lease revoke. Policy-service automatic nohz, broader
userspace-poller/device-queue admission, and production realtime admission
remain future work.
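The wake/sleep handshake described above follows the usual flag-then-recheck pattern. The sketch below models it with standard atomics; `RingShared`, `poller_try_sleep`, and `producer_publish` are hypothetical names, the flag value is an assumption, and sequentially consistent ordering is used for simplicity rather than as a claim about the kernel's exact barriers.

```rust
use std::sync::atomic::{fence, AtomicU32, Ordering};

/// Assumed bit value for the shared ring flag.
pub const NEED_WAKEUP: u32 = 1;

pub struct RingShared {
    pub sq_tail: AtomicU32,
    pub flags: AtomicU32,
}

/// Poller side. Returns true when it is safe to sleep, false when a producer
/// raced in new work and the poller must keep running.
pub fn poller_try_sleep(ring: &RingShared, seen_tail: u32) -> bool {
    ring.flags.fetch_or(NEED_WAKEUP, Ordering::SeqCst);
    fence(Ordering::SeqCst); // full barrier between flag publish and tail recheck
    if ring.sq_tail.load(Ordering::SeqCst) != seen_tail {
        ring.flags.fetch_and(!NEED_WAKEUP, Ordering::SeqCst);
        return false;
    }
    true
}

/// Producer side, called after the SQEs themselves are initialized.
/// Returns true when the producer must enter the kernel to wake the poller.
pub fn producer_publish(ring: &RingShared, new_tail: u32) -> bool {
    ring.sq_tail.store(new_tail, Ordering::SeqCst);
    fence(Ordering::SeqCst);
    (ring.flags.load(Ordering::SeqCst) & NEED_WAKEUP) != 0
}
```

Either the poller's recheck observes the new tail, or the producer observes NEED_WAKEUP; the barriers on both sides rule out the case where both miss.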
Per-CPU run queue ordering structure
Each per-CPU VecDeque<ThreadRef> is kept ordered ascending by
Thread.virtual_finish_ns. Enqueue performs an ordered insert via a
linear scan from the front; selection scans the queue by index for
the first destination-Runnable entry (via
pop_first_runnable_local_locked), removes Drop entries it walks
past, and leaves RetryLater entries undisturbed for the next
scheduler pass. Because the queue is ordered ascending, the
first Runnable hit is also the lowest-virtual_finish_ns candidate
the destination CPU can accept (the most overdue against fair share
that this CPU is allowed to run). Linear-scan insert is O(n) per
enqueue;
with SCHEDULER_CPUS = 4 and bounded thread counts in this slice the
constant is small enough to defer a smarter structure (sorted bucket
arrays, intrusive trees) until benchmark evidence shows it dominates
scheduler-lock hold time. Promoting to a smarter structure is a
follow-up under this plan if the Task 6 milestone gate proves the
need.
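A minimal sketch of the linear-scan ordered insert, with a simplified `Entry` in place of ThreadRef. Placing a new entry after existing entries with an equal tag (FIFO among ties) is an assumption of this sketch; the text above does not specify tie order within one queue.

```rust
use std::collections::VecDeque;

/// Simplified queue entry; the real queue stores ThreadRef.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Entry {
    pub tid: u32,
    pub virtual_finish_ns: u64,
}

/// Insert `e` so the queue stays ascending by virtual_finish_ns.
/// O(n) scan from the front; equal tags keep arrival order (assumption).
pub fn ordered_insert(q: &mut VecDeque<Entry>, e: Entry) {
    let pos = q
        .iter()
        .position(|x| x.virtual_finish_ns > e.virtual_finish_ns)
        .unwrap_or(q.len());
    q.insert(pos, e);
}
```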
virtual_finish_ns is recomputed on every enqueue from the thread’s
current virtual_runtime_ns, weight, and latency_class; it is
never carried as committed state across blocking, and migrations
between per-CPU queues recompute it at the destination so the
destination’s view of fair-share progress applies. The derivation rule
per latency class is documented in capos-abi/src/scheduler.rs and
the “Latency-class semantics for Phase D” section of
docs/proposals/scheduler-evolution-proposal.md.
Bounded steal path
When a CPU’s local queue has no immediately runnable entry the
scheduler walks sibling per-CPU queues. For each sibling queue the
scan walks indices ascending and selects that queue’s first entry
that the destination CPU considers Runnable; because each queue is
ordered ascending by virtual_finish_ns, the first Runnable hit is
also the lowest virtual_finish_ns candidate available to the
destination on that source queue. The steal then picks the source
queue whose first-Runnable candidate has the lowest
virtual_finish_ns overall, with ties broken by lower CPU id. The
chosen entry is removed from its current position in the source
queue (not necessarily the head: a RetryLater or single-CPU-owner
thread may sit at the source’s front and stay there), the WFQ tag is
recomputed at the destination, and the entry is inserted at the
destination’s ordered position. The destination queue is reserved to
the full live-thread count, so the steal-requeue is allocation-free.
The scan walks at most SCHEDULER_CPUS * max_queue_len
entries, but in practice each sibling scan stops at the first
Runnable candidate per queue.
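The source-queue selection can be sketched as below. Two simplifications are assumptions of the sketch: runnability is stored on the entry (in the text it is the destination CPU's judgment of each entry), and the destination-side tag recompute and ordered insert are left out. `pick_steal` and `steal` are hypothetical names.

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Verdict {
    Runnable,
    RetryLater,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Entry {
    pub tid: u32,
    pub virtual_finish_ns: u64,
    pub verdict: Verdict,
}

/// Returns (source cpu, index in that queue) of the steal candidate.
pub fn pick_steal(queues: &[VecDeque<Entry>], dest_cpu: usize) -> Option<(usize, usize)> {
    let mut best: Option<(u64, usize, usize)> = None;
    for (cpu, q) in queues.iter().enumerate() {
        if cpu == dest_cpu {
            continue;
        }
        // Queues are ascending, so the first Runnable hit is that queue's best.
        if let Some(idx) = q.iter().position(|e| e.verdict == Verdict::Runnable) {
            let tag = q[idx].virtual_finish_ns;
            // Strict `<` keeps the lower CPU id on ties (CPUs are visited ascending).
            if best.map_or(true, |(b, _, _)| tag < b) {
                best = Some((tag, cpu, idx));
            }
        }
    }
    best.map(|(_, cpu, idx)| (cpu, idx))
}

/// Remove the chosen entry from wherever it sits in the source queue.
pub fn steal(queues: &mut [VecDeque<Entry>], dest_cpu: usize) -> Option<Entry> {
    let (src, idx) = pick_steal(queues, dest_cpu)?;
    queues[src].remove(idx)
}
```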
RetryLater semantics in the local scan
The local pop scan walks the per-CPU queue by index instead of
popping the front and re-pushing RetryLater candidates. Re-pushing a
RetryLater entry whose virtual_finish_ns has not changed would
ordered-insert it back at the same head position, so a naive
pop-then-requeue loop would re-pop the same RetryLater head every
iteration and starve runnable entries behind it. The index scan
removes Drop entries in place, leaves RetryLater entries undisturbed
for the next scheduler pass to re-evaluate, and returns the first
Runnable candidate it finds. The bounded steal path uses the same
index scan on the destination queue after a steal so a stolen
RetryLater entry does not get re-popped in the same dispatch pass.
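The index scan can be sketched generically. `pop_first_runnable` and the `classify` callback are hypothetical; the callback stands in for the destination CPU's per-entry check, and the real function is pop_first_runnable_local_locked.

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Verdict {
    Runnable,
    RetryLater,
    Drop,
}

/// Walk the queue by index: return the first Runnable entry, remove Drop
/// entries that are walked past, and leave RetryLater entries in place.
pub fn pop_first_runnable<T>(
    q: &mut VecDeque<T>,
    classify: impl Fn(&T) -> Verdict,
) -> Option<T> {
    let mut i = 0;
    while i < q.len() {
        match classify(&q[i]) {
            Verdict::Runnable => return q.remove(i),
            // Do not advance: the next entry has shifted into index i.
            Verdict::Drop => {
                q.remove(i);
            }
            Verdict::RetryLater => i += 1,
        }
    }
    None
}
```

Since RetryLater entries are never popped and reinserted, a blocked head cannot hide the runnable entries behind it.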
Phase E preflight fallback cleanup
The one-bisect-cycle CAPOS_SCHED_DISABLE_WFQ=1 opt-out has been
removed. Enqueues always target the selected per-CPU WFQ queue, and
wake-up sites always carry WakePolicy::QueueCpu(slot) for queued
work. Phase E SchedulingContext work therefore starts from the
accepted Phase D WFQ behavior rather than from a source-level
single-global-queue fallback.
Phase E Task 1: scheduling-context object shape
The first SchedulingContext slice is info-only: schema, config,
runtime, and kernel code expose SchedulingContext.info() and a
bootstrap grant shape, but add no dispatcher enforcement, replenishment,
donation/return, depletion notification, realtime island, SQPOLL, or
nohz behavior. SchedulingContextSpec.cpuMask uses the canonical
little-endian bitset defined in schema/capos.capnp: CPU n maps to
bit n % 8 of byte n / 8, with bit 0 as the least-significant bit
of that byte. Empty data means no CPUs are selected rather than all
CPUs. Producers omit trailing zero bytes, so the all-zero set’s
canonical form is empty and any non-empty canonical mask ends with a
nonzero byte.
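The encoding rule is small enough to show directly. The helper names below (`encode_cpu_mask`, `contains_cpu`, `is_canonical`) are hypothetical and not taken from schema/capos.capnp; only the bit layout and the canonical-form rule come from the text.

```rust
/// Encode CPU indices as the canonical little-endian bitset:
/// CPU n sets bit (n % 8) of byte (n / 8), bit 0 being least significant.
pub fn encode_cpu_mask(cpus: &[u32]) -> Vec<u8> {
    let mut bytes: Vec<u8> = Vec::new();
    for &cpu in cpus {
        let byte = (cpu / 8) as usize;
        if bytes.len() <= byte {
            // Bytes are only added for a set bit, so the last byte is never zero.
            bytes.resize(byte + 1, 0);
        }
        bytes[byte] |= 1u8 << (cpu % 8);
    }
    bytes
}

/// Membership test; bytes past the end of the mask count as zero.
pub fn contains_cpu(mask: &[u8], cpu: u32) -> bool {
    mask.get((cpu / 8) as usize)
        .map_or(false, |&b| (b & (1u8 << (cpu % 8))) != 0)
}

/// Canonical masks have no trailing zero byte; the empty mask is canonical.
pub fn is_canonical(mask: &[u8]) -> bool {
    mask.last().map_or(true, |&b| b != 0)
}
```

For example, CPUs 0 and 3 encode to the single byte 0x09, and CPU 9 alone encodes to the two bytes 0x00 0x02.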
Phase E Task 2: bind, revoke, and generation identity
The second SchedulingContext slice adds the first bounded authority
lifecycle. SchedulingContext.create()
creates a same-interface result cap for a validated spec, bindCallerThread()
records one caller-thread binding for the current context generation, and
revoke() advances the generation and clears the matching thread metadata
binding. Bootstrap-granted contexts and contexts returned by create() use the
same non-wrapping context-id allocator; the binding identity remains
(contextId, generation), but distinct cap objects no longer share bootstrap
ids. Stale caps report staleGeneration and cannot create, bind, or revoke
scheduler metadata for a new generation; already-revoked contexts report
revoked. Release cleanup clears only a thread metadata binding that matches
the released cap identity.
Phase E: SchedulingContext budget enforcement
make run-scheduling-context is the focused Phase E QEMU proof. It
starts one process with two independently granted bootstrap contexts, verifies
their identities cannot alias, adopts a created result cap, drives bind/revoke
and stale-generation calls, confirms release cleanup by rebinding after the
released cap drops, and now checks the first dispatcher budget behavior.
bindCallerThread() installs a fixed budget ledger in the caller thread’s
scheduler metadata. Runtime charge decrements that ledger at the same
scheduler-lock-contained points that update per-thread runtime/vruntime.
Runnable selection replenishes elapsed periods and treats exhausted bound
contexts as RetryLater until their next period, leaving the queued owner in
place rather than allocating or moving emergency-path state. Stale or revoked
contexts still fail closed before mutating scheduler metadata or accounting.
The current enforcement granularity is the existing periodic scheduler tick:
a running thread may overshoot its budget by the current tick quantum before
the next dispatch charge throttles it. The smoke therefore proves bounded
dispatcher behavior, not nohz/SQPOLL activation or hard realtime admission. It
prints dispatch_effect=budgetEnforced, visible budget charge, replenishment
to full budget after a period, and a throttled wall-clock window.
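A minimal sketch of a fixed budget ledger with charge, replenishment, and RetryLater throttling. The field names and the replenishment rule (refill to the full budget at each period boundary, with no carry-over) are assumptions of this sketch rather than the kernel's exact accounting.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Runnable,
    RetryLater,
}

/// Hypothetical per-thread ledger installed at bind time.
pub struct BudgetLedger {
    pub budget_ns: u64,
    pub period_ns: u64,
    pub remaining_ns: u64,
    pub next_replenish_ns: u64,
}

impl BudgetLedger {
    pub fn new(budget_ns: u64, period_ns: u64, now_ns: u64) -> Self {
        Self {
            budget_ns,
            period_ns,
            remaining_ns: budget_ns,
            next_replenish_ns: now_ns.saturating_add(period_ns),
        }
    }

    /// Charge runtime at a dispatch boundary. A tick-sized overshoot simply
    /// saturates at zero instead of going negative.
    pub fn charge(&mut self, ran_ns: u64) {
        self.remaining_ns = self.remaining_ns.saturating_sub(ran_ns);
    }

    /// Runnable selection: replenish if a period boundary has passed, then
    /// classify. An exhausted context stays queued as RetryLater.
    pub fn select(&mut self, now_ns: u64) -> Verdict {
        if self.period_ns > 0 && now_ns >= self.next_replenish_ns {
            let periods = (now_ns - self.next_replenish_ns) / self.period_ns + 1;
            self.next_replenish_ns = self
                .next_replenish_ns
                .saturating_add(periods.saturating_mul(self.period_ns));
            self.remaining_ns = self.budget_ns;
        }
        if self.remaining_ns == 0 {
            Verdict::RetryLater
        } else {
            Verdict::Runnable
        }
    }
}
```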
Phase F: CpuIsolationLease and automatic nohz activation
CpuIsolationLease is a separate authority surface from
SchedulingContext CPU-time budget enforcement. The scaffold records owner
identity, allowed CPU set, allowed isolation mode, live accounting target
reference, housekeeping exclusions, maximum revocation latency, and generation
identity. It rejects stale generations, duplicate or overlapping active leases,
fabricated or stale SchedulingContext accounting targets, malformed CPU masks,
and lease sets that would leave no online scheduler housekeeping CPU outside
the globally admitted active lease CPUs.
The scheduler-side preflight reports a bounded nohz activation/deactivation
decision surface: lease identity, target CPU mask, target runnable entity
count, active housekeeping CPU availability after subtracting all active lease
CPUs, selected housekeeping CPU mask, deferred cleanup, timer/deadline,
network polling, IRQ-affinity, accounting-target, monotonic
clocksource/accounting readiness, one-SQ-consumer, revocation latency,
rollback, and periodic-fallback labels. The accepted QEMU proof uses -smp 4
so an active lease can report ready housekeeping CPUs outside the target CPU,
selected housekeeping placement, and exactly one runnable caller on that
target CPU.
The clockevent/deadline substrate uses a calibrated TSC-backed monotonic
clocksource on normal QEMU/x86_64, with the periodic LAPIC tick disciplining
the TSC epoch so QEMU guest halt windows cannot stall wall-clock progress.
Timer.sleep, finite cap_enter, and park timeouts store absolute monotonic
deadline_ns values, and the LAPIC clockevent backend can program a bounded
one-shot deadline and restore periodic mode.
Automatic nohz activation state machine
When the preflight finds every proof obligation satisfied – a single
runnable entity on the target CPU, a ready housekeeping CPU outside the lease,
no local deferred-cleanup/timer dependency, a valid accounting target, a live
monotonic clocksource, a non-stale one-SQ-consumer when a ring is named, a
bounded revocation latency, and the lease’s allowedCpuMask naming exactly
one scheduler-owned CPU – it performs real per-CPU periodic-tick
suppression for that narrow single-runnable window. The target CPU may be
the CPU running the preflight call (local activation) or a different
scheduler CPU (remote-CPU activation via a reschedule IPI – see Remote-CPU
activation below). The single-runnable shape differs by target: a local
activation requires the caller itself to be that single entity
(exactly-one-runnable-caller); a remote activation requires the target
CPU’s single runnable entity to be some thread pinned there, not the caller
(which runs on a different CPU – exactly-one-runnable-remote-target).
- Admission gates. Two lease shapes can be admitted for tick suppression:
  a pure namedRing = none compute lease, and a ring-coupled
  allowedMode = kernelSqpoll lease whose bound ring is being actively driven
  by a live SQPOLL consumer.
  - Compute lease (namedRing = none). Declares no local network/IRQ
    dependency, so the read-only network-polling and IRQ-affinity admission
    gates pass.
  - Ring-coupled SQPOLL lease (allowedMode = kernelSqpoll,
    namedRing = callerThread). The lease’s declared kernel-polled work IS the
    bounded SQPOLL ring poller, which the scheduler keeps progressing through
    cap_enter/producer-wake even while the periodic tick is masked. The
    preflight admits it only when the bound ring is in SQPOLL running/sleeping
    mode with a non-stale Sqpoll owner; the one-SQ-consumer label is then
    blocked-sqpoll-owner (the worker owns the ring). The preflight ring-state
    read is a best-effort hint – it never takes the per-ring lock inside the
    scheduler lock (it uses try_lock, and a contended snapshot does not admit
    activation). The decisive disqualifier is the IPI/timer re-check below.
  - A namedRing = callerThread lease that is not kernelSqpoll
    (compute-with-ring) keeps the conservative refusal until network polling
    and IRQ affinity are routed to a housekeeping CPU, as does any
    device-owning mode. The kernel still services virtio RX/TX and Interrupt
    waiters inline from the periodic scheduler path.
- Activate. The preflight masks the periodic LAPIC timer on the current
  CPU and arms a one-shot deadline at
  min(nearest pending timer wakeup, now + max revocation latency). The CPU
  now runs on a bounded one-shot deadline instead of the periodic tick. The
  eligible lease generation is registered so revoke/cleanup paths can stale
  it.
- Re-check. On every timer interrupt and on every reschedule IPI the
  handler re-checks the activation window before the scheduler picks the next
  thread. The reschedule-IPI handler also drains any pending remote-CPU
  activation request parked for this CPU (the IPI vector is shared with the
  remote-activation path – see Remote-CPU activation below), and the
  periodic timer handler drains it too as a backstop.
  An unchanged eligible window re-arms the bounded one-shot deadline;
  a reschedule IPI (the prompt signal that another CPU woke runnable work onto
  this CPU) drives an immediate rollback. The re-check runs in interrupt
  context and uses try_lock to avoid deadlocking against a held scheduler
  lock. Armed-timer invariant: the masked-periodic one-shot does not
  auto-rearm, so a timer-interrupt re-check NEVER returns leaving a tickless
  CPU without an armed timer – on scheduler-lock contention it arms a bounded
  minimum-delta fallback one-shot (or restores the periodic tick) before
  returning. A lock-free per-CPU nohz-active bitmask lets the contention path
  distinguish a tickless CPU (the consumed timer was the nohz one-shot and
  must be replaced) from a normal CPU (the periodic tick auto-rearms). A
  reschedule IPI does not consume the one-shot, so its contention skip is
  safe – the still-armed one-shot bounds the next re-check.
- Rollback. Any disqualifying change rolls the CPU back to the periodic
  LAPIC tick first, before any further ordinary work: a stale lease
  generation (explicit revoke, process exit, service replacement, session
  logout), a second runnable entity or stealable sibling work on the target
  CPU, a local deferred-cleanup dependency, a direct-IPC target becoming
  runnable, a target-CPU mismatch, or a one-shot backend that can no longer
  arm a deadline. For a ring-coupled SQPOLL activation the re-check also
  carries a sqpoll-ring-mode-changed-or-owner-staled disqualifier (the bound
  ring leaving SQPOLL running/sleeping mode or its owner staling); that
  re-check runs under the scheduler lock and uses try_lock on the per-ring
  lock, so a contended ring is treated as disqualifying (fail-closed –
  restore the periodic tick rather than keep a CPU tickless on an
  unverifiable ring). That SQPOLL ring-mode branch is defense-in-depth,
  currently subsumed by lease-generation staling: every reachable SQPOLL-stop
  path today (stop_sqpoll_for_lease/stop_sqpoll_if_owned) is a
  revoke/cleanup-path caller that also stales the lease, and
  stale-lease-generation is checked first – so the lease-generation stale is
  the load-bearing SQPOLL rollback trigger in practice. The SQPOLL ring-mode
  branch becomes independently load-bearing, and would then need its own
  proof, only if a future change introduces a SQPOLL-stop path that keeps the
  lease live.
Runtime accounting stays boundary/counter driven and monotonic, so
suppressing the tick never strands SchedulingContext budget charging.
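The one-shot deadline in the Activate step is a plain bounded minimum. A minimal sketch, assuming all times are absolute monotonic nanoseconds and that "no pending timer" is modelled as None; the function name is hypothetical.

```rust
/// min(nearest pending timer wakeup, now + max revocation latency).
/// With no pending timer, the revocation-latency bound alone applies.
pub fn nohz_one_shot_deadline(
    now_ns: u64,
    nearest_timer_ns: Option<u64>,
    max_revocation_latency_ns: u64,
) -> u64 {
    let bound = now_ns.saturating_add(max_revocation_latency_ns);
    match nearest_timer_ns {
        Some(t) => t.min(bound),
        None => bound,
    }
}
```

The revocation-latency term is what keeps a tickless CPU reachable: even with no timers queued, the CPU re-checks its activation window within that bound.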
Remote-CPU activation
Masking the periodic LAPIC tick and arming the one-shot deadline are per-CPU
operations – only the target CPU can program its own LAPIC timer. When the
preflight runs on CPU A but the lease’s single-CPU allowedCpuMask targets a
different CPU B, the kernel does not refuse: it parks a bounded
remote-activation request in CPU B’s per-CPU slot and sends a
reschedule-style IPI to CPU B. CPU B drains the request from its IPI handler
(and from its periodic timer handler as a backstop), re-runs the full
disqualification check locally under its own scheduler-lock acquisition,
and only then arms its own one-shot deadline. A remote activation is never
trusted blind – the preflight’s eligibility snapshot was taken on a
different CPU and may be stale by the time the IPI is drained, so the target
CPU re-checks before committing. The relevant invariants:
- Bounded request slot, no nesting. The pending-request store is a fixed
  [Option<_>; SCHEDULER_CPUS] array – one single-entry slot per CPU, so it
  can never grow unbounded. If a slot already holds an undrained request, a
  new preflight fails closed (rejected) rather than queuing behind it. The
  IPI-context drain never nests the scheduler lock: it takes only the small
  per-CPU slot mutex, then calls the activation in try_lock mode.
- Contention retry. If the IPI-context drain finds the scheduler lock
  contended, it leaves the request parked and returns; the target CPU’s next
  periodic timer tick (still live – the tick has not been suppressed) retries
  the drain. Progress is bounded by the periodic tick the same way the
  existing local re-check contention path is.
- Fail-closed IPI ordering. A remote rollback (rollback_nohz_for_lease)
  stales the lease generation before clearing the activation record. The
  drain re-checks the generation before arming, so a rollback that races the
  drain fails closed (the request is dropped, the periodic tick stays live).
  If the drain already committed before the rollback cleared the record, the
  target CPU’s next nohz_recheck sees the nohz-active bit set with no record
  and restores its periodic tick. Either ordering converges on the periodic
  tick.
- Compute-only. Remote-CPU activation is limited to namedRing = none compute
  leases in this slice. A ring-coupled SQPOLL lease whose target differs from
  its ring owner’s CPU is not an admitted shape; it fails closed.
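The slot invariants above can be modelled in a few lines. This sketch makes several assumptions: each slot is wrapped in its own std Mutex as a stand-in for the per-CPU slot mutex, the scheduler try_lock outcome is passed in as a boolean, and `RemoteSlots`, `RemoteActivation`, and `Drain` are hypothetical names.

```rust
use std::sync::Mutex;

const SCHEDULER_CPUS: usize = 4;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct RemoteActivation {
    pub lease_id: u64,
    pub lease_generation: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Drain {
    Empty,
    /// Scheduler lock contended: request stays parked for the next tick.
    LeftParked,
    /// Lease generation moved on: request dropped, periodic tick stays live.
    DroppedStale,
    /// Target CPU may arm its own one-shot deadline.
    Armed(RemoteActivation),
}

pub struct RemoteSlots {
    slots: [Mutex<Option<RemoteActivation>>; SCHEDULER_CPUS],
}

impl RemoteSlots {
    pub fn new() -> Self {
        Self { slots: std::array::from_fn(|_| Mutex::new(None)) }
    }

    /// Park a request for `cpu`. An occupied slot rejects the new request
    /// instead of queuing behind the undrained one.
    pub fn park(&self, cpu: usize, req: RemoteActivation) -> Result<(), RemoteActivation> {
        let mut slot = self.slots[cpu].lock().unwrap();
        if slot.is_some() {
            return Err(req);
        }
        *slot = Some(req);
        Ok(())
    }

    /// Drain on the target CPU, re-checking the lease generation locally.
    pub fn drain(&self, cpu: usize, current_generation: u64, sched_lock_free: bool) -> Drain {
        let mut slot = self.slots[cpu].lock().unwrap();
        let Some(req) = *slot else { return Drain::Empty };
        if !sched_lock_free {
            return Drain::LeftParked;
        }
        *slot = None;
        if req.lease_generation != current_generation {
            Drain::DroppedStale
        } else {
            Drain::Armed(req)
        }
    }
}
```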
Generic full-nohz admission for ordinary budgeted compute threads is available
only through an explicit SchedulingContext-targeted compute lease and the same
fail-closed placement gates described above. The SQPOLL nohz state machine now
admits explicitly leased caller-thread rings when the SQPOLL worker is live,
single-consumer, and bounded by producer wake/deadline rollback. Broader
userspace-poller/device-queue admission, automatic CPU-isolation issuance, and
production realtime island admission remain future work; auto_nohz stays
disabled. Timeout-based auto-revoke landed 2026-05-30 15:22 UTC: a CpuIsolationLease
created with leaseLifetimeNs > 0 records an absolute expiry deadline,
auto-revokes through the existing generation-advancing cleanup on first
observation past it (reason=lease-expired), and the nohz activation record
carries the lifetime deadline so a tickless CPU rolls back at the next
timer/IPI recheck (lease-lifetime-expired disqualifier), bounded by
maxRevocationLatencyNs. A leaseLifetimeNs of 0 preserves the prior
revoke/cleanup-only lifecycle. The current
SQPOLL-driven activation is the bounded case: tick suppression for a
ring-coupled kernelSqpoll lease on the CPU running the preflight, rolled
back through lease-generation staling on revoke/cleanup, with the SQPOLL
ring-state re-check as defense-in-depth for any future SQPOLL-stop path that
does not stale the lease.
Lease revocation and cleanup are generation-aware. Explicit revoke, process
exit, service replacement through process termination, and session logout stale
the matching generation so old caps cannot keep isolation eligibility alive,
and rolling the matching lease’s active nohz window back to the periodic tick
is part of the same cleanup path.
make run-scheduler-cpu-isolation-lease is the broad QEMU proof for grant,
info, revoke, cleanup, real nohz activation and fail-closed rollback, bounded
SQPOLL start/sleep/stop, rollback labels, generic full-nohz, and SQPOLL nohz.
make run-scheduler-generic-sqpoll-nohz is the focused SQPOLL proof for
eligible ring admission, producer wake, SQPOLL service, rollback, and stale
owner rejection.
Phase E: endpoint donation and return
Synchronous endpoint delivery now carries a bounded internal donation token
when a caller thread with a bound active SchedulingContext delivers a CALL
to a receiver thread that has no scheduling context of its own. Donation is
strictly passive-server shaped: receivers that already have a scheduling
context keep their own authority, unbound callers donate nothing, and callers
that receive a donation token are blocked from returning to userspace until
the in-flight endpoint call returns or is canceled.
At delivery, the scheduler charges pre-donation caller runtime before moving
the context ledger to the receiver. While the receiver handles the endpoint
message, normal dispatcher runtime charging decrements the donated context.
When endpoint RETURN commits the caller completion, the scheduler first charges
receiver runtime since dispatch, then returns the remaining budget and
next-replenishment state to the caller’s thread metadata and rebinds the
SchedulingContext record to the caller. Return preflight failures leave the
in-flight donation in place, while application-exception RETURN,
invalid-result RETURN errors, delivery failure, return cancellation, endpoint
teardown, process/thread exit, and stale-caller cleanup return or clear the
donation before waking the caller and without allocating new emergency-path
storage. Nested donation of an already donated context is rejected; supporting
stacked donation is deferred until it has an explicit return-token stack
design.
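The charge-then-move order on CALL and RETURN can be sketched with a toy ledger. All names here (`Ledger`, `ThreadMeta`, `deliver_call`, `commit_return`) are hypothetical, and the model ignores cancellation, teardown, and the scheduler lock; it only shows that the caller is charged before the ledger moves and the receiver is charged before it moves back.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Ledger {
    pub remaining_ns: u64,
    pub next_replenish_ns: u64,
}

#[derive(Debug, Default)]
pub struct ThreadMeta {
    pub ledger: Option<Ledger>,
    /// True while this thread runs on a context donated by a caller.
    pub donated: bool,
    /// True while this thread must not return to userspace.
    pub blocked_on_donation: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DonateError {
    NestedDonation,
}

fn charge(l: &mut Ledger, ran_ns: u64) {
    l.remaining_ns = l.remaining_ns.saturating_sub(ran_ns);
}

/// CALL delivery. Ok(true) means the caller's context moved to the receiver.
pub fn deliver_call(
    caller: &mut ThreadMeta,
    receiver: &mut ThreadMeta,
    caller_ran_ns: u64,
) -> Result<bool, DonateError> {
    // Receivers with their own context keep it; unbound callers donate nothing.
    if receiver.ledger.is_some() || caller.ledger.is_none() {
        return Ok(false);
    }
    // An already donated context is not donated again.
    if caller.donated {
        return Err(DonateError::NestedDonation);
    }
    let mut ledger = caller.ledger.take().expect("checked above");
    charge(&mut ledger, caller_ran_ns); // pre-donation caller runtime
    receiver.ledger = Some(ledger);
    receiver.donated = true;
    caller.blocked_on_donation = true;
    Ok(true)
}

/// RETURN commit: charge the receiver, then hand the remainder back.
pub fn commit_return(caller: &mut ThreadMeta, receiver: &mut ThreadMeta, receiver_ran_ns: u64) {
    if !receiver.donated {
        return;
    }
    if let Some(mut ledger) = receiver.ledger.take() {
        charge(&mut ledger, receiver_ran_ns);
        caller.ledger = Some(ledger);
    }
    receiver.donated = false;
    caller.blocked_on_donation = false;
}
```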
make run-scheduling-context proves the behavior with a same-process endpoint
round trip. The caller binds a fresh context, burns CPU immediately before
CALL, the passive server burns CPU while servicing the endpoint CALL and again
immediately before RETURN, and after RETURN the caller observes the reduced
budget restored. The same smoke covers application-exception RETURN,
oversized-result RETURN under donation, and deterministic rejection of
A-to-B-to-C nested donation. It also submits a delivered donated CALL and then
uses cap_enter(0, 0) while the server delays RETURN, proving the donor cannot
continue outside the donated ledger. A fast-return variant covers the race where
the receiver returns before the caller commits to the donation-blocked scheduler
state. The smoke prints endpoint_donation=ok, endpoint_return=ok,
endpoint_exception_return=ok,
endpoint_invalid_return=ok, endpoint_nested_rejected=ok,
endpoint_donor_block=ok, endpoint_donor_fast=ok,
endpoint_donation_server, endpoint_donation_after,
endpoint_exception_return_after, endpoint_invalid_return_after,
endpoint_nested_after, endpoint_donor_block_elapsed_ns,
endpoint_donor_block_after, endpoint_donor_fast_elapsed_ns, and
endpoint_donor_fast_after.
Phase E: SchedulingContext notifications
Every SchedulingContext now owns fixed notification storage allocated at
context creation or bootstrap. The storage has two coalescing slots:
budgetDepleted and deadlineOrTimeout. Each slot records context
id/generation, a saturating sequence, a saturating coalesced-event count, the
last holder thread, remaining budget, the next replenishment/deadline
timestamp, and whether the holder was using an endpoint-donated context.
Runtime charge records depletion when remaining budget transitions to zero and
records deadline/timeout expiry against the same context generation. Failed
bind attempts do not arm a new budget/deadline window.
SchedulingContext.drainNotifications() returns typed observer results:
ok drains the matching fixed cells, revoked reports the current revoked
generation, and staleGeneration reports an old observer generation without
draining the current record. Explicit revoke() records an explicitRevoke
lifecycle event. These notifications explain already-enforced scheduler state;
they do not donate budget, reorder runnable entities, bypass throttling,
publish result caps, append unbounded queues, allocate on scheduler hard paths,
or imply auto-nohz/SQPOLL/tickless behavior. A pre-armed observer waiter/wakeup
path remains a future extension.
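A coalescing slot of this kind needs no queue. The sketch below shows one fixed cell; the field set is reduced, and the exact meaning of the coalesced-event count (events folded into an undrained cell) is an assumption of this sketch, as are the names `Slot`, `record`, and `drain`.

```rust
/// One fixed notification cell, e.g. budgetDepleted or deadlineOrTimeout.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct Slot {
    /// Saturating count of all events ever recorded in this cell.
    pub sequence: u64,
    /// Saturating count of events folded into the undrained cell (assumption).
    pub coalesced: u64,
    pub remaining_budget_ns: u64,
    pub pending: bool,
}

impl Slot {
    /// Record an event in place: no allocation, no unbounded queue.
    pub fn record(&mut self, remaining_budget_ns: u64) {
        self.sequence = self.sequence.saturating_add(1);
        if self.pending {
            self.coalesced = self.coalesced.saturating_add(1);
        }
        self.pending = true;
        self.remaining_budget_ns = remaining_budget_ns;
    }

    /// Drain the cell, returning a snapshot if anything was pending.
    pub fn drain(&mut self) -> Option<Slot> {
        if !self.pending {
            return None;
        }
        let snapshot = *self;
        self.pending = false;
        self.coalesced = 0;
        Some(snapshot)
    }
}
```

Three depletion events between drains yield a single snapshot with sequence 3 and two coalesced events, which is the shape of information an observer needs to see that it missed intermediate states.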
make run-scheduling-context proves the notification slice by repeatedly
draining a depleted context after coalescing, observing deadline expiry,
recording explicit revoke and stale-observer labels, and confirming that
endpoint-donated runtime records notification state on the donated context. The
smoke prints notification_coalescing=ok, deadline_notification=ok,
revoke_notification=explicitRevoke, stale_notification=staleGeneration,
and endpoint_donated_notification=ok.
Phase E: session logout lifecycle hook
UserSession.logout() now notifies the scheduler after the session liveness
cell transitions from live to logged out. That covers explicit
UserSession.logout() calls, including the remote DTO gateway logout command
and connection-teardown path because those paths already call the same kernel
UserSession.logout() method. The hook scans scheduler-owned process/thread
metadata for live processes whose immutable SessionContext shares the logged
out liveness cell, removes each non-donated matching thread binding from the
scheduler ledger, and asks the bound SchedulingContext record to advance its
generation and mark itself revoked. Old ordinary SchedulingContext grants
therefore report stale generation through info() with zero visible remaining
budget and InfoOnlyNoDispatchChange. The focused session-context smoke also
proves stale bindCallerThread() does not rebind, stale create() does not
publish a result cap, stale revoke() does not mutate the current metadata
generation, and stale notification draining reports a stale observer result.
The hook intentionally does not use session code as a second scheduling-context
ledger: session lifecycle code only flips liveness and notifies the scheduler,
and the scheduler owns the scan and binding removal. The scan takes one binding
at a time under the scheduler lock, drops that lock, then calls the
SchedulingContextExitCleanup record hook so it does not invert the existing
SchedulingContext record-lock to scheduler-lock order used by
bindCallerThread().
In-flight endpoint donation uses a conservative counted/skipped logout policy.
If the logged-out session owns a receiver thread that currently holds a
donated context, the logout hook records that the donated binding was skipped
rather than returning donor budget while the endpoint call remains in flight.
The focused session-context smoke proves the donor remains blocked in
cap_enter(0, 0) until the receiver returns, the hook reports
donation_inflight_skipped=1, and endpoint RETURN removes the receiver
binding while restoring only the reduced remaining budget to the donor. This
does not add a new logout-triggered cancellation semantic. Local owner-shell
exit now calls the held UserSession.logout() before clean shell process exit,
so the same scheduler hook observes shell logout with
stale_marked=0 donation_inflight_skipped=0 in the shell smoke. The ordinary
bound-context stale proof remains the focused session-context smoke, because
the normal shell does not hold a bound SchedulingContext. Process and thread
exit cleanup already have their own stale-context coverage and are unchanged.
Realtime islands, SQPOLL, auto-nohz, and CPU placement enforcement remain future Phase F/G work.
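The one-binding-at-a-time scan and the counted/skipped donation policy described above can be sketched as follows. This is a minimal illustration, not the kernel code: Binding, Ledger, LogoutReport, and logout_scan are hypothetical names, std::sync::Mutex stands in for the scheduler lock, and the record_hook closure stands in for the SchedulingContextExitCleanup record hook.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for one thread's scheduling-context binding.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Binding {
    pub thread_id: u64,
    pub session_id: u64,
    /// True while the binding is a donated context held by a receiver.
    pub donated: bool,
}

/// Hypothetical scheduler-owned ledger of bindings.
pub struct Ledger {
    pub bindings: Vec<Binding>,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct LogoutReport {
    pub stale_marked: usize,
    pub donation_inflight_skipped: usize,
}

/// Removes one non-donated binding of `session_id` per lock acquisition and
/// runs `record_hook` only after the scheduler lock has been released, so the
/// hook may take the record lock without inverting the lock order.
pub fn logout_scan(
    ledger: &Mutex<Ledger>,
    session_id: u64,
    mut record_hook: impl FnMut(Binding),
) -> LogoutReport {
    let mut report = LogoutReport::default();
    loop {
        let next = {
            let mut guard = ledger.lock().unwrap();
            let pos = guard
                .bindings
                .iter()
                .position(|b| b.session_id == session_id && !b.donated);
            let taken = pos.map(|i| guard.bindings.remove(i));
            taken
        }; // scheduler lock is dropped here
        match next {
            Some(binding) => {
                record_hook(binding);
                report.stale_marked += 1;
            }
            None => break,
        }
    }
    // Donated bindings stay in the ledger and are only counted.
    report.donation_inflight_skipped = ledger
        .lock()
        .unwrap()
        .bindings
        .iter()
        .filter(|b| b.session_id == session_id && b.donated)
        .count();
    report
}
```

In this sketch a donated binding is left in place for the endpoint RETURN path to remove, which mirrors the conservative policy of not returning donor budget at logout time.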
Phase D Task 4: migration fairness invariants
Phase D Task 4 (2026-05-08) made three migration-fairness invariants explicit:
- virtual_runtime_ns travels with the thread. It lives on Thread.cpu_accounting, not on a per-CPU slot, so a migration from CPU A to CPU B preserves the thread’s accumulated weighted-fair share. The accounting field was promoted out of cfg(measure) in Task 2 and continues to advance through charge_runtime regardless of which CPU charges the quantum.
- virtual_finish_ns is derived per enqueue, never committed. Every enqueue site – the initial publish in enqueue_ready_thread_on_slot_locked, the post-block requeue in enqueue_unblocked_thread_on_slot_locked, and the steal-insert in steal_from_sibling_queues_locked – routes through refresh_virtual_finish_ns_locked, which reads thread.weight, thread.latency_class, and thread.cpu_accounting.virtual_runtime_ns fresh and recomputes the WFQ ordering tag. The field is never carried as committed state across blocking and is never carried with the thread on migration; the destination CPU’s view of weight, latency class, and quantum decides the new tag.
- Steal recomputes at the destination. The pop-from-source step in steal_from_sibling_queues_locked is followed by refresh_virtual_finish_ns_locked against the destination slot before the ordered insert, so a SchedulingPolicyCap.setWeight that landed between source enqueue and steal takes effect at the steal itself.
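A minimal sketch of the three invariants, under stated assumptions: the tag formula (virtual runtime plus one weight-scaled quantum), the REFERENCE_WEIGHT value, and the function names here are illustrative and not taken from kernel/src/sched.rs, and the latency-class input is omitted for brevity.

```rust
use std::collections::VecDeque;

/// Assumed value; the real constant lives in capos-abi.
pub const REFERENCE_WEIGHT: u64 = 64;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Thread {
    pub id: u32,
    pub weight: u16,
    /// Travels with the thread across CPUs.
    pub virtual_runtime_ns: u64,
    /// Derived on every enqueue; never trusted across enqueues.
    pub virtual_finish_ns: u64,
}

/// Hypothetical tag formula: vruntime plus one weight-scaled quantum.
pub fn refresh_virtual_finish_ns(thread: &mut Thread, quantum_ns: u64) {
    let weight = u128::from(thread.weight.max(1));
    let scaled = u128::from(quantum_ns) * u128::from(REFERENCE_WEIGHT) / weight;
    thread.virtual_finish_ns = thread.virtual_runtime_ns.saturating_add(scaled as u64);
}

/// Refreshes the tag from current inputs, then inserts in ascending tag order.
pub fn enqueue_ordered(queue: &mut VecDeque<Thread>, mut thread: Thread, quantum_ns: u64) {
    refresh_virtual_finish_ns(&mut thread, quantum_ns);
    let at = queue.partition_point(|t| t.virtual_finish_ns <= thread.virtual_finish_ns);
    queue.insert(at, thread);
}

/// Pops the source head and re-enqueues it at the destination, which
/// recomputes the tag against the destination's quantum.
pub fn steal_front(
    source: &mut VecDeque<Thread>,
    dest: &mut VecDeque<Thread>,
    dest_quantum_ns: u64,
) -> bool {
    match source.pop_front() {
        Some(thread) => {
            enqueue_ordered(dest, thread, dest_quantum_ns);
            true
        }
        None => false,
    }
}
```

Because the steal goes through the same refreshing enqueue, a weight written while the thread sat in the source queue changes the tag at the steal, while virtual_runtime_ns is carried over untouched.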
Migrations counter shape
ThreadCpuAccounting.migrations is cfg(feature = "measure")-gated
and remains a benchmark-only operator-observability counter; it is
not load-bearing for ordering and is not exposed through
SchedulingPolicyCap.snapshot. Phase D Task 4 moved the increment
from the dispatch-time scheduled_measure path to two enqueue-time
arms in kernel/src/sched.rs:
- Placement-time spread (record_placement_spread_migration_locked) fires from push_reserved_run_queue_locked when the enqueue target slot differs from the thread’s previously dispatched CPU (ThreadCpuAccounting.last_cpu). A thread that has never been dispatched (last_cpu == None) does not register a migration on first publish; otherwise placement spread is counted exactly once per enqueue.
- Steal (record_steal_migration_locked) fires from steal_from_sibling_queues_locked after the source-queue removal and before the destination-queue insert. The steal scan skips the destination slot, so the counter increments unconditionally each time the steal arm is reached.
scheduled_measure still maintains last_cpu so the placement-spread
check has the previous CPU available; only the migrations++ moved.
The pre-collapse counter shape is preserved in steady state – a
thread that runs on a different CPU than its previous run still
records exactly one migration – but the increment is now attributed
to the enqueue decision (placement spread or steal) rather than the
dispatch that follows it.
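The two enqueue-time arms can be sketched as below. The struct mirrors only the two fields named above; the function names and signatures are simplified stand-ins for the locked kernel helpers and should be read as assumptions.

```rust
/// Simplified stand-in for the measure-gated part of ThreadCpuAccounting.
#[derive(Debug, Default)]
pub struct ThreadCpuAccounting {
    pub last_cpu: Option<usize>,
    pub migrations: u64,
}

/// Placement-time arm: counts only when the thread has run before and the
/// enqueue target differs from the CPU it last ran on.
pub fn record_placement_spread(acct: &mut ThreadCpuAccounting, target_slot: usize) {
    if let Some(prev) = acct.last_cpu {
        if prev != target_slot {
            acct.migrations += 1;
        }
    }
}

/// Steal arm: the scan never steals into the source slot, so every steal
/// is a migration.
pub fn record_steal(acct: &mut ThreadCpuAccounting) {
    acct.migrations += 1;
}

/// Dispatch still records the CPU so the next placement check has it.
pub fn record_dispatch(acct: &mut ThreadCpuAccounting, cpu: usize) {
    acct.last_cpu = Some(cpu);
}
```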
The aggregate process-wide thread_placement counter family in
kernel/src/measure.rs (migrations, migration_to_cpu0..3,
consumed by tools/qemu-thread-scale-harness.sh) is a separate
measurement device. It is incremented from
account_thread_selected_locked at dispatch time and continues to
observe “thread ran on a different CPU than its previously
dispatched CPU” rather than the per-thread Task 4 enqueue-time
shape, so the thread-scale harness regex does not need to change.
The per-thread ThreadCpuAccounting.migrations field and the
aggregate thread_placement counter intentionally measure different
events at different points in the scheduling pipeline; both stay
behind cfg(feature = "measure").
Phase H: per-thread saturation status surface
The Phase H AutoNoHz placement heuristic (a future policy-service
feature) needs to read per-thread saturation observation in the normal
dispatch build, not only under cfg(feature = "measure"). The
non-measure per-thread saturation status surface (2026-05-30)
promoted the inputs it consumes into ordinary ThreadCpuAccounting
state and exports them through SchedulingPolicyCap.snapshot @2:
- voluntary_blocks and preemptions moved out of cfg(feature = "measure"). They are charged at the same sites as before – voluntary_blocks when a thread blocks itself (cap_enter wait, park, endpoint scheduling-context donation) and preemptions when the timer requeues a still-runnable running thread – so the measure build’s counts are unchanged; only the cfg gate was removed. A low voluntary_blocks count distinguishes a CPU-saturating thread from an IPC/IO-bound one.
- runnable_accumulated_ns is a new always-built cumulative counter of runnable-but-not-running time. It is charged at the scheduler-lock-held enqueue/select boundary: push_reserved_run_queue_locked stamps a monotonic runnable_since_ns when a thread is published to a per-CPU run queue without being selected (idempotent across re-publish, so the whole runnable span is counted once), and account_thread_scheduled accumulates the monotonic delta and clears the stamp when the thread is next selected. The stamp/accumulate pair nets to zero for a thread selected at the same monotonic instant it becomes runnable. The clock is monotonic_ns() only (no wall-clock, no rewind), matching charge_runtime’s discipline, and the stamp respects the runnable-ownership rules above (a thread holds a live stamp only between enqueue and selection).
migrations stays measure-gated; it is a placement diagnostic, not a
saturation input. The surface exports raw cumulative counters only –
windowing, smoothing, and the saturation decision are policy-service
choices, never kernel state (see
docs/proposals/tickless-realtime-scheduling-proposal.md). Proof:
make run-thread-fairness reads the extended snapshot on the weighted
workers and asserts the CPU-bound hog reports high runtime_ns with
voluntary_blocks at or near zero while at least one preempted
lower-weight worker reports nonzero preemptions and
runnable_accumulated_ns.
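The stamp/accumulate pairing for runnable_accumulated_ns can be illustrated with a small sketch. The struct and method names are hypothetical; timestamps are passed in explicitly instead of calling monotonic_ns(), and the stamp is modeled as an Option.

```rust
/// Hypothetical stand-in for the always-built saturation fields.
#[derive(Debug, Default)]
pub struct RunnableAccounting {
    /// Set between enqueue and selection only.
    pub runnable_since_ns: Option<u64>,
    pub runnable_accumulated_ns: u64,
}

impl RunnableAccounting {
    /// Enqueue side: stamp once; a re-publish keeps the original stamp so
    /// the whole runnable span is counted a single time.
    pub fn stamp_runnable(&mut self, now_ns: u64) {
        if self.runnable_since_ns.is_none() {
            self.runnable_since_ns = Some(now_ns);
        }
    }

    /// Select side: add the monotonic delta and clear the stamp.
    pub fn account_selected(&mut self, now_ns: u64) {
        if let Some(since) = self.runnable_since_ns.take() {
            self.runnable_accumulated_ns += now_ns.saturating_sub(since);
        }
    }
}
```

A thread stamped at 100 ns, re-published at 150 ns, and selected at 400 ns accumulates 300 ns; a thread stamped and selected at the same instant accumulates nothing.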
Weight-change-while-enqueued contract
SchedulingPolicyCap.setWeight writes the validated weight directly
to Thread.weight through Process::set_thread_weight and does not
clear Thread.virtual_finish_ns. A weight change observed while the
thread is blocked, running, or already queued takes effect on the
next dequeue and re-enqueue because every enqueue site refreshes
virtual_finish_ns from current weight/latency_class/
virtual_runtime_ns. The kernel proves the contract two ways:
- By construction. Process::refresh_thread_virtual_finish_ns reads each input field fresh on every call; there is no cached derivation between enqueues. The function bears a doc-comment asserting the contract.
- By debug_assert!. Inside the same function, a debug assertion verifies that the recomputed virtual_finish_ns is at or beyond the current virtual_runtime_ns – a future deadline, never a past one. The assertion catches any future regression where the formula could underflow or where a stale cache could drift below the current vruntime.
The focused QEMU smoke that drives setWeight and verifies the
post-block dispatch picks up the new weight landed under Phase D
Task 5: make run-thread-fairness-weight-change (manifest
system-thread-fairness-weight-change.cue, demo
demos/thread-fairness/). Two competing child threads run a
fixed wallclock window: a baseline worker stays at
DEFAULT_WEIGHT, while a heavy worker self-calls
SchedulingPolicyCap.setWeight(weight=128) and then blocks on
Timer.sleep so it leaves the run queue before the contention
window opens. Each worker snapshots its scheduler state at wake
and at window end via SchedulingPolicyCap.snapshot, and the
parent verifies three independent properties: (1) the heavy
snapshot reads weight == 128 and the baseline snapshot reads
weight == DEFAULT_WEIGHT; (2) the observed runtime_ns ratio
matches the weight ratio inside a configured tolerance; (3) the
heavy worker’s virtual_runtime_ns advances at roughly half
the rate of its runtime_ns (vruntime/runtime ~= 0.5 for
weight=128, ~= 1.0 for DEFAULT_WEIGHT). A scheduler that
re-enqueued or dispatched the heavy worker using a stale
virtual_finish_ns derived from DEFAULT_WEIGHT would not
show the weight-proportional CPU share, and a scheduler that
held a stale weight inside charge_runtime would yield heavy
vruntime/runtime ~= 1.0 instead of ~= 0.5; the smoke trips on
either regression. The capability is bound to
CapCallContext::caller_thread (Phase D Task 2 decision), so
same-thread self-mutation is the only authorized shape for this
proof; cross-thread weight authority remains a Phase H
privileged scheduler-policy service concern.
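The vruntime/runtime ratios the smoke checks follow from the charge formula elapsed_ns * REFERENCE_WEIGHT / weight. The sketch below works through the arithmetic; the numeric values of REFERENCE_WEIGHT and DEFAULT_WEIGHT (both 64) are assumptions inferred from the stated ratios of roughly 1.0 at the default weight and roughly 0.5 at weight 128, and the function signature is simplified.

```rust
/// Assumed values; the real constants live in capos-abi/src/scheduler.rs.
pub const REFERENCE_WEIGHT: u64 = 64;
pub const DEFAULT_WEIGHT: u16 = 64;

#[derive(Debug, Default)]
pub struct CpuAccounting {
    pub runtime_ns: u64,
    pub virtual_runtime_ns: u64,
}

/// Charges one elapsed slice: runtime advances 1:1, vruntime is scaled by
/// REFERENCE_WEIGHT / weight using the weight current at charge time.
pub fn charge_runtime(acct: &mut CpuAccounting, elapsed_ns: u64, weight: u16) {
    acct.runtime_ns += elapsed_ns;
    let scaled =
        u128::from(elapsed_ns) * u128::from(REFERENCE_WEIGHT) / u128::from(weight.max(1));
    acct.virtual_runtime_ns += scaled as u64;
}

/// vruntime/runtime ratio in parts per thousand, for integer comparison.
pub fn vruntime_ratio_permille(acct: &CpuAccounting) -> u64 {
    if acct.runtime_ns == 0 {
        return 0;
    }
    acct.virtual_runtime_ns * 1000 / acct.runtime_ns
}
```

Under these assumed constants, ten 1 ms slices charged at weight 128 yield 10 ms of runtime and 5 ms of virtual runtime, while the same slices at the default weight yield equal values. A charge path that used a stale default weight would report a ratio near 1000 permille for the heavy worker.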
The thread-scale benchmark was repaired before accepting the milestone. The old
1 MiB/spinning-parent shape was not a valid four-core reference because the
matching Linux pthread baseline also failed at four workers. The accepted
benchmark shape uses a blocking parent join, 262,144 blocks (16 MiB), and
work_rounds=64. The formal accepted-evidence pair is the capos-bench
2026-05-02 21:38 UTC 5-run pair pinned to physical-core logical CPUs
0,1,2,3 against main commit 374f8556: capOS work 1.883x and total
1.787x clear the configured 1.6x gates, while the matching Linux
pthread baseline records 1.988x/1.987x. Its 1-to-4 row became the
diagnostic that justified Phase D’s fair-share enqueue policy: capOS
1.566x/1.538x versus Linux 3.963x/3.858x, a clear bottleneck
in the then-current single-global-queue scheduler. Phase D’s WFQ evidence on
2026-05-10 manually accepted the recorded 1-to-4 diagnostic with capOS
3.088x/2.700x and matching Linux 3.974x/3.850x on the same host/CPU
pin set. The harness still enforced only the configured 1-to-2 work/total
speedup gates. Historical pre-collapse 1-to-2
(1.828x/1.687x) and the post-collapse 3-run diagnostic on
capos-bench 2026-05-02 10:42 UTC (1.890x/1.792x,
1.504x/1.436x) remain in docs/benchmarks.md for reference.
Four-worker capOS scaling was a follow-up rather than a completed claim
under the pre-collapse model: the unsuppressed diagnostic recorded 1-to-4
work/total speedups 3.029x/2.386x, while suppressing scheduler switch
logs recorded 3.272x/2.303x; remaining guest-measure evidence pointed at global
Scheduler lock contention plus exit/join/block/schedule overhead, and normal
scheduler-owned execution is still capped at temporary CPU slots 0-3.
Each process currently owns one or more Thread records; each thread owns its
saved CPU context, kernel stack, FS base, block state, and – since Phase D
Task 2 – the WFQ ordering inputs weight: u16, latency_class: LatencyClass,
and virtual_finish_ns: u64. The Phase D constants in
capos-abi/src/scheduler.rs set the defaults weight = DEFAULT_WEIGHT and
latency_class = LatencyClass::Normal, so unmodified workloads observe no
behavior change versus the pre-Phase-D scheduler. virtual_finish_ns is
recomputed on every enqueue (Task 2 shipped the derivation; Task 3 consumes
it for ordered insertion) and is not meaningful while the thread is blocked.
Phase D Task 2 split the per-thread CPU accounting record so the WFQ-load-
bearing fields are available in the normal qemu build:
runtime_ns, virtual_runtime_ns, and last_started_ns are unconditional;
context_switches, preemptions, voluntary_blocks, migrations,
last_cpu, and the *_runtime_stable_observed and blocked/exited
bookkeeping stay behind the measure feature because they are pure
operator-observability counters that do not participate in dispatch ordering
and need a separate operator snapshot path. runtime_ns advances 1:1 with
elapsed CPU time, while virtual_runtime_ns advances by
elapsed_ns * REFERENCE_WEIGHT / weight so per-thread weight changes the
cumulative WFQ share rather than only the enqueue tag. The runtime-charge
path is invoked when a current thread stops running through timer preemption,
blocking cap_enter or park, thread/process exit, or direct switch/handoff
paths that select another current thread; the wrapping helpers in
kernel/src/sched.rs route through Process::charge_thread_runtime /
Process::account_thread_scheduled unconditionally now.
The SchedulingPolicyCap cap surface mutates these per-thread fields through
the caller-thread fallback binding selected in Phase D Task 2: every
method (setWeight, setLatencyClass, snapshot) routes to
CapCallContext::caller_thread, so a holder can only mutate or observe its
own running thread. Cross-thread or cross-process authority is reserved for
the Phase H privileged scheduler policy service. The
SchedulingPolicyCap.snapshot reply intentionally exposes only the four
fields promoted out of the measure feature gate;
context_switches/preemptions/voluntary_blocks/migrations are
benchmark-only and a future operator-observability slice may add them
through a separate cap. The BSP scheduler tick normally arrives through the
local APIC timer on vector 48 with LAPIC EOI after calibrating the LAPIC initial
count against PIT channel 2; if LAPIC setup or calibration is unavailable, the
kernel falls back to the legacy PIT/PIC IRQ0 path on vector 32. On each
user-mode timer tick (kernel-mode ticks bypass the scheduler entirely
through kernel_timer_interrupt_handler, as described under Design),
the kernel wakes timed-out or satisfied cap_enter and park waiters,
processes the current thread’s ring endpoint in timer mode, saves the
current thread context, picks the next ready thread from the current
CPU’s run queue (falling back to the bounded steal path when the local
queue has no runnable entry), switches CR3 when needed, updates
the current CPU’s kernel-entry stack through the per-CPU hook,
restores FS base, mirrors the next ThreadRef into the current
PerCpu, and returns to the next user context.
When APs are online and their LAPIC timers start, scheduler CPU slots 0-3 can
temporarily own scheduler/user execution. The earlier AP-owner proof kept the
BSP in kernel idle; the current same-process scaling slice allows sibling
threads with distinct ring endpoints to run on different scheduler CPUs while
processes that hold broad launch/authority caps or live endpoint objects
remain pinned to the legacy single-owner CPU. Additional APs beyond CPU 3 stay
in kernel idle until a later scheduler-owner policy replaces the temporary CPU
mask. The runnable queues are a per-CPU array of VecDeque<ThreadRef> shared
by the scheduler-owned CPUs under the global scheduler lock and ordered
ascending by virtual_finish_ns; process/thread metadata remains shared under
that lock. A bounded steal path migrates the most overdue sibling
candidate (each sibling queue’s first entry that the destination CPU
considers Runnable) when a CPU’s local queue has no runnable entry.
Syscall entry initializes kernel GS with swapgs, saves the user RSP through
the GS-relative PerCpu.user_rsp slot, and switches to the GS-relative
PerCpu.kernel_rsp slot. Normal syscall returns swap back before sysretq.
Blocking cap_enter, process exit, and ThreadControl.exitThread paths that
leave through scheduler iretq restore use restore_context_after_syscall so
GS ownership is returned to userspace before the next user context resumes.
Timer.sleep records a bounded scheduler waiter keyed by caller ThreadRef,
user data, and an absolute monotonic deadline_ns. Due sleeps validate the
thread generation, post an empty completion directly to the caller’s CQ, and
then flow through the same blocked cap_enter wake scan as other completions.
Each process has a separate sleep waiter quota, so one Timer holder cannot fill
the global sleep queue by itself.
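A minimal sketch of a bounded sleep queue with a per-process quota and generation-checked wakeups follows. SleepQueue, its capacities, and the live_generation callback are hypothetical; the kernel's actual waiter storage and CQ posting are not shown, and the Vec returned by take_due is a convenience of the sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ThreadRef {
    pub pid: u32,
    pub tid: u32,
    pub generation: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SleepWaiter {
    pub thread: ThreadRef,
    pub user_data: u64,
    /// Absolute monotonic deadline.
    pub deadline_ns: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SleepError {
    ProcessQuotaExceeded,
    QueueFull,
}

pub struct SleepQueue {
    waiters: Vec<SleepWaiter>,
    global_cap: usize,
    per_process_cap: usize,
}

impl SleepQueue {
    pub fn new(global_cap: usize, per_process_cap: usize) -> Self {
        Self { waiters: Vec::with_capacity(global_cap), global_cap, per_process_cap }
    }

    pub fn len(&self) -> usize {
        self.waiters.len()
    }

    /// Rejects a waiter once its process holds `per_process_cap` entries,
    /// so one process cannot fill the global queue.
    pub fn add(&mut self, waiter: SleepWaiter) -> Result<(), SleepError> {
        let owned = self.waiters.iter().filter(|w| w.thread.pid == waiter.thread.pid).count();
        if owned >= self.per_process_cap {
            return Err(SleepError::ProcessQuotaExceeded);
        }
        if self.waiters.len() >= self.global_cap {
            return Err(SleepError::QueueFull);
        }
        self.waiters.push(waiter);
        Ok(())
    }

    /// Removes every due waiter; returns only those whose recorded
    /// generation still matches the live thread.
    pub fn take_due(
        &mut self,
        now_ns: u64,
        live_generation: impl Fn(ThreadRef) -> Option<u32>,
    ) -> Vec<SleepWaiter> {
        let mut due = Vec::new();
        self.waiters.retain(|w| {
            if w.deadline_ns > now_ns {
                return true;
            }
            if live_generation(w.thread) == Some(w.thread.generation) {
                due.push(*w);
            }
            false
        });
        due
    }
}
```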
ThreadControl.setFsBase validates runtime-provided FS bases as user-canonical
addresses, updates the caller thread’s saved FS base, and writes the CPU FS
base immediately when the caller is the running thread. There is no
process-global FS base; context switch treats FS base as per-thread state.
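The validate-then-store shape of setFsBase might look like the sketch below. It assumes 4-level paging with 48-bit virtual addresses, where the user (lower) canonical half ends at 0x0000_8000_0000_0000; the kernel's real bound may be stricter. The names FsBaseError, set_fs_base, and the write_cpu_fs_base closure are illustrative.

```rust
/// Exclusive end of the lower canonical half with 48-bit virtual addresses.
pub const USER_CANONICAL_END: u64 = 0x0000_8000_0000_0000;

#[derive(Debug, PartialEq, Eq)]
pub enum FsBaseError {
    NotUserCanonical,
}

/// Only the per-thread saved FS base is modeled here.
pub struct Thread {
    pub saved_fs_base: u64,
}

pub fn is_user_canonical(addr: u64) -> bool {
    addr < USER_CANONICAL_END
}

/// Validates first, then updates the saved per-thread value, and writes the
/// CPU register only when the caller is the thread currently running.
pub fn set_fs_base(
    thread: &mut Thread,
    fs_base: u64,
    caller_is_running: bool,
    write_cpu_fs_base: impl FnOnce(u64),
) -> Result<(), FsBaseError> {
    if !is_user_canonical(fs_base) {
        return Err(FsBaseError::NotUserCanonical);
    }
    thread.saved_fs_base = fs_base;
    if caller_is_running {
        write_cpu_fs_base(fs_base);
    }
    Ok(())
}
```

Rejecting the value before any state changes keeps a failed call from leaving the saved FS base and the CPU register out of sync.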
The initial thread still uses the compatibility ring at RING_VADDR, while
each spawned child thread receives a kernel-chosen ring mapping in the process
ring arena. Run queues, per-CPU current, direct IPC handoff, Timer sleep
waiters, process/terminal waiters, endpoint caller/receiver records, and
deferred cancellation CQEs store generation-checked ThreadRef values and
route completions to the target thread’s ring endpoint. Process-owned thread
and kernel-stack ledger limits are enforced by ThreadSpawner.create before
additional thread records become runnable. The frozen contract is in
In-Process Threading. Park wait uses a separate
Blocked(Park { ... }) reason and park timeout/wake completions use reserved
CQE credits before marking generation-checked waiter threads runnable. The
authority and ABI contract is in Park Authority.
cap_enter(min_complete, timeout_ns) processes pending SQEs immediately. If
the requested completion count is not available and the timeout permits
blocking, the current thread enters Blocked(CapEnter { ... }) and the syscall
entry path switches to another runnable thread.
The LAPIC user-timer path enters sched::schedule() unconditionally on
every tick. An earlier slice carried a bounded user-mode continuation
fast path with a per-CPU one-skip budget and a release/acquire
slow-path-required summary; that path has been retired (see
docs/backlog/scheduler-evolution.md “Cleanup: Retire Benchmark-Driven
Scaffolding Before Phase D”). The fast path saved at most one scheduler
entry every other tick on an uncontended single-CPU-effective scheduler
while paying for shadow-state publication on every slow-path exit, so
the simpler always-schedule shape is preferred until a future Phase D
or Phase F slice ships an evidence pair where the fast path measurably
reduces scheduler-lock hold time on a contended SMP run.
When endpoint delivery satisfies a blocked server RECV, the scheduler can set a
direct IPC target. The next scheduling decision runs that server before ordinary
round-robin work when it is ready and its ThreadRef generation still matches
the captured direct target. When the direct slot is unavailable, endpoint
completions fall back to the queued path with WakePolicy::QueueCpu(slot)
targeting the current CPU’s per-CPU queue, so the wake scan probes the placed
CPU first.
Design
The implementation keeps ring dispatch outside the global scheduler lock. Timer dispatch extracts ring/cap/scratch handles, releases the scheduler lock, processes bounded SQEs, then reacquires the scheduler lock to choose the next thread. This prevents Cap’n Proto decode, serial output, and capability method bodies from running under the global scheduler lock.
There is no longer a slow-path-required summary or a per-CPU skip
budget for the user-mode timer path. Every user-mode LAPIC timer tick
enters sched::schedule(), which services run-queue entries, direct
IPC targets, deferred process termination/drop and thread-stack
cleanup, Timer sleep waiters, and blocked threads with timer-backed
cap_enter or Park timeouts under the scheduler lock. Those timeout paths
compare absolute monotonic deadlines, but periodic ticks still decide when the
checks run. Ring SQEs and ordinary cap waiters run on the same per-tick
cadence. Kernel-mode timer ticks (e.g., on AP cores parked in the kernel idle
loop) still go through kernel_timer_interrupt_handler, which sends EOI
without entering the scheduler. The shared advance_bsp_tick helper still
increments the compatibility TICK_COUNT only on CPU 0; normal runtime
accounting and timeout comparisons use monotonic_ns() instead. Future per-CPU
fair-share slices may reintroduce a continuation path under explicit Phase D
or Phase F authority; until then the always-schedule shape keeps the
scheduler’s authority over thread metadata and runnable ownership
single-source.
The runnable queues keep a single-owner contract behind the global
scheduler lock. A live generation-checked ThreadRef may have at most
one runnable dispatch owner across per-CPU current/handoff_current
slots, the per-CPU run queues, and the single direct_ipc_target
preference slot. Blocked waiters, sleep waiters, park waiters, endpoint
state, process waiters, and join waiters are not runnable owners; they
may make a thread ready only after liveness and generation checks
succeed.
Migration between per-CPU queues is represented as a scheduler-lock-
contained transfer, not as a second published owner. The source owner
is removed or popped first and the ThreadRef is then inserted in the
destination queue at the position determined by a freshly recomputed
virtual_finish_ns, or selected as the next running thread.
virtual_runtime_ns travels with the thread; virtual_finish_ns is
recomputed at every enqueue and never carried as committed state, so
weight or class mutations applied while the thread was blocked take
effect on the next dequeue and re-enqueue. Retry paths requeue the
candidate after dropping duplicate queued copies. Direct IPC keeps its
preference slot only while the target remains live and runnable; if
the direct target cannot run immediately, it falls back through the
normal queued-owner path on the current CPU’s per-CPU queue.
Idle-to-runnable wake targeting reuses the same ownership boundary. A
thread that becomes ready through endpoint completion, timer sleep,
park wake, process wait, or thread join is pushed to the placement
target’s per-CPU run queue, and wake_idle_scheduler_cpus_locked first
probes the placement target when the policy is QueueCpu, then walks
eligible idle scheduler CPUs to wake the first that accepts a fresh
reschedule IPI; CPUs that already have a pending IPI (or that fail
LAPIC delivery) are skipped without breaking the scan, so a burst of
ready work cross-wakes more than one neighbor instead of stranding the
rest behind one already-targeted CPU. Direct IPC uses the same path.
Measurement builds expose aggregate and per-phase counters for wake
scans, eligible idle CPUs, targeted CPUs, IPIs sent, already-pending
IPI skips, not-ready target skips, missing LAPIC targets, and send
failures.
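The wake scan described above, which probes the placement target first and keeps walking past CPUs that cannot take a fresh IPI, can be sketched as follows. IpiOutcome, WakeStats, and wake_idle_cpu are hypothetical names; the send_ipi closure stands in for LAPIC delivery, and only three of the listed counters are modeled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IpiOutcome {
    Sent,
    AlreadyPending,
    SendFailed,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct WakeStats {
    pub ipis_sent: u32,
    pub already_pending_skips: u32,
    pub send_failures: u32,
}

/// Probes `placement_target` first, then the remaining CPUs in slot order.
/// Returns the first idle CPU that accepted a fresh reschedule IPI.
pub fn wake_idle_cpu(
    idle: &[bool],
    placement_target: Option<usize>,
    mut send_ipi: impl FnMut(usize) -> IpiOutcome,
    stats: &mut WakeStats,
) -> Option<usize> {
    let order = placement_target
        .into_iter()
        .chain((0..idle.len()).filter(|cpu| Some(*cpu) != placement_target));
    for cpu in order {
        // Skip CPUs that are busy or outside the scheduler CPU set.
        if !idle.get(cpu).copied().unwrap_or(false) {
            continue;
        }
        match send_ipi(cpu) {
            IpiOutcome::Sent => {
                stats.ipis_sent += 1;
                return Some(cpu);
            }
            // A pending or failed IPI skips this CPU without ending the scan.
            IpiOutcome::AlreadyPending => stats.already_pending_skips += 1,
            IpiOutcome::SendFailed => stats.send_failures += 1,
        }
    }
    None
}
```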
Each per-CPU run queue is reserved up to the live runnable-capable thread count before publication; the shared live reservation count is released on process/thread exit or pre-publication rollback. Reserving each queue to the full live-thread count is required because the bounded steal path may migrate every live thread into a single sibling queue between two scheduler passes. Timer preemption, unblock, direct-IPC fallback, requeue, and steal-requeue paths therefore must not allocate while the thread is already live.
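A sketch of the reserve-before-publish rule, assuming plain u32 thread identifiers and a hypothetical RunQueues type: the only fallible step is the reservation, and every later push asserts that spare capacity already exists.

```rust
use std::collections::{TryReserveError, VecDeque};

pub const SCHEDULER_CPUS: usize = 4;

pub struct RunQueues {
    queues: [VecDeque<u32>; SCHEDULER_CPUS],
    live_reserved: usize,
}

impl RunQueues {
    pub fn new() -> Self {
        Self { queues: std::array::from_fn(|_| VecDeque::new()), live_reserved: 0 }
    }

    /// Fallible pre-publication step: grow every queue so it could hold all
    /// live threads at once, including the one about to be published.
    pub fn reserve_for_new_thread(&mut self) -> Result<(), TryReserveError> {
        let wanted = self.live_reserved + 1;
        for queue in &mut self.queues {
            queue.try_reserve(wanted.saturating_sub(queue.len()))?;
        }
        self.live_reserved = wanted;
        Ok(())
    }

    /// Exit or rollback path: give back one live reservation.
    pub fn release(&mut self) {
        self.live_reserved = self.live_reserved.saturating_sub(1);
    }

    /// Later enqueue paths rely on the reservation and must not allocate.
    pub fn push_reserved(&mut self, slot: usize, thread: u32) {
        let queue = &mut self.queues[slot];
        assert!(queue.len() < queue.capacity(), "run queue push would allocate");
        queue.push_back(thread);
    }

    pub fn capacity(&self, slot: usize) -> usize {
        self.queues[slot].capacity()
    }

    pub fn len(&self, slot: usize) -> usize {
        self.queues[slot].len()
    }

    pub fn live_reserved(&self) -> usize {
        self.live_reserved
    }
}
```

With three reserved threads, all three can be pushed into slot 0 without growing the queue, which is the worst case the steal path can produce.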
Process and thread exit cleanup proves the removal side of that ownership contract at the cleanup site. After removing queued owners and clearing a matching direct IPC target, the scheduler lock remains held while the kernel scans every per-CPU runnable queue and the direct target slot; any stale exiting process or thread reference is a kernel assertion failure. The focused spawn smoke asserts the corresponding serial proof markers on exercised process and thread exit paths.
The Phase C migration order is constrained by hardware state, not only by
scheduler data structures. The first gate moved syscall entry/exit off
BSP-symbol-relative PerCpu fields and onto KernelGsBase/swapgs on user
syscall paths, including blocking cap_enter, exit, and
ThreadControl.exitThread paths that leave through iretq rather than the
normal sysretq epilogue. The second gate added xAPIC initialization, a
PIT-calibrated BSP LAPIC timer tick, LAPIC EOI routing, AP LAPIC
initialization, a LAPIC spurious-vector handler, and an IPI vector plus bounded
vector-49-only fixed IPI send primitive. The third gate added address-space
resident CPU masks, per-CPU pending full-TLB flush generations, completion
waits, and a vector-49 TLB shootdown handler for user page-table map,
unmap, and protect. The fourth gate split current-thread tracking into
per-CPU slots, registers AP PerCpu records for current-thread and syscall
stack mirrors, updates AP TSS.RSP0 on context switches, and hands the single
scheduler-owner role to AP cpu=1 when it is online with a programmed LAPIC
timer.
The LAPIC slice replaces the BSP-oriented PIT/PIC scheduler tick on supported
QEMU and hardware paths. kernel/src/arch/x86_64/idt.rs keeps vector 32 for the
PIT/PIC fallback, reserves vector 48 for LAPIC timer delivery plus vector 49 for
cross-CPU requests, and installs vector 255 for LAPIC spurious interrupts.
pic.rs can remap and mask all legacy IRQs once LAPIC ticks are active, and
context.rs sends LAPIC EOI or PIC EOI according to the active timer source.
The IPI vector now handles TLB shootdown requests and bounded reschedule
requests for AP idle-to-runnable handoff.
The TLB slice wraps user page-table mutations that can affect an address space
resident on another CPU. AddressSpace::map, AddressSpace::unmap, and
AddressSpace::protect still perform the local x86_64 mapper flush, then
call the architecture shootdown helper with the address space’s resident CPU
mask. The helper records pending full-TLB flush generations for online resident
CPUs other than the caller, sends vector-49 IPIs, and returns a completion token.
Capability handlers drop the address-space guard and enqueue completion work;
cap_enter and timer polling drain that queue after ring dispatch releases the
cap-table and scratch locks. This keeps a remote syscall that is contending on
the same process locks from blocking maskable IPI delivery forever. Capability
handlers reserve fixed-size deferred queue slots before page-table mutation, so
full queues fail closed as capability overload errors instead of surfacing after
rollback, unmap, or protect has already changed state. Drains flush the current
CPU before waiting so a CPU that is itself in the target mask cannot wait on its
own pending generation. Target CPUs drain the generation in the IPI handler, at
syscall entry, or before returning to userspace from syscall, timer, and
scheduler restore paths.
Generation counters avoid losing overlapping shootdowns while a target CPU is
already draining a prior request. This relies on kernel user-buffer access
continuing through address-space-locked HHDM copy/read helpers rather than raw
user virtual addresses while a delayed flush generation exists. Callers include
VirtualMemoryCap dispatch through parse_map, parse_unmap, and
parse_protect, plus MemoryObjectCap::{map,unmap,protect} in
kernel/src/cap/frame_alloc.rs. Scheduler CR3 handoff now marks the selected
address space resident on the current CPU, including AP cpu=1 during the AP
scheduler-owner proof.
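Why a generation counter, rather than a single pending flag, keeps overlapping shootdowns from being lost can be shown with a small model. CpuTlbState and its methods are hypothetical, the atomics use std rather than kernel primitives, and the flush itself is a caller-supplied closure.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-CPU shootdown state.
pub struct CpuTlbState {
    requested: AtomicU64,
    completed: AtomicU64,
}

impl CpuTlbState {
    pub const fn new() -> Self {
        Self { requested: AtomicU64::new(0), completed: AtomicU64::new(0) }
    }

    /// Requester side: publish a new pending generation and return it as
    /// the completion token.
    pub fn request_flush(&self) -> u64 {
        self.requested.fetch_add(1, Ordering::AcqRel) + 1
    }

    /// Target side: flush, then mark complete only the generation observed
    /// before the flush started. A request that arrives mid-drain keeps a
    /// higher generation pending and forces another drain.
    pub fn drain(&self, flush_tlb: impl FnOnce()) -> u64 {
        let observed = self.requested.load(Ordering::Acquire);
        if observed > self.completed.load(Ordering::Acquire) {
            flush_tlb();
            self.completed.fetch_max(observed, Ordering::Release);
        }
        observed
    }

    /// Requester side: has the target drained at least `token`?
    pub fn is_complete(&self, token: u64) -> bool {
        self.completed.load(Ordering::Acquire) >= token
    }
}
```

With a boolean flag, a request that lands while the target is mid-flush could be cleared together with the earlier one; here its token stays incomplete until a later drain covers it.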
Idle paths
There are two distinct idle paths, and both run genuine CPL0 (kernel-mode) idle. There is no user-mode idle process: when no real work is runnable a CPU runs the kernel idle code at CPL0 on the kernel PML4. The two paths differ only in how the CPU got there.
The cooperative CPL0 kernel-mode idle path is the boot/AP path. start
(BSP), start_ap (APs), and the start_current_cpu loop call
next_start_context; when that returns no real runnable work they fall into
idle_current_cpu_once, which hlts at CPL0 on the per-CPU kernel stack with
interrupts enabled (no CpuContext, no restore_context — the same way
start_current_cpu itself runs). A kernel-CPL timer tick or reschedule IPI
taken during that hlt runs the kernel-mode handler
(kernel_timer_interrupt_handler / handle_reschedule_ipi, both of which call
nohz_recheck), so the nohz one-shot deadline is preserved and re-armed across
the hlt; control then returns to the loop, which re-checks for work.
idle_current_cpu_once increments the KERNEL_IDLE_HLT_ENTRIES counter and
emits a bounded
cpu-isolation: kernel-idle hlt cpu=… idle_path=cooperative-cpl0 … nohz_active=… timer_source=…
log line so this path is observable from the kernel log; the
run-scheduler-cpu-isolation-lease smoke asserts it is reached. Once any
dispatch path restore_contexts into a real thread, the start_current_cpu
frame is abandoned.
The steady-state CPL0 idle-thread path is reached from the four
interrupt/syscall-return dispatch call sites — schedule() (timer),
capos_block_current_syscall, exit_current, and exit_current_thread. When
choose_next_locked falls through to this CPU’s idle thread, each site builds
the dispatch tuple from the per-CPU CPL0 idle-thread context. The dispatch
call sites hand a CpuContext to assembly that restore_contexts (or, for the
timer path, return a context pointer plus a CR3 the timer handler loads), so
they need a schedulable context when no real work is runnable; the CPL0 idle
context is that context.
CPL0 idle-thread context infrastructure. arch::smp::init_idle_kernel_stacks
allocates one dedicated CPL0 idle kernel stack per scheduler CPU slot from
fresh contiguous frame ranges, so they do not overlap the boot kernel stacks,
the per-thread kernel stacks, or the IST slots. CpuContext::new_cpl0_idle
builds a kernel-shaped context (kernel-code/kernel-data selectors,
rip = kernel_idle_entry, rsp into the idle kernel stack). sched::sched_init,
called from kmain, constructs and stores one CpuContext per CPU slot in
CPL0_IDLE_CONTEXTS and then calls register_idle_process_locked to seed the
slot-0 synthetic idle Process record before the scheduler runs (this
keeps the BSP idle process’s low PID and the init-process PID ordering stable);
the remaining per-CPU slots are registered lazily by
current_cpu_idle_thread_locked the first time their CPU reaches idle.
sched_init panics on OOM, as does the lazy path: the CPL0 idle contexts and
the synthetic idle records are scheduler idle infrastructure and there is no
fallback idle path, so a failure to build them is unrecoverable. The idle
kernel stack is sized as a full per-thread kernel stack
(PROCESS_THREAD_KERNEL_STACK_PAGES), not an IST slot, because
kernel_idle_entry runs the deep service_periodic_work() call chain on it
(see periodic-service parity below).
Synthetic idle process records. The idle thread is never a runnable
user-mode process. The synthetic idle Process (Process::new_idle) maps no
user code, no user stack, and no cap ring, and carries an empty cap table. It
exists only so the idle ThreadRef resolves through sched.processes and the
scheduler’s ThreadRef-centric bookkeeping — set_thread_state,
account_thread_selected_locked, current-thread tracking, and the
is_idle_thread guard predicate used pervasively across the scheduler — keeps
working unchanged. Its address_space is a bare page-table root with nothing
user-mapped; it is required by the Process struct but is never loaded as
CR3. Every idle dispatch site routes the CPU onto the kernel PML4 via the
CPL0 idle context, so the synthetic idle AddressSpace is never made resident
and never participates in resident_cpu_mask or TLB-shootdown idle-residency
handling.
Dispatch-tuple rewire. After choose_next_locked returns, when the chosen
thread is idle_threads[current_cpu_slot()], each dispatch site builds the
dispatch tuple from the CPL0 context pointer, the dedicated idle kernel stack
top, the kernel PML4 CR3, and the current FS base (no FS-base change).
sched_init builds one CPL0 idle context per scheduler CPU slot or panics, so
cpl0_idle_context(slot) is infallible at every dispatch site. The
schedule() timer path does not route through a dedicated CR3-loading
restore helper: the existing timer_interrupt_handler already loads the
tuple’s CR3 with write_cr3 before the privilege-agnostic five-element
iretq. The three syscall-path sites (capos_block_current_syscall,
exit_current, exit_current_thread) keep their
restore_context_after_syscall restore tail: they are entered via
syscall_entry (which already executed swapgs), so the exit swapgs is
required to leave the CPL0 idle thread running with the user GS base — the
same GS-base state the timer path’s CPL0 idle thread runs with. Each site emits
a distinct marker: sched: dispatch idle cpu=N idle_path=cpl0-dispatch-timer
(timer), …cpl0-dispatch-block (blocking syscall), and …cpl0-dispatch-exit
(both exit_current and exit_current_thread). debug_assert!s guard the
CPL0 dispatch tuple: context cs/ss are the kernel selectors and their RPL
bits are 0.
CPL0 idle periodic-service parity. schedule()’s timer Phase 2 runs
periodic service work on every tick — deferred process drops, pending
terminations, wake_cap_waiters, service_sqpoll_workers(),
drain_pending_endpoint_cancellations(), terminal_session::poll_input(),
virtio::poll_scheduler(), and the network / pipe / interrupt poll_waiters()
calls. A CPL0 idle thread’s timer ticks are kernel-mode and go through
kernel_timer_interrupt_handler, which never enters schedule() — so without
explicit parity handling that servicing would be stranded whenever a CPU is
parked on the CPL0 idle thread. That work is factored into a single
service_periodic_work() function with one lock discipline: the scheduler lock
is taken only for the bounded deferred-drop / thread-stack-release /
wake_cap_waiters / pending-termination extraction, then dropped before
drop_pending_process / finish_terminated_process and the lock-free poll
block. schedule() calls it after ring dispatch; kernel_idle_entry is its
own cooperative loop that, each iteration, runs service_periodic_work(), then
next_start_context(false) to re-dispatch a real runnable thread the moment
one appears (allow_idle = false so it never re-selects the idle thread), then
idle_current_cpu_once() to hlt. The re-dispatch is required: without it a
kernel-mode timer tick taken during the idle hlt returns through
kernel_timer_interrupt_handler, which does not re-enter schedule(), so the
CPU would be stranded. service_periodic_work() and next_start_context() run
with interrupts disabled in that loop — the CPL0 idle context is built
IF=1 so the periodic tick can preempt the hlt, so the loop must cli
before the deep service call; otherwise a CPL0 timer tick taken during
service_periodic_work() nests a kernel_timer_interrupt_handler frame onto
the idle kernel stack (same-privilege interrupts do not switch stacks).
idle_current_cpu_once re-enables interrupts only across its enable_and_hlt
and disables them again before returning. There is no double-service: a CPU
running a real thread gets the service block via schedule(), a CPU on the
CPL0 idle thread gets it via the kernel_idle_entry loop, and a given tick on
a given CPU is CPL3 (schedule()) xor CPL0-idle (the loop). nohz cadence stays
honest because the loop iterates at the timer/IPI cadence — when the periodic
tick is suppressed the re-armed one-shot still wakes the hlt, so
service_periodic_work() still runs.
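The loop ordering described above can be modelled in a few lines. This is a minimal host-side sketch, not the kernel code: IdleCpuModel, its fields, and kernel_idle_entry_model are hypothetical stand-ins, interrupts are a plain boolean, and the returned thread id is illustrative. It only shows the service, re-dispatch, hlt order and the interrupts-disabled requirement around the deep service call.

```rust
/// Hypothetical stand-in for the per-CPU state the idle loop touches.
struct IdleCpuModel {
    interrupts_enabled: bool,
    service_runs: u32,
    halts: u32,
    /// Number of hlt wakeups after which a runnable thread appears (illustrative).
    runnable_after_halts: u32,
}

impl IdleCpuModel {
    /// Models service_periodic_work(): must run with interrupts disabled so a
    /// same-privilege timer tick cannot nest a frame onto the idle stack.
    fn service_periodic_work(&mut self) {
        assert!(!self.interrupts_enabled);
        self.service_runs += 1;
    }

    /// Models next_start_context(false): never returns the idle thread itself.
    fn next_start_context(&mut self) -> Option<u32> {
        assert!(!self.interrupts_enabled);
        if self.halts >= self.runnable_after_halts {
            Some(7) // illustrative thread id
        } else {
            None
        }
    }

    /// Models idle_current_cpu_once(): interrupts are enabled only across the hlt.
    fn idle_current_cpu_once(&mut self) {
        self.interrupts_enabled = true; // enable_and_hlt
        self.halts += 1; // woken by the periodic tick, a one-shot, or an IPI
        self.interrupts_enabled = false; // disabled again before returning
    }
}

/// One cooperative idle loop: service, try to re-dispatch, otherwise hlt.
fn kernel_idle_entry_model(cpu: &mut IdleCpuModel) -> u32 {
    loop {
        cpu.interrupts_enabled = false; // cli before the deep service call
        cpu.service_periodic_work();
        if let Some(thread) = cpu.next_start_context() {
            return thread; // the real loop would dispatch and not return
        }
        cpu.idle_current_cpu_once();
    }
}
```

In this model every wakeup from hlt leads to exactly one service pass before the next dispatch attempt, which is the property the parity argument relies on.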
iretq CPL0 restoration invariant and CPL0 idle-thread prerequisites
This subsection records the load-bearing x86-64 architectural invariant that the CPL0 idle-thread context depends on and that any future change to it must continue to satisfy, along with the prerequisites the implementation meets.
Authoritative reference: Intel 64 and IA-32 Architectures Software
Developer’s Manual (SDM), Volume 2A, IRET/IRETQ instruction reference,
“Operation” pseudocode (the IF OperandSize = 64 / 64-bit-mode path), and
Volume 3A, Section 6.14.3 “Returning from an Exception or Interrupt
Procedure.” The description below applies to IRETQ in 64-bit long mode;
the legacy 32-bit IRET paths behave differently and are called out
explicitly where it matters.
iretq frame layout and the 64-bit unconditional five-element pop.
iretq in 64-bit long mode unconditionally pops five 64-bit (8-byte)
values from the top of the current kernel stack, in order: RIP, CS,
RFLAGS, RSP, SS. This is true regardless of whether the privilege
level changes — both a CPL0→CPL3 return and a CPL0→CPL0 return consume the
same five-element frame and load RSP:SS from it. AMD deliberately removed
the legacy conditional stack switch for long mode: the “skip SS:ESP on a
same-privilege return” behavior exists only in the legacy 32-bit IRET
operand-size paths, never in IRETQ.
- CPL0 → CPL3 (privilege change, ring exit): The target CS has RPL=3, which differs from the current CPL=0. The CPU installs RIP, CS, and RFLAGS from the frame, then loads RSP and SS from the same frame and transfers to the user-space instruction at RIP on the user stack.
- CPL0 → CPL0 (same-privilege, no ring change): The target CS has RPL=0, matching the current CPL=0. iretq still pops all five elements: it installs RIP, CS, and RFLAGS, and also loads RSP and SS from the frame, exactly as in the CPL3 case. There is no same-privilege short-circuit in 64-bit mode. The practical consequence for a CPL0 restore is the opposite of the legacy intuition: the frame’s rsp and ss fields are load-bearing and must carry a valid kernel stack pointer and a valid RPL=0 stack selector, because the CPU will load them.
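The five-element frame and the selector privilege bits can be illustrated with a small sketch. IretqFrame, rpl, and is_kernel_shaped are hypothetical names introduced here for illustration and are not the kernel's CpuContext. The RPL is the low two bits of a selector, which is an architectural fact; the stack-range check is an assumed way to express "a valid kernel stack pointer".

```rust
use std::mem::size_of;

/// The five 64-bit values iretq pops in 64-bit mode, lowest address first.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct IretqFrame {
    rip: u64,
    cs: u64,
    rflags: u64,
    rsp: u64,
    ss: u64,
}

// Five 8-byte slots, with no padding under repr(C).
const _: () = assert!(size_of::<IretqFrame>() == 40);

/// RFLAGS interrupt-enable flag (bit 9).
const RFLAGS_IF: u64 = 1 << 9;

/// Requested privilege level: the low two bits of a segment selector.
fn rpl(selector: u64) -> u8 {
    (selector & 0b11) as u8
}

/// Hypothetical check that a frame is safe to iretq into at CPL0:
/// both selectors are RPL=0, rsp lies inside the given kernel stack,
/// and IF is set so the periodic tick can wake the idle hlt.
fn is_kernel_shaped(frame: &IretqFrame, stack_base: u64, stack_top: u64) -> bool {
    rpl(frame.cs) == 0
        && rpl(frame.ss) == 0
        && frame.rsp > stack_base
        && frame.rsp <= stack_top
        && (frame.rflags & RFLAGS_IF) != 0
}
```

With the selector values quoted in this section, rpl(0x23) and rpl(0x1B) are 3 for the user selectors, while rpl(0x08) and rpl(0x10) are 0 for the kernel selectors.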
Current code. restore_context (kernel/src/arch/x86_64/context.rs
lines 311–328) sets RSP to the supplied CpuContext pointer, pops all
fifteen caller-saved and callee-saved GPRs (lines 315–327), and executes
iretq (line 328). The CpuContext struct (context.rs lines 133–155)
places rip, cs, rflags, rsp, and ss at the high end of the struct
(lines 150–154), matching the hardware interrupt-frame layout that the CPU
pushes when it enters the timer interrupt handler. The comment at line 149
(“Pushed by CPU on interrupt from Ring 3”) reflects how every CpuContext is
populated today, but the five-element iretq frame itself is not
CPL3-specific — iretq consumes the same five elements for any target CPL.
User-thread contexts. Every user-thread CpuContext is built by
Thread::new_user (kernel/src/process.rs), which sets
cs = sel.user_code.0 as u64 (RPL=3, value 0x23) and
ss = sel.user_data.0 as u64 (RPL=3, value 0x1B). Every iretq issued by
restore_context or restore_context_after_syscall into a user thread is
therefore a CPL0→CPL3 privilege change into a fully user-shaped context.
CPL0 idle contexts coexist with user contexts. The blocker for a CPL0
target is not iretq frame arithmetic: iretq pops the same five elements
for a CPL0 target as for a CPL3 target, so a frame carrying kernel selectors
and a valid kernel rsp iretqs correctly. The real requirements are in the
surrounding dispatch plumbing, all of which the CPL0 idle path satisfies:
- CR3. The dispatch call sites set CR3 to the kernel PML4 for the CPL0 idle path, not to any user AddressSpace page table. The synthetic idle Process’s AddressSpace is never loaded as CR3.
- swapgs / GS-base. A CPL0 idle context was never entered through the syscall path. The schedule() timer path reaches it through the timer handler’s own CR3 load and the privilege-agnostic iretq tail (no swapgs in that path at all). The three syscall-path sites (capos_block_current_syscall, exit_current, exit_current_thread) keep their restore_context_after_syscall tail: those sites were entered via syscall_entry (which already swapgsed), so the exit swapgs is required to undo it, leaving the CPL0 idle thread running with the user GS base, the same state the timer path produces.
- Kernel-code and kernel-data selectors. A CPL0 CpuContext uses cs = sel.kernel_code.0 as u64 (RPL=0, value 0x08) and ss = sel.kernel_data.0 as u64 (RPL=0, value 0x10). Because iretq loads ss unconditionally in 64-bit mode, ss must be a valid RPL=0 stack selector; the GDT data-selector privilege checks require an RPL=0 ss to be paired with an RPL=0 cs, so the whole context (cs, ss, rsp, CR3, GS base) is kernel-shaped together.
- Idle kernel stack. Each CPL0 idle thread has its own dedicated kernel stack (arch::smp::init_idle_kernel_stacks) that does not overlap any IST slot, any per-thread kernel stack, or the BSP/AP boot stacks. Because iretq loads rsp from the frame, the context’s rsp points into this dedicated stack. It is sized as a full per-thread kernel stack because kernel_idle_entry runs the deep service_periodic_work() call chain on it.
- No user AddressSpace residency. The synthetic idle Process’s AddressSpace is never made resident and never participates in resident_cpu_mask, so TLB shootdown never stalls waiting for an idle CPU.
- No blocking, no exit. The idle thread never calls cap_enter, parks, blocks on any waiter, or exits. The Invariants section entry “The idle thread must never block in cap_enter or exit” carries forward unchanged.
CpuContext::new_cpl0_idle builds the kernel-shaped context,
sched::kernel_idle_entry is the entry point, and sched::sched_init wires
the per-CPU CPL0 idle contexts and seeds the slot-0 synthetic idle process
record (the remaining slots’ records are registered lazily by
current_cpu_idle_thread_locked). All four dispatch call sites — schedule(),
capos_block_current_syscall, exit_current, exit_current_thread — route
idle dispatch onto the CPL0 idle context: the timer path returns the CPL0
context pointer plus the kernel PML4 CR3 in its dispatch tuple and relies on
the existing timer_interrupt_handler CR3-load; the three syscall-path sites
keep their restore_context_after_syscall tail so the syscall-entry swapgs
is undone. The CPL0 contexts are kernel-shaped across cs, ss, rsp, and
CR3 together.
Measurement Policy
Design grounding for this policy: this document’s scheduler invariants,
docs/backlog/scheduler-evolution.md,
docs/proposals/scheduler-evolution-proposal.md,
docs/research/future-scheduler-architecture.md,
docs/research/out-of-kernel-scheduling.md,
docs/research/nohz-sqpoll-realtime.md, and
docs/research/completion-ring-threading.md. In particular,
docs/research/future-scheduler-architecture.md keeps the always-on versus
benchmark-only scheduler telemetry split as an open scheduler question, and the
current answer is intentionally conservative.
The current kernel/src/measure.rs counters are benchmark instrumentation, not
normal operator observability. They stay behind the measure feature and
CAPOS_THREAD_SCALE_GUEST_MEASURE=1 because they add atomics, cycle-counter
reads, phase bookkeeping, and in some cases sampled user RIP values to hot
scheduler, timer, TLB, ring, and serial paths. Normal QEMU and dispatch builds
must not depend on those counters being present.
The per-thread runtime-accounting ledger is split. The WFQ load-bearing core
fields, runtime_ns, virtual_runtime_ns, and last_started_ns, are
unconditional normal-build state on ThreadCpuAccounting: WFQ ordering,
SchedulingPolicyCap.snapshot, and SchedulingContext budget charging depend
on them outside cfg(feature = "measure"). The diagnostic fields
(context_switches, preemptions, voluntary_blocks, migrations,
last_cpu, blocked/exited stability probes, placement buckets, and per-phase
attribution counters) stay behind the measure feature. Permanent operator
observability is still separate work: it should expose low-rate, non-symbolic
snapshots derived from the unconditional ledger plus event counters such as
runnable queue depths or high-water marks, reschedule IPI sent/failed/pending
counts, TLB shootdown request/failure counts, and scheduler policy admission
or denial counts. Those counters must not allocate, log, read raw user PCs, or
perform cycle-timing in timer, unblock, direct-IPC fallback, requeue, or
steal-requeue paths.
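A minimal sketch of the ledger split, under stated assumptions: the three core field names come from the text, but Option for last_started_ns, the BASE_WEIGHT constant, and the weight-scaling formula are assumptions made for illustration, not the kernel's definitions. The single cfg-gated field stands in for the whole diagnostic group.

```rust
/// Hypothetical reference weight: a thread at this weight accrues virtual
/// time at the same rate as wall-clock running time.
const BASE_WEIGHT: u64 = 1024;

#[derive(Debug, Default)]
struct ThreadCpuAccounting {
    // Unconditional WFQ core (normal builds).
    runtime_ns: u64,
    virtual_runtime_ns: u64,
    last_started_ns: Option<u64>,
    // Diagnostic fields stay behind the benchmark feature.
    #[cfg(feature = "measure")]
    context_switches: u64,
}

impl ThreadCpuAccounting {
    /// Record the monotonic timestamp at which the thread starts running.
    fn on_selected(&mut self, now_ns: u64) {
        self.last_started_ns = Some(now_ns);
    }

    /// Charge only the interval the thread actually ran; a deschedule with
    /// no matching start (for example while blocked) charges nothing.
    fn on_descheduled(&mut self, now_ns: u64, weight: u64) {
        if let Some(start) = self.last_started_ns.take() {
            let delta = now_ns.saturating_sub(start);
            self.runtime_ns += delta;
            // Assumed scaling: heavier weights accrue virtual time more slowly.
            let scaled = delta as u128 * BASE_WEIGHT as u128 / weight.max(1) as u128;
            self.virtual_runtime_ns += scaled as u64;
        }
    }
}
```

Both methods are allocation-free and read no cycle counters, which is the shape the emergency-path constraint above asks of any normal-build counter.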
Benchmark-only attribution stays in measure: per-phase thread-scale
checkpoints, guest cycle timings for ring/capnp/method/scheduler segments,
scheduler-lock wait and hold cycles, scheduler-lock site attribution, serial
byte attribution, timer-mode breakdown, CR3/TLB event totals, thread-placement
selection/migration buckets, raw user-PC samples,
logging-suppression A/B evidence, and workload/cacheline diagnostics. The
publish-placement publish/caller-aware buckets were retired with the per-CPU
run-queue collapse. Phase D shipped the fair-share enqueue policy but did not
reintroduce those placement counters.
A future branch may promote a specific event count only by adding the
normal-build storage/API and proving the same emergency-path constraints; it
should not simply remove the current cfg(feature = "measure") boundaries from
the benchmark module.
The publish-placement publish/caller-aware buckets are still retired;
Phase D Task 3 brought back per-CPU placement semantics but does not
re-emit the publish counters. Re-instate them through a separate
operator-observability slice that proves the same emergency-path
constraints, not by removing the existing cfg(feature = "measure")
boundary on the historical buckets.
Tickless idle is enabled only for true idle. A scheduler-owned CPU may mask the
periodic LAPIC tick when it is running the CPL0 idle context, has no runnable
non-idle work, has no active CpuIsolationLease nohz record, has no local
deferred cleanup, has no cap-enter polling dependency, and the one-shot
clockevent plus non-tick-derived monotonic clocksource are available. The
replacement one-shot is bounded by the nearest Timer/ParkSpace deadline or
a 100 ms idle housekeeping floor, and the scheduler restores
periodic mode before non-idle dispatch, reschedule-IPI wake, or rollback.
Cap-enter polling waiters, including the current terminal shell path, and
ready threads paused in a SchedulingContext retry window keep the periodic
tick until those dependencies move behind explicit deadlines or housekeeping
placement.
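The true-idle conditions above amount to a conjunction plus a bounded one-shot deadline. The sketch below is an assumption-laden model: IdleTickInputs and both function names are hypothetical, and "bounded by the nearest deadline or the 100 ms floor" is read here as taking whichever comes first.

```rust
/// 100 ms idle housekeeping bound from the text, in nanoseconds.
const IDLE_HOUSEKEEPING_FLOOR_NS: u64 = 100_000_000;

/// Hypothetical snapshot of the conditions checked before masking the tick.
#[derive(Clone, Copy, Debug)]
struct IdleTickInputs {
    on_cpl0_idle_context: bool,
    runnable_non_idle: usize,
    isolation_lease_nohz_active: bool,
    local_deferred_cleanup: bool,
    cap_enter_polling_dependency: bool,
    retry_window_ready_threads: bool,
    oneshot_clockevent_available: bool,
    monotonic_clocksource_available: bool,
}

/// True only when every true-idle condition holds.
fn may_mask_periodic_tick(s: &IdleTickInputs) -> bool {
    s.on_cpl0_idle_context
        && s.runnable_non_idle == 0
        && !s.isolation_lease_nohz_active
        && !s.local_deferred_cleanup
        && !s.cap_enter_polling_dependency
        && !s.retry_window_ready_threads
        && s.oneshot_clockevent_available
        && s.monotonic_clocksource_available
}

/// One-shot deadline: the nearest Timer/ParkSpace deadline, capped at the
/// housekeeping bound and never earlier than now.
fn oneshot_deadline_ns(now_ns: u64, nearest_deadline_ns: Option<u64>) -> u64 {
    let cap = now_ns.saturating_add(IDLE_HOUSEKEEPING_FLOOR_NS);
    match nearest_deadline_ns {
        Some(deadline) => deadline.min(cap).max(now_ns),
        None => cap,
    }
}
```

Keeping the decision a pure function of a snapshot makes the rollback path simple: if any input changes, the caller restores periodic mode and re-evaluates.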
Generic full-nohz for ordinary budgeted compute threads carries the clockevent/deadline substrate into the CPU-isolation state machine and suppresses ticks only after network polling, IRQ affinity, accounting, deadline, lifetime, and rollback obligations pass. SQPOLL nohz applies the same substrate to explicitly leased caller-thread rings once the SQPOLL worker is live and the single-consumer, owner-lease, wake, and rollback gates pass. Automatic policy issuance and broader SQPOLL userspace-poller/device-queue admission remain separate later CPU-isolation features; see Tickless and Realtime Scheduling and NO_HZ, SQPOLL, and Realtime Scheduling.
Exit switches to the kernel PML4 before tearing down the exiting address space,
releases capability authority, completes process waiters, defers final process
teardown until the scheduler is running on another kernel stack, and then
releases remaining thread kernel stacks through the scheduler-owned
OffStackToken path before the Process value is dropped.
Invariants
- The idle thread must never block in cap_enter or exit.
- Ring dispatch must not hold the scheduler lock.
- Timer dispatch copies current-process user buffers through that process’s locked AddressSpace; it must not rely on a raw current-CR3 validate/use window.
- Blocked cap_enter waiters wake when enough CQEs are available or their finite timeout expires.
- Timer sleep waiters must be bounded per process, tied to the caller ThreadRef generation, and removed when the caller process exits.
- Runtime-controlled FS bases must stay in user canonical space.
- Direct IPC handoff is a scheduling preference, not a bypass of process liveness, generation, or state checks.
- The scheduler must update TSS.RSP0 and the per-CPU syscall kernel RSP through percpu::set_kernel_entry_stack on each switch.
- Each PerCpu.current_thread mirrors that CPU’s scheduler current slot; the scheduler lock remains the authority for current-thread and queue ownership even though dispatch/runnable state is now separate from shared process and thread metadata.
- Each live ThreadRef may appear in the per-CPU runnable queues at most once across all queues, and every per-CPU queue’s capacity must be reserved up to the live runnable-capable thread count before a new process or thread becomes runnable.
- A live generation-checked ThreadRef must have at most one runnable dispatch owner across per-CPU current/handoff_current slots, the per-CPU runnable queues, and the direct IPC target.
- Queue migration (including the bounded steal path) must be a scheduler-lock-contained remove-before-publish transfer; no path may publish the same ThreadRef twice into any queue or leave a stale direct target after exit. Migration must recompute virtual_finish_ns at the destination and never carry the source’s WFQ tag as committed state.
- Each per-CPU run queue must remain ordered ascending by virtual_finish_ns after every enqueue, requeue, or steal-requeue. Local selection scans the queue by index for the first destination-Runnable entry; RetryLater entries are left in place for the next scheduler pass. The bounded steal path scans each sibling queue’s indices ascending for that queue’s first Runnable-for-destination entry (because each queue is ordered ascending, the first Runnable hit per queue is the lowest virtual_finish_ns candidate the destination can accept on that source), then picks the source queue whose first-Runnable candidate has the lowest virtual_finish_ns globally, with ties broken by lower CPU id. The chosen entry is removed from its actual position on the source queue (not necessarily the head).
- Process and thread exit cleanup must assert, before releasing the scheduler lock, that the exiting process or thread has no remaining entry in any per-CPU runnable queue and no remaining direct IPC target slot.
- Timer, unblock, direct-IPC fallback, requeue, and steal-requeue paths must use reserved run-queue capacity and avoid allocation.
- Runtime accounting must use the normal monotonic clocksource, not benchmark-only cycle counters, and must charge only running intervals.
- FS base is saved and restored across context switches for TLS.
- Thread records remain generation-checked ThreadRef identities; exited records are retained only while a live handle, pending join, or unjoined status can still observe them.
- The final teardown of an exiting process must not release thread kernel stacks until another kernel stack is active, and the implicit Thread::Drop path must not free kernel-stack frames.
- A scheduler CPU must never run the same generation-checked ThreadRef twice at once; same-process siblings may run on different scheduler CPUs only when their completions route through distinct per-thread ring endpoints.
- Park waiters must be keyed by generation-checked ThreadRef values, reserve one waiter CQE credit, and must not allocate in wait, wake, timeout, or process-exit cleanup paths.
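The ordering and steal-selection invariants can be sketched on plain VecDeque queues. This is an illustrative model, not the code in kernel/src/sched.rs: Entry, its allowed_cpus mask (a stand-in for the Runnable-for-destination check), and the recompute_finish callback are all assumptions introduced here.

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Entry {
    thread: u32,
    virtual_finish_ns: u64,
    /// Hypothetical bitmask of CPUs that may run this thread right now.
    allowed_cpus: u8,
}

/// Insert while keeping the queue ascending by virtual_finish_ns.
/// Capacity is expected to be reserved up front, so this never allocates.
fn enqueue_ordered(queue: &mut VecDeque<Entry>, entry: Entry) {
    debug_assert!(queue.len() < queue.capacity(), "capacity must be pre-reserved");
    let pos = queue.partition_point(|e| e.virtual_finish_ns <= entry.virtual_finish_ns);
    queue.insert(pos, entry);
}

/// Pick (source cpu, index) of the steal candidate: per sibling queue, the
/// first entry the destination can run; globally, the lowest tag, with ties
/// going to the lower CPU id.
fn pick_steal(queues: &[VecDeque<Entry>], dest_cpu: usize) -> Option<(usize, usize)> {
    let mut best: Option<(u64, usize, usize)> = None;
    for (cpu, queue) in queues.iter().enumerate() {
        if cpu == dest_cpu {
            continue;
        }
        let hit = queue
            .iter()
            .enumerate()
            .find(|(_, e)| (e.allowed_cpus & (1 << dest_cpu)) != 0);
        if let Some((index, e)) = hit {
            // Strict '<' keeps the earlier (lower) CPU id on equal tags.
            if best.map_or(true, |(v, _, _)| e.virtual_finish_ns < v) {
                best = Some((e.virtual_finish_ns, cpu, index));
            }
        }
    }
    best.map(|(_, cpu, index)| (cpu, index))
}

/// Remove-before-publish migration that recomputes the tag at the destination.
fn steal(
    queues: &mut [VecDeque<Entry>],
    dest_cpu: usize,
    recompute_finish: impl Fn(&Entry) -> u64,
) -> Option<Entry> {
    let (cpu, index) = pick_steal(queues, dest_cpu)?;
    let mut entry = queues[cpu].remove(index)?; // removed from its actual position
    entry.virtual_finish_ns = recompute_finish(&entry);
    enqueue_ordered(&mut queues[dest_cpu], entry);
    Some(entry)
}
```

Because the entry leaves the source queue before it is inserted at the destination, no intermediate state holds the same thread in two queues, which is the at-most-one-owner property in miniature.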
Code Map
- kernel/src/sched.rs - shared process table plus SchedulerDispatch ownership of the per-CPU runnable queues (ordered ascending by virtual_finish_ns), per-CPU current/handoff slots, idle-thread slots, direct IPC target, run-queue reservation accounting, pending drops, and pending stack releases; also blocking, wakeups, Timer sleep waiters, the bounded steal path, and exit.
- kernel/src/arch/x86_64/context.rs - CPU context layout, timer entry/restore, tick counter.
- kernel/src/arch/x86_64/idt.rs - timer and IPI interrupt handler wiring.
- kernel/src/arch/x86_64/lapic.rs - xAPIC MMIO setup, PIT-calibrated LAPIC timer, LAPIC EOI, spurious-vector handling, and fixed-IPI send primitive.
- kernel/src/arch/x86_64/tlb.rs - serialized vector-49 TLB shootdown request, pending flush generations, completion token, and interrupt/user-return drain path.
- kernel/src/arch/x86_64/pic.rs and kernel/src/arch/x86_64/pit.rs - legacy PIC remap and PIT fallback setup.
- kernel/src/arch/x86_64/gdt.rs - BSP/AP TSS and kernel stack storage.
- kernel/src/arch/x86_64/syscall.rs - blocking syscall transition for cap_enter.
- kernel/src/arch/x86_64/percpu.rs - per-CPU syscall stack registry, TSS.RSP0 update hook, and current thread storage.
- kernel/src/arch/x86_64/tls.rs - FS base save/restore.
- kernel/src/process.rs - process state, kernel stacks, the synthetic idle process record, and per-thread CPU accounting storage/accessors.
Validation
- make run-smoke validates timer preemption, ring fairness, direct IPC handoff, blocked cap_enter wakeups, process exit, and clean halt.
- make run-spawn validates process wait blocking and child exit completion through ProcessHandle.wait, Timer monotonic now/sleep completion through timer-smoke, per-process sleep quota isolation through timer-flood, and thread/park lifecycle behavior through thread-lifecycle.
- make run-measure validates the post-thread park blocked/resume timing path and process exit while a park waiter is parked.
- cargo build --features qemu verifies QEMU-only scheduler and halt paths.
- QEMU smoke output for IPC includes direct handoff diagnostics when the server is woken from a blocked RECV.
Open Work
- Prove SQPOLL/poller progress that does not depend on periodic scheduler ticks before automatic nohz activation. Then implement tickless idle only for no-runnable-work CPU idle. Keep runnable contention on periodic preemption until the activation proof closes the remaining network polling, IRQ affinity, and housekeeping dependencies.
- Keep SMP behind per-CPU scheduler state and review of any path that needs page pinning beyond the AddressSpace-locked copy/read contract.
- Implement the remaining SMP Phase C slices: split shared scheduler metadata, replace the temporary scheduler-owner mask, and collect accepted benchmark evidence.
- Add priority or policy scheduling only once the current authority and IPC semantics have stabilized.
- Add service restart policy outside the static boot graph.