Proposal: Scheduler Evolution

capOS should evolve its scheduler in layers. The goal is not one clever algorithm; it is a capability-shaped CPU subsystem that scales ordinary work, admits realtime islands, allows service/runtime-specific policy, and preserves a small auditable kernel dispatch path.

This proposal complements, rather than replaces, Tickless and Realtime Scheduling. That proposal owns timer/tickless/SQPOLL-nohz details. This proposal owns the broader scheduler architecture and roadmap.

Design Grounding

Local grounding:

Goals

  • Keep protected dispatch, budget enforcement, interrupt handling, and idle in the kernel.
  • Replace the single global runnable queue with per-CPU runnable ownership and bounded cross-CPU wake/migration.
  • Add CPU accounting before adopting policy that depends on runtime charge.
  • Make ordinary best-effort scheduling fair by virtual time, with EEVDF-like virtual-deadline scheduling as the target after accounting exists.
  • Represent admitted CPU time as SchedulingContext capability authority.
  • Represent isolated CPU ownership as CpuIsolationLease authority.
  • Support user-space scheduler policy services for admission and tuning without putting user-space calls on every dispatch path.
  • Provide enough telemetry to distinguish scheduler cost, serial/MMIO logging, TLB/CR3 effects, QEMU/KVM artifacts, and workload contention.

Full-SMP Scalability Focus

The scheduler work after the current Phase F chain should be judged by whether capOS can keep useful throughput and bounded scheduling overhead on 16/32-core machines, not by another small QEMU-only speedup row. The SMP proposal owns CPU bring-up and APIC/TLB substrate; this proposal owns the scheduler changes needed to make that substrate useful at higher core counts.

The scheduler side of the milestone should include:

  • dynamic scheduler CPU sets derived from discovered topology instead of the temporary four-owner mask;
  • per-CPU run queues and current-thread state that do not require one shared lock for ordinary local pick/requeue paths;
  • narrower shared metadata locks for process/thread lookup, blocking waiters, exit cleanup, direct IPC handoff, and timer/deadline waiters;
  • bounded cross-CPU wakeup and migration that records target, source, steal, reschedule-IPI, and failed-placement counters;
  • topology-aware placement that separates physical cores, SMT siblings, and later NUMA/cache groups;
  • total-time accounting for spawn/join/exit and service-bound workloads, not only syscall-free work windows;
  • hardware-run artifacts that include native Linux baselines on the same machine and QEMU rows only as regression or virtualization context.

The benchmark shape should include static map/reduce, uneven dynamic tasks, barrier-heavy phase loops, independent processes, same-process threads, and a capability-call/service-bound workload. That matrix is intentionally broader than the old thread-scale checksum row because high core counts often expose lock convoying, wakeup storms, timer/IPI cost, TLB-shootdown scaling, and runtime lifecycle overhead before pure compute saturates.

Non-Goals

  • Do not import Linux CFS/EEVDF, FreeBSD ULE, or sched_ext as code.
  • Do not expose arbitrary user-supplied scheduler programs in the kernel in the near term.
  • Do not make a user-space process the mandatory next-thread dispatcher.
  • Do not claim hard realtime until admission, budget enforcement, IRQ/device behavior, kernel-path latency, and WCET evidence exist.
  • Do not make nohz/full-nohz a thread flag. It is a CPU lease plus scheduler proof.

Architecture

The target scheduler has four layers:

  • Kernel mechanism: per-CPU run queues, current-thread state, idle, context switch, cross-CPU wake/migration, timer/IPI handling, CPU accounting, budget enforcement, and timeout/depletion faults.
  • Kernel policy primitives: best-effort weights, virtual deadlines, scheduling contexts, CPU masks, isolation leases, direct IPC donation, and realtime-island hooks.
  • Privileged scheduler policy service: admission, budget/profile selection, CPU partitioning, isolation grants, service/runtime hints, policy reload, and operator diagnostics.
  • Application/runtime schedulers: work stealing, actors, async reactors, language M:N schedulers, request queues, and service-local priority and batching.

The hot path remains local and bounded: timer interrupt or wakeup, charge runtime, update runnable state, pick from a per-CPU queue or a bounded steal path, switch context. User-space policy participates at slower boundaries: profile changes, thread/process creation, budget depletion, realtime admission, lease grant/revoke, or explicit operator policy updates.

Stateful task/job graph coordinators sit above these layers. They may own graph node queues, leases, retry state, cancellation, and assignment metadata, but they do not own CPU dispatch. A graph node’s priority, deadline, budget, or queue field is workload policy until a capability-authorized scheduler policy service maps it to a weight, scheduling context, CPU lease, or request deadline.

Stage 0: Evidence Before Policy

Before changing the default policy, the active thread-scale attribution work must keep policy conclusions separated from benchmark artifacts. Current mainline evidence now includes:

  • scheduler candidate/outcome, reschedule-IPI, serial-byte, scheduler-lock, timer interrupt, and CR3/TLB counters behind CAPOS_THREAD_SCALE_GUEST_MEASURE=1;
  • raw guest-PC samples for user-mode timer preemption points;
  • logging-suppression A/B evidence through CAPOS_THREAD_SCALE_SUPPRESS_SWITCH_LOGS=1;
  • exact native Linux pthread baseline evidence, including compact-versus-padded result-slot diagnostics;
  • larger-workload/Amdahl evidence through CAPOS_THREAD_SCALE_TOTAL_BLOCKS and LINUX_THREAD_SCALE_TOTAL_BLOCKS.

This evidence does not prove the primary remaining cause of non-scaling. Per-CPU runnable ownership, accepted work/total speedup thresholds, and optional symbolic guest attribution remain follow-on work before a scheduler policy claim.

This protects the design from treating QEMU/KVM, serial MMIO, or benchmark cache contention as a scheduler algorithm problem.

Stage 1: Per-CPU Runnable Ownership

Split the scheduler’s runnable state first. The accepted initial shape has per-CPU run queues with a runnable ThreadRef deque or priority buckets, current-thread state, a local reschedule flag, and local counters. Shared scheduler state keeps process/thread metadata, sleeping/deadline waiters, blocked waiters, migration records, and the global policy epoch.

Rules:

  • A runnable ThreadRef is owned by exactly one CPU queue at a time.
  • Cross-CPU wake enqueues to the target CPU or a policy-selected CPU and sends a bounded reschedule IPI when needed.
  • Migration removes from one owner before publishing to another.
  • Idle CPUs steal only through bounded policy, not by scanning every process.
  • Process exit and thread exit keep cleanup bounded and must not allocate in interrupt, cancellation, or emergency paths.

This stage may still use round-robin within each CPU queue. The objective is SMP structure and evidence, not perfect fairness.
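The ownership and migration rules above can be illustrated with a small model. This is a sketch under stated assumptions, not the capOS implementation: `RunQueues`, `owner_of`, `enqueue`, and `migrate` are hypothetical names, `ThreadRef` is a stand-in newtype, and the sketch uses `std` with growable queues, whereas the kernel is `no_std` and reserves queue capacity so these paths do not allocate.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the kernel's ThreadRef.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef(pub u32);

/// Illustrative CPU count for the sketch.
pub const SCHEDULER_CPUS: usize = 4;

/// One runnable queue per CPU; a runnable thread lives in at most one of them.
pub struct RunQueues {
    queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS],
}

impl RunQueues {
    pub fn new() -> Self {
        Self { queues: std::array::from_fn(|_| VecDeque::new()) }
    }

    /// Returns the CPU whose queue currently owns `t`, if any.
    pub fn owner_of(&self, t: ThreadRef) -> Option<usize> {
        self.queues.iter().position(|q| q.contains(&t))
    }

    /// Enqueues `t` on `cpu`; refuses if any queue already owns the thread.
    pub fn enqueue(&mut self, cpu: usize, t: ThreadRef) -> bool {
        if cpu >= SCHEDULER_CPUS || self.owner_of(t).is_some() {
            return false;
        }
        self.queues[cpu].push_back(t);
        true
    }

    /// Migrates `t` to `dst`: remove from the source owner first, then publish.
    pub fn migrate(&mut self, t: ThreadRef, dst: usize) -> bool {
        if dst >= SCHEDULER_CPUS {
            return false;
        }
        // Locate the current owner and the thread's position in that queue.
        let found = self.queues.iter().enumerate().find_map(|(cpu, q)| {
            q.iter().position(|x| *x == t).map(|idx| (cpu, idx))
        });
        let Some((src, idx)) = found else { return false };
        // Step 1: remove from the source owner.
        let removed = self.queues[src].remove(idx);
        // Step 2: only then publish to the destination owner.
        match removed {
            Some(thread) => {
                self.queues[dst].push_back(thread);
                true
            }
            None => false,
        }
    }
}
```

Because removal always precedes publication, no intermediate state has the thread visible in two queues, which is the invariant the first and third rules ask for.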

First implementation evidence exists as commit 1a8bf909: capOS introduced four bounded per-scheduler-CPU FIFO runnable queues under the existing global scheduler lock. That slice proved the basic ownership structure and bounded steal path. Follow-up review fixes reserved per-CPU queue capacity before a thread became runnable, using a live reservation count released on process/thread exit or pre-publication rollback, so timer and unblock requeues did not allocate after work moved between CPUs.

Update 2026-05-02: the per-CPU queues were collapsed back into a single global runnable queue under the same scheduler lock with the per-CPU run-queue-collapse cleanup slice (see docs/backlog/scheduler-evolution.md and docs/architecture/scheduling.md).

Update 2026-05-07 23:45 UTC: Phase D Task 3 reintroduced the per-CPU runnable queues, this time ordered ascending by virtual_finish_ns (Weighted Fair Queueing) and balanced by a bounded steal path that picks the most-overdue sibling Runnable candidate (each sibling queue’s first entry the destination CPU considers Runnable; ties broken by lower CPU id). The queue ownership and migration contract is documented in the scheduling architecture page.

This does not close the stage: the scheduler still needs stronger cross-CPU wake counters, further separation from shared process/thread metadata, replacement of temporary pinning policy, and accepted benchmark evidence before policy conclusions should change.

Stage 2: CPU Accounting

Add a monotonic runtime charge model. ThreadCpuAccount records runtime, last-start time, virtual runtime, context switches, preemptions, and voluntary blocks. SchedEntity records weight, latency class, eligible time, and virtual deadline.

Accounting must be stable enough to support fair scheduling, quotas, and future scheduling contexts. It must account context switches, blocking syscalls, endpoint direct handoff, timer preemption, thread exit, and idle.

Where exact cycle attribution is not yet credible, the implementation should label the metric as diagnostic rather than enforcing policy from it.

Stage 3: Best-Effort Fair Policy

Stage 3’s first implementation slice has landed. Phase D passed its Task 6 evidence gate at commit 77caafc0 (2026-05-10 19:39 UTC, docs(scheduler): record phase d thread-scale gate) and closed in docs commit 1a08ec23 (2026-05-10 21:47 UTC, docs(scheduler): close phase d) with weighted fair queueing (WFQ) as the accepted best-effort policy. The controlled Task 6 benchmark pair recorded capOS 1-to-4 work/total speedups 3.088x / 2.700x at 4 workers, materially closing the prior single-global-queue 1.566x / 1.538x diagnostic gap while the matching Linux pthread baseline on the same host and physical-core logical CPUs 0,1,2,3 recorded 3.974x / 3.850x. The completed execution plan is archived at docs/backlog/scheduler-evolution.md.

After Phase D, capOS should move ordinary best-effort scheduling from WFQ toward virtual-time fairness with stronger eligibility semantics only when that follow-on is explicitly selected.

The long-term target policy is EEVDF-like:

  • runnable entities accrue lag against their fair share;
  • eligible entities are ordered by virtual deadline;
  • weights affect virtual runtime/deadline progression;
  • latency-sensitive best-effort entities can request smaller slices within policy limits;
  • migration preserves accounting so moving CPUs does not reset fairness.

The first implementation slice was intentionally narrower than EEVDF: weighted fair queueing on top of the existing per-thread runtime/vruntime accounting. That decision and its accepted evidence are recorded in the next subsection.

Phase D first-policy decision (2026-05-05 19:00 UTC)

Decision: weighted fair queueing (WFQ) for the first Phase D slice; EEVDF remains the deferred follow-on. Recorded against main commit 60e421ab and the 2026-05-02 21:38 UTC thread-scale evidence pair against main commit 374f8556 (capOS work 1.566x versus Linux 3.963x at 1-to-4 on the same physical-core pin set).

Rationale (concise):

  • The 1-to-4 gap is dominated by single-global-queue scheduler-lock contention plus exit/join/block/schedule overhead, not by ordering. Any fair-share policy that successfully consumes a per-CPU split should close most of the gap. The simpler policy reaches that signal sooner with less risk.
  • The existing ThreadCpuAccounting record separates the load-bearing ledger from benchmark diagnostics: runtime_ns, virtual_runtime_ns, and last_started_ns are unconditional, while context_switches, preemptions, voluntary_blocks, migrations, placement history, and blocked/exited stability probes stay behind cfg(feature = "measure"). WFQ needs only a per-thread weight and a virtual finish time derived from the unconditional vruntime; that mapping is direct. EEVDF additionally needs a per-thread request size, lag, eligibility deadline, and an ordered eligible-set structure (BTreeMap by virtual deadline). The runtime/vruntime accounting fields exist, but the eligibility/lag fields do not.
  • The target environment is no_std plus spin::Mutex plus a single global scheduler lock. WFQ keeps the eligibility structure as a bucketed per-CPU FIFO ordered approximately by virtual finish time; that is a familiar VecDeque-shaped data structure that mirrors the current run_queue: VecDeque<ThreadRef> ownership. EEVDF requires an ordered set inside the scheduler-lock-protected dispatch state, which is a larger structural change than the slice the gap evidence motivates.
  • Latency-class differentiation (interactive / batch / IPC server) is expressible in WFQ; Phase D pins the mapping below in the capability-surface section so the implementation slice and the short-sleeper smoke have one rule. The Phase H policy service can layer richer policy on top without requiring a tree representation underneath.
  • Linux moved from CFS to EEVDF in mainline 6.6 (released 2023-10); WFQ has decades of stable OS lineage. Either choice is defensible. The weighted-fair slice does not lock capOS into WFQ permanently — the same accounting fields, capability surface, and migration contract carry directly into EEVDF when the eligibility structure is added.

Rejected alternative: EEVDF-first. It is the stronger long-term policy and Linux’s current default. We are not picking it for the first slice because (1) the eligibility-set data structure is a larger diff that mixes structural change with the per-CPU enqueue reintroduction the 1-to-4 gap evidence already motivates; (2) the lag accounting and request-size ABI are not load-bearing for closing the single-global-queue contention bottleneck the recorded benchmark exposes; (3) moving from WFQ to EEVDF is a localized policy-module change once the capability surface, migration contract, and per-CPU queue split are accepted. The deferred EEVDF follow-on is tracked as a later policy-evaluation slice; it is not a Phase D blocker and does not displace Phase E SchedulingContext, which is the next scheduler authority phase after the accepted WFQ gate.

First-slice scope (smallest implementable surface that closes the 1-to-4 gap):

  • per-thread weight: u16 and latency_class: LatencyClass fields, default values matching the current single-class FIFO behavior; the cap-boundary path rejects weight = 0 and any nonzero value outside [MIN_WEIGHT, MAX_WEIGHT] (Phase D constants) with CapException::InvalidArgument rather than silently clamping, so no later divide-by-zero or overflow path can be reached through setWeight and so callers see policy denial instead of a hidden mutation. The invalidArgument variant landed in ExceptionType alongside SchedulingPolicyCap and LatencyClass with Phase D Task 1 (commit cb8c58b1, 2026-05-07); see docs/proposals/error-handling-proposal.md for the updated client-response taxonomy. The full validation rule lives in the cap-surface authority section below; this bullet records only that the validation runs at the cap boundary, not the dispatch path;
  • per-thread weighted vruntime charging at runtime-charge points: the existing ThreadCpuAccounting.virtual_runtime_ns advances by elapsed_ns * REFERENCE_WEIGHT / weight (instead of the current 1:1 elapsed) on every charge_runtime call. runtime_ns continues to advance 1:1 with elapsed time so monotonic CPU accounting, measure-mode reporting, and snapshot APIs are unchanged. The weighted-vruntime change is the actual fairness mechanism; without it, weights affect only enqueue-order ties rather than cumulative share. This matches the CFS-lineage approach and keeps the WFQ derivation virtual_finish = vruntime + slice * REFERENCE_WEIGHT / weight purely as an ordering aid for the local bucket;
  • per-thread virtual_finish_ns: u64 recomputed at each enqueue from virtual_runtime_ns + slice_ns * REFERENCE_WEIGHT / weight. It is not stored across blocking and is never carried as committed state; it is the per-enqueue ordering tag only;
  • per-CPU bounded run_queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS] (reintroduced) each ordered ascending by virtual_finish_ns; local selection scans the queue by index for the first destination-Runnable entry (RetryLater entries left in place; the first Runnable hit is also the lowest virtual_finish_ns candidate the destination can accept because the queue is ordered), then falls back to a bounded steal scan of sibling per-CPU queues;
  • scheduler-lock-contained migration that keeps virtual_runtime_ns with the thread (per-thread state, not per-CPU) and re-inserts on the destination CPU at the post-migration virtual finish time;
  • a capability-authorized policy path (see §“Phase D capability surface” below) that gates weight/latency-class mutation and reads;
  • one-bisect-cycle single-global-queue fallback under CAPOS_SCHED_DISABLE_WFQ=1, now retired by Phase E preflight before SchedulingContext schema work.
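The charging and tagging rules in the scope list can be sketched as two small functions. This is a minimal illustration, not the kernel code: the value chosen for `REFERENCE_WEIGHT` is an assumption (the text names the constant but does not state it), the `scale` helper is hypothetical, and the struct carries only the two ledger fields the formulas use.

```rust
/// Assumed value for illustration; the source names the constant only.
pub const REFERENCE_WEIGHT: u64 = 1024;

/// Reduced ledger with just the fields the two formulas touch.
#[derive(Debug, Default, Clone, Copy)]
pub struct ThreadCpuAccounting {
    pub runtime_ns: u64,
    pub virtual_runtime_ns: u64,
}

/// Hypothetical helper: ns * REFERENCE_WEIGHT / weight, widened to u128 so
/// the multiplication cannot overflow, then saturated back into u64.
/// `weight` must be nonzero; the cap boundary is what guarantees that.
fn scale(ns: u64, weight: u16) -> u64 {
    debug_assert!(weight != 0);
    let scaled = (ns as u128) * (REFERENCE_WEIGHT as u128) / (weight as u128);
    scaled.min(u64::MAX as u128) as u64
}

/// Runtime advances 1:1 with elapsed time; vruntime advances weighted.
pub fn charge_runtime(acct: &mut ThreadCpuAccounting, elapsed_ns: u64, weight: u16) {
    acct.runtime_ns = acct.runtime_ns.saturating_add(elapsed_ns);
    acct.virtual_runtime_ns = acct
        .virtual_runtime_ns
        .saturating_add(scale(elapsed_ns, weight));
}

/// Per-enqueue ordering tag; recomputed at each enqueue, never stored as
/// committed state.
pub fn virtual_finish_ns(acct: &ThreadCpuAccounting, slice_ns: u64, weight: u16) -> u64 {
    acct.virtual_runtime_ns.saturating_add(scale(slice_ns, weight))
}
```

With the assumed reference weight of 1024, a thread at weight 2048 accrues vruntime at half the wall-clock rate, so over a long window it is selected about twice as often as a weight-1024 sibling.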

The first slice is accepted: the 2026-05-10 19:46 UTC make run-thread-scale evidence pair recorded in docs/changelog.md and docs/benchmarks.md passed the harness-enforced 1-to-2 work/total gates, and Phase D manually accepted the recorded 1-to-4 work/total diagnostics for closeout. The historical success threshold lives in docs/backlog/scheduler-evolution.md.

Phase D capability surface (kernel-side authority, no ambient process fields)

Per docs/capability-model.md “the interface IS the permission”, weight and latency-class authority is granted by giving a process a SchedulingPolicyCap with the appropriately scoped target. The kernel rejects any state mutation that does not arrive through such a cap.

The schema landed with Phase D Task 1 (commit cb8c58b1, 2026-05-07). The original sketch took a target :ThreadHandle per method, but the landed methods carry no target argument because Phase D associates the target through cap state, not a per-method handle parameter. Phase D Task 2 (closeout 2026-05-07 22:51 UTC) selected the context-derived caller-thread fallback binding from the three sketched options: every method routes to the calling thread, looked up through CapCallContext::caller_thread. The kernel cap object remains zero-sized (SchedulingPolicyCap); routing moved from call to call_with_context so the dispatch path sees the caller’s ThreadRef.

There is no per-cap-object ThreadHandle, no badge-encoded thread id, and no cross-thread or cross-process mutation in this slice; per-cap-object target references and badge-encoded thread ids are reserved for the Phase H privileged scheduler policy service that will need cross-thread authority. Today the manifest grant path therefore authorizes the holder’s own threads in the strict sense: a holder cannot reach another thread’s weight or latency_class through this cap.

Schema:

enum LatencyClass {
    interactive @0;
    normal      @1;
    batch       @2;
    ipcServer   @3;
}

interface SchedulingPolicyCap {
    setWeight @0 (weight :UInt16) -> ();
    setLatencyClass @1 (class :LatencyClass) -> ();
    snapshot @2 ()
        -> (weight :UInt16, class :LatencyClass,
            runtimeNs :UInt64, virtualRuntimeNs :UInt64);
}

The snapshot return is intentionally narrow: the four fields it exposes (weight, class, runtimeNs, virtualRuntimeNs) are the ones the WFQ slice promotes out of cfg(feature = "measure") unconditionally. The benchmark-only counters (context_switches, preemptions, voluntary_blocks, migrations) stay behind the measure feature because they are not load-bearing for ordering and remain useful only for benchmark instrumentation; a future operator-observability slice can add them to a separate snapshot cap once a non-emergency-path storage and reporting surface exists.

Authority rules:

  • setWeight and setLatencyClass are kernel-checked: an SQE invocation must carry a live SchedulingPolicyCap. The methods carry no per-call ThreadHandle; the target binding (selected in Phase D Task 2) is the context-derived caller-thread fallback: the kernel routes through CapCallContext::caller_thread, so a holder can only mutate its own running thread by construction. If a future cross-process grant lets a holder invoke the cap without authority over its bound target, the call fails closed through the standard cap-revocation transport-error path (the disconnected-class CapException produced by the ring dispatcher when the cap is revoked or stale); the ExceptionType taxonomy has no Denied variant by design.
  • setWeight validates the input at the cap boundary, not at the dispatch path. The validation rule is: weight = 0 (which would make the WFQ derivation slice_ns * REFERENCE_WEIGHT / weight divide by zero) is rejected with CapException::InvalidArgument; any nonzero value outside [MIN_WEIGHT, MAX_WEIGHT] (Phase D constants) is also rejected with CapException::InvalidArgument. The kernel does not silently clamp out-of-range values, because a silent clamp masks caller bugs and hides cap-boundary policy from the audit surface. The invalidArgument variant landed in ExceptionType with Phase D Task 1 (commit cb8c58b1, 2026-05-07); the updated client-response taxonomy is in docs/proposals/error-handling-proposal.md.
  • The bootstrap SchedulingPolicyCap is granted by manifest only. Its initial domain is Self (the holder’s own threads). Wider authority (cross-process weight/class mutation) belongs to the Phase H privileged scheduler policy service; Phase D does not promise that grant in the default boot manifest. Phase D manifests grant only the focused-proof scope needed for the test-matrix smokes.
  • Default policy: a thread without any explicit cap-driven mutation carries weight = DEFAULT_WEIGHT and latency_class = LatencyClass::Normal. Behavior with all defaults must preserve the pre-Phase-D default workload behavior at the limit (no fairness regressions for unmodified workloads).
  • Stale-cap revoke: SchedulingPolicyCap mutations carry the generation/epoch model used elsewhere. A weight change submitted after the cap is revoked fails closed; partially applied changes on a thread that exits between SQE arrival and dispatch fail with the standard Stale outcome and do not leak weight state.
  • The cap surface is a single typed interface; restriction is by granting a narrower wrapper (e.g., SchedulingPolicyCap whose authority domain is exactly one ThreadHandle). The kernel does not carry a parallel rights bitmask.
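The setWeight validation rule can be sketched as a single check that runs before any scheduler state is touched. The numeric bounds below are placeholders: the text says MIN_WEIGHT and MAX_WEIGHT are Phase D constants but does not give their values, and the one-variant `CapException` enum here is a reduced stand-in for the real exception type.

```rust
/// Placeholder bounds; the actual Phase D constants are not stated in the text.
pub const MIN_WEIGHT: u16 = 16;
pub const MAX_WEIGHT: u16 = 8192;

/// Reduced stand-in for the kernel exception type.
#[derive(Debug, PartialEq, Eq)]
pub enum CapException {
    InvalidArgument,
}

/// Cap-boundary validation: reject, never clamp.
pub fn validate_weight(weight: u16) -> Result<u16, CapException> {
    // Zero is called out separately because it would divide by zero in
    // slice_ns * REFERENCE_WEIGHT / weight.
    if weight == 0 {
        return Err(CapException::InvalidArgument);
    }
    // Any nonzero value outside [MIN_WEIGHT, MAX_WEIGHT] is also rejected.
    if !(MIN_WEIGHT..=MAX_WEIGHT).contains(&weight) {
        return Err(CapException::InvalidArgument);
    }
    Ok(weight)
}
```

Returning an error instead of clamping keeps the caller's mistake visible and means the dispatch path can assume a nonzero, in-range weight without rechecking it.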

Latency-class semantics for Phase D (pinned mapping):

  • LatencyClass::Normal is the baseline; weight alone determines the WFQ share. The selected slice_ns is the Phase D default quantum.
  • LatencyClass::Interactive reduces the per-enqueue slice contribution by a Phase D constant (INTERACTIVE_SLICE_DIVISOR; Phase D Task 2 ships 2): the WFQ derivation becomes vruntime + (slice_ns / INTERACTIVE_SLICE_DIVISOR) * REFERENCE_WEIGHT / weight. This places the entity earlier in the per-CPU queue on each enqueue, so a short-sleeper that wakes on a Timer completion runs ahead of a same-weight CPU hog within the same scheduling window. The cumulative share is unchanged because vruntime accounting still advances at elapsed_ns * REFERENCE_WEIGHT / weight; the class only affects the per-enqueue tag, not the runtime-charge step.
  • LatencyClass::Batch increases the per-enqueue slice contribution by a Phase D constant (BATCH_SLICE_MULTIPLIER; Phase D Task 2 ships 4): the derivation becomes vruntime + (slice_ns * BATCH_SLICE_MULTIPLIER) * REFERENCE_WEIGHT / weight. This places the entity later in the per-CPU queue on each enqueue, so a CPU hog at LatencyClass::Batch yields wake-to-run latency to LatencyClass::Normal and LatencyClass::Interactive siblings without losing its weighted share over a long window.
  • LatencyClass::IpcServer is treated identically to LatencyClass::Normal for the WFQ ordering tag in this slice. The class exists in the ABI so a Phase H policy service can later re-bind direct-IPC preference, server affinity, or scheduling-context donation rules without an ABI break; Phase D does not change the existing direct-IPC preference slot semantics for this class.
  • The class is stored on Thread and read at every enqueue. A class change through setLatencyClass is observed on the next enqueue (next dequeue + re-enqueue, or next wake from blocked). No retroactive recomputation of an in-queue tag.
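The pinned mapping can be written as one function from class to effective slice, followed by the same tag derivation for every class. This is a sketch: the divisor 2 and multiplier 4 are the values the text says Phase D Task 2 ships, while the `REFERENCE_WEIGHT` value and the function names `effective_slice_ns` and `enqueue_tag` are assumptions made for the example.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LatencyClass {
    Interactive,
    Normal,
    Batch,
    IpcServer,
}

/// Assumed value for illustration; the source names the constant only.
pub const REFERENCE_WEIGHT: u64 = 1024;
/// Values the text reports as shipped by Phase D Task 2.
pub const INTERACTIVE_SLICE_DIVISOR: u64 = 2;
pub const BATCH_SLICE_MULTIPLIER: u64 = 4;

/// The class only changes the slice contribution used for the ordering tag.
pub fn effective_slice_ns(slice_ns: u64, class: LatencyClass) -> u64 {
    match class {
        LatencyClass::Interactive => slice_ns / INTERACTIVE_SLICE_DIVISOR,
        LatencyClass::Batch => slice_ns.saturating_mul(BATCH_SLICE_MULTIPLIER),
        // IpcServer is ordered exactly like Normal in this slice.
        LatencyClass::Normal | LatencyClass::IpcServer => slice_ns,
    }
}

/// Per-enqueue tag: vruntime + effective_slice * REFERENCE_WEIGHT / weight.
/// `weight` is assumed nonzero (validated at the cap boundary).
pub fn enqueue_tag(vruntime_ns: u64, slice_ns: u64, weight: u16, class: LatencyClass) -> u64 {
    let eff = effective_slice_ns(slice_ns, class) as u128;
    let contribution = eff * (REFERENCE_WEIGHT as u128) / (weight as u128);
    vruntime_ns.saturating_add(contribution.min(u64::MAX as u128) as u64)
}
```

For equal vruntime and weight, the tags order Interactive before Normal and IpcServer, and Batch last. Nothing here touches the runtime-charge step, which is why the cumulative share stays the same across classes.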

Phase D does not build the userspace policy service (Phase H). It adds the kernel-side primitive that Phase H will consume. SchedulingContext (Phase E) is a separate authority for budget/period/CPU mask; weight/latency-class is the WFQ ordering knob, not CPU-time authority. The two cap surfaces stay disjoint.

Phase D migration fairness sketch

A thread migrating from CPU A to CPU B mid-quantum must preserve its share. Rules:

  • virtual_runtime_ns is per-thread, not per-CPU. It travels with the thread on every migration. The accounting record already encodes that (ThreadCpuAccounting.virtual_runtime_ns lives on Thread, not on a CPU slot). Phase D promotes that field out of cfg(feature = "measure") and changes the charge_runtime step so the field advances by elapsed_ns * REFERENCE_WEIGHT / weight rather than 1:1 with elapsed time; the migration contract is otherwise unchanged.
  • Per-CPU local clocks are not used as a vruntime reference. The scheduler reads the global monotonic clocksource through crate::arch::context::monotonic_ns(), the same source the unconditional runtime/vruntime ledger uses. There is no per-CPU clock offset because there is no per-CPU vruntime reference.
  • virtual_finish_ns is recomputed at enqueue on the destination CPU from the destination weight, not carried as committed state. The migration step is remove-from-source, recompute, insert-at-destination; the scheduler lock is held for the whole window.
  • Cross-CPU steal: a CPU whose local queue has no runnable entry walks sibling per-CPU queues. For each sibling queue the scan walks indices ascending and stops at that queue’s first entry the destination CPU considers Runnable; because each queue is ordered ascending by virtual_finish_ns, the first Runnable hit per queue is the lowest virtual_finish_ns candidate the destination can accept on that source. The steal target is then the source queue whose first-Runnable candidate has the lowest virtual_finish_ns globally — the same fair-share rule the local pick uses (most overdue first) — with ties broken by lower CPU id. The chosen entry is removed from its actual position on the source queue (not necessarily the head: a RetryLater or single-CPU-owner thread may sit at the front and stay there); the destination recomputes virtual_finish_ns and inserts at the destination ordered position. The steal is allocation-free because both queues are pre-reserved against the live runnable count.
  • The ThreadCpuAccounting.migrations counter is incremented on each cross-CPU enqueue, both for placement-time spread and for steal. The behavior mirrors the prior pre-collapse counter; the Phase D slice keeps it under cfg(feature = "measure") until a permanent operator snapshot path lands.
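The steal-target rule described above (first Runnable entry per sibling queue, lowest virtual_finish_ns across siblings, ties to the lower CPU id) can be sketched as a pure selection function. The `Entry` struct, its `runnable_on_dest` flag, and `pick_steal` are hypothetical; the real scheduler decides runnability per destination CPU, and this sketch does not model the removal and re-insertion steps.

```rust
use std::collections::VecDeque;

/// Hypothetical queue entry for the sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Entry {
    pub thread_id: u32,
    pub virtual_finish_ns: u64,
    /// Whether the destination CPU would consider this entry Runnable.
    pub runnable_on_dest: bool,
}

/// Returns (source_cpu, index_in_source_queue) of the entry to steal.
/// Each queue is assumed to be ordered ascending by virtual_finish_ns.
pub fn pick_steal(queues: &[VecDeque<Entry>], dest_cpu: usize) -> Option<(usize, usize)> {
    let mut best: Option<(u64, usize, usize)> = None;
    for (cpu, queue) in queues.iter().enumerate() {
        if cpu == dest_cpu {
            continue; // the local queue was already found empty of runnable work
        }
        // Stop at the first entry the destination can run; entries in front
        // of it (for example RetryLater ones) stay where they are.
        let first = queue.iter().enumerate().find(|(_, e)| e.runnable_on_dest);
        if let Some((idx, entry)) = first {
            // Strict '<' while scanning CPUs in ascending order means a tie
            // keeps the lower CPU id.
            let better = match best {
                None => true,
                Some((finish, _, _)) => entry.virtual_finish_ns < finish,
            };
            if better {
                best = Some((entry.virtual_finish_ns, cpu, idx));
            }
        }
    }
    best.map(|(_, cpu, idx)| (cpu, idx))
}
```

The returned index may be greater than zero, matching the note that the chosen entry is removed from its actual position rather than necessarily from the head of the source queue.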

The one-bisect-cycle single-global-queue fallback has been retired before Phase E. The accepted Phase D behavior is now always the per-CPU WFQ queue shape described above.

Phase D test matrix

Workload shapes the implementation slice verified before close:

  • CPU hogs (existing make run-thread-scale). Equal-weight same-process threads must split CPU share within bench tolerance. Different-weight threads must split CPU share approximately in proportion to weights (e.g., weights 2:1 → roughly 2:1 runtime ratio). Phase D manually accepted the recorded 1-to-4 diagnostic at 3.088x work speedup versus the recorded 1.566x baseline.
  • Short sleepers. Threads that block on Timer.sleep for short intervals must preempt CPU hogs within a bound of one quantum after wake. Latency-class Interactive should have lower observed wake-to-run latency than latency-class Batch. Phase D closed this with focused make run-thread-fairness and make run-thread-fairness-interactive QEMU smokes.
  • Direct IPC server/client pairs (existing make run-spawn). An IPC server thread woken by an endpoint CALL must keep paired-call timing comparable to the current direct-IPC handoff. The direct-IPC preference slot must keep its existing generation-checked semantics under WFQ; a server should not starve when the global vruntime advances on other CPUs.
  • Multi-process load (existing make run-smp-process-scale). Independent worker processes with default weights must preserve the recorded 2026-04-30 1.6x 1-to-2 gate. WFQ across processes (no shared address space) must not regress that proof.
  • Same-process sibling load. This is the same workload shape as make run-thread-scale; it doubles as the per-CPU-queue reintroduction proof.

The exact historical per-workload acceptance numbers live in docs/backlog/scheduler-evolution.md.

Phase D overload behavior

Soft overload (runnable entities × weight exceeds the selected CPU set’s capacity):

  • Each entity gets less than its weighted share. No entity is starved; vruntime ordering guarantees that the most-behind thread runs next.
  • The scheduler does not refuse to enqueue. Phase D’s WFQ does not implement strict admission; that belongs to Phase E (SchedulingContext budget/period) and Phase G (RealtimeIsland admission).

Hard overload (e.g., a RealtimeIsland admission attempt that collides with an active CpuIsolationLease):

  • Use the existing isolation/admission path; Phase D defers to Phase F’s CpuIsolationLease and Phase G’s RealtimeIsland for that behavior. WFQ continues to schedule best-effort work on the housekeeping CPU set.
  • If an isolation lease holds CPU N and N has runnable best-effort work that cannot migrate (e.g., bound by manifest pinning), the lease attempt fails closed; existing CPU-mask validation remains the gate. Phase D does not introduce new pinning policy.

Strict admission, deadline overrun, and budget depletion are explicitly out of scope for Phase D and stay in Phase E/G.

Stage 4: Scheduling Contexts

CPU-time authority becomes a capability. SchedulingContext records budget, period, relative deadline, priority or criticality, CPU mask, remaining budget, replenishment state, timeout endpoint, and overrun policy.

The landed Phase E slices remain narrower than the full target above. The ABI now has SchedulingContextSpec authority inputs for budgetNs, periodNs, relativeDeadlineNs, byte-oriented cpuMask, and overrunPolicy, plus a read-only SchedulingContextInfo snapshot with context identity, lifecycle state, binding state, remaining budget, and an explicit dispatch-effect label. SchedulingContext.info() remains method id 0. SchedulingContext.create() creates a same-interface result cap for a validated spec, bindCallerThread() records one caller-thread binding for the current generation, and revoke() advances the generation and clears the matching thread metadata binding. Bootstrap-granted contexts and contexts returned by create() draw from the same non-wrapping context-id allocator, so the (contextId, generation) binding key does not alias distinct cap objects.

Bound active contexts now install a fixed per-thread dispatcher budget ledger: runtime charge decrements remainingBudgetNs, runnable selection replenishes elapsed periods, and exhausted contexts remain queued but ineligible until the next replenishment period. The effect label is budgetEnforced for active contexts and stays infoOnlyNoDispatchChange for stale/revoked fail-closed paths.

Deadline-driven accounting now arms a sub-tick budget-exhaustion one-shot when the selected thread’s remaining budget would deplete before the next periodic scheduler tick, and nohz re-arm folds the leased thread’s budget deadline into its existing nearest-deadline timer. When a budget one-shot fires in kernel mode, a live periodic timer is restored before returning to kernel code, so the ordinary and tick-masked paths no longer rely on a full tick quantum to observe budget depletion.

Synchronous endpoint donation/return now covers passive receiver threads:

  • endpoint in-flight state carries an internal donation token;
  • receiver runtime charges to the caller-donated context;
  • RETURN, application-exception RETURN, or invalid-result RETURN restores the reduced budget to the caller before caller wake;
  • a donor with an in-flight token is blocked from returning to userspace until RETURN/cancel, using an atomic marker-to-block transition that treats already-returned fast paths as normal completion;
  • nested donation of an already donated context is rejected until stacked return tokens have a dedicated design.

Timeout/depletion notifications now use fixed per-context cells allocated at context creation/bootstrap. The cells coalesce budget-depleted and deadline-or-timeout events with typed sequence/count metadata, holder identity, remaining budget, next timestamp, donated-holder marking, and explicit-revoke lifecycle state, and they report ok/revoked/staleGeneration observer results through SchedulingContext.drainNotifications(). Notification publishing does not allocate in scheduler hard paths, publish result caps, append unbounded queues, donate budget, reorder runnable entities, bypass throttling, or imply nohz behavior. A pre-armed observer waiter/wakeup path, realtime admission, SQPOLL, nohz, and CPU placement enforcement remain future work.

Stale caps report staleGeneration and cannot mutate the new generation’s scheduler metadata or budget ledger; revoked contexts report revoked. Ordinary non-donated session logout now uses the same stale-generation rule: after UserSession.logout() flips the liveness cell, the scheduler removes matching non-donated bound thread contexts and marks the old cap generation stale. The focused session-context proof covers stale info, bindCallerThread, create, revoke, and notification-drain behavior without result-cap publication or metadata mutation.

Donated receiver logout keeps the conservative skip policy: if logout observes a receiver thread holding an endpoint-donated context, the hook counts the skipped donated binding and leaves the donor blocked until endpoint RETURN/cancel commits cleanup. The focused session-context proof covers the RETURN case by showing that the receiver logs out while holding the donation, the donor stays blocked, the hook reports donation_inflight_skipped=1, and the caller observes a bound context with reduced remaining budget after RETURN rather than fresh budget. Clean local owner-shell exit now calls the held UserSession.logout() before process exit, and the shell smoke observes the same scheduler hook with no bound local shell SchedulingContext.
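The budget ledger and sub-tick one-shot rules described earlier in this stage can be sketched as follows. This is a hedged illustration only: `BudgetLedger`, its field names, and the choice to refill the full budget once any whole period has elapsed are assumptions for the example, not the kernel's actual data layout or replenishment policy.

```rust
/// Hypothetical per-thread budget ledger; names are illustrative.
#[derive(Debug)]
struct BudgetLedger {
    budget_ns: u64,
    period_ns: u64,
    remaining_budget_ns: u64,
    /// Start of the current replenishment period.
    period_start_ns: u64,
}

impl BudgetLedger {
    fn new(budget_ns: u64, period_ns: u64, now_ns: u64) -> Self {
        assert!(period_ns > 0 && budget_ns <= period_ns);
        Self {
            budget_ns,
            period_ns,
            remaining_budget_ns: budget_ns,
            period_start_ns: now_ns,
        }
    }

    /// Runtime charge decrements the remaining budget, saturating at zero.
    fn charge(&mut self, ran_ns: u64) {
        self.remaining_budget_ns = self.remaining_budget_ns.saturating_sub(ran_ns);
    }

    /// Called from runnable selection: refill if whole periods have elapsed.
    fn replenish(&mut self, now_ns: u64) {
        let elapsed = now_ns.saturating_sub(self.period_start_ns);
        let periods = elapsed / self.period_ns;
        if periods > 0 {
            // periods * period_ns <= elapsed, so this cannot pass now_ns.
            self.period_start_ns += periods * self.period_ns;
            self.remaining_budget_ns = self.budget_ns;
        }
    }

    /// An exhausted context stays queued but is not eligible to run.
    fn eligible(&self) -> bool {
        self.remaining_budget_ns > 0
    }

    /// Returns the one-shot deadline if the budget would deplete before
    /// the next periodic tick; otherwise the periodic tick is sufficient.
    fn budget_one_shot(&self, now_ns: u64, next_tick_ns: u64) -> Option<u64> {
        let depletion = now_ns.checked_add(self.remaining_budget_ns)?;
        if self.eligible() && depletion < next_tick_ns {
            Some(depletion)
        } else {
            None
        }
    }
}
```

In this sketch a thread with 0.5 ms of budget left and a tick 2.5 ms away gets a one-shot at the depletion instant, so depletion is observed without waiting for a full tick quantum.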

cpuMask is a canonical little-endian bitset. CPU n maps to bit n % 8 of byte n / 8, with bit 0 as the least-significant bit of each byte. Empty data means no CPUs are selected, not “all CPUs”; future admission/bind validation rejects empty masks for runnable contexts. Producers omit trailing zero bytes: the all-zero set is encoded as empty, and any non-empty canonical mask has a nonzero final byte. This slice only snapshots that shape and does not enforce placement from it.
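The encoding rule above is small enough to state as code. The following sketch uses hypothetical helper names (`encode_cpu_mask`, `is_canonical`, `cpu_in_mask`); only the bit layout and the canonical-form rule come from the text.

```rust
/// Encode CPU indices as the canonical byte mask:
/// CPU n maps to bit n % 8 (bit 0 = least significant) of byte n / 8.
fn encode_cpu_mask(cpus: &[usize]) -> Vec<u8> {
    let mut bytes: Vec<u8> = Vec::new();
    for &cpu in cpus {
        let byte = cpu / 8;
        if byte >= bytes.len() {
            bytes.resize(byte + 1, 0);
        }
        bytes[byte] |= 1u8 << (cpu % 8);
    }
    // The vector only grows when a bit is about to be set in the new final
    // byte, so the result never carries trailing zero bytes.
    bytes
}

/// A mask is canonical when it is empty or its final byte is nonzero.
fn is_canonical(mask: &[u8]) -> bool {
    mask.last().map_or(true, |&b| b != 0)
}

/// Membership test; CPUs beyond the encoded bytes are not selected.
fn cpu_in_mask(mask: &[u8], cpu: usize) -> bool {
    mask.get(cpu / 8)
        .map_or(false, |&b| b & (1u8 << (cpu % 8)) != 0)
}
```

For example, CPUs 0, 7, and 8 encode as the two bytes 0x81 0x01, and the empty set encodes as zero bytes, which a consumer must read as "no CPUs" rather than "all CPUs".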

Remaining kernel responsibilities:

  • prevent a thread without eligible CPU authority from running;
  • charge runtime to exactly one authority target;
  • add any pre-armed timeout/depletion observer wake path without allocating in emergency paths.

Policy-service responsibilities:

  • admit or reject scheduling contexts;
  • choose budget/period/priority;
  • bind contexts to threads/services;
  • revoke or adjust contexts safely;
  • record operator-visible decisions.

SQE.deadline_ns remains request metadata. It may influence drop, freshness, propagation, and telemetry, but it does not grant CPU budget.

Stage 5: CPU Isolation Leases and SQPOLL

CpuIsolationLease grants placement and exclusivity, not CPU time. It records the owner process/session/service, CPU set, mode, housekeeping exclusions, accounting target, maximum revocation latency, and revoke endpoint. The current Phase F implementation keeps ticks periodic but makes housekeeping/deferred-work placement explicit: at least one online scheduler housekeeping CPU must remain outside active lease candidates, and preflight telemetry routes or rejects deferred cleanup, timer/deadline, network polling, IRQ affinity, scheduler accounting, and cleanup latency before later SQPOLL or nohz behavior can use the lease.

The Phase F substrate landed so far is:

  • the one-SQ-consumer ring-ownership prerequisite that lets nohz/SQPOLL reason about a single submission consumer per ring;
  • nohz activation telemetry that labels admit/reject decisions, rollback reasons, and current periodic-tick fallback state without changing dispatch behavior;
  • housekeeping/deferred-work placement preflight, which fail-closes when unrelated timers, deferred cleanup, network polling, debug/watchdog work, or IRQ delivery would otherwise be pinned to a candidate isolated CPU;
  • a bounded SQPOLL ring-mode worker (MAX_SQPOLL_WORKERS = 16) that records tick_suppression=disabled / full_nohz=disabled strings while the activation proof is still open, with generation-checked stale-owner rollback;
  • a clockevent/deadline substrate independent of the periodic tick, so the scheduler can express “wake at deadline T” without depending on periodic ticks to enforce budget;
  • a bounded non-periodic SQPOLL producer-wake progress path that lets a parked SQPOLL worker make forward progress on producer activity without reverting to a periodic tick.

Automatic nohz activation (actually suppressing the periodic scheduler tick on an admitted CPU and restoring it on rollback/revoke/stale generation) was closed for the first increment via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md. The CpuIsolationLease preflight now performs real per-CPU periodic-tick suppression for the narrow single-runnable-entity window, with fail-closed rollback. It satisfies the proof obligations for a single runnable entity on the target CPU, a ready housekeeping CPU outside the lease, non-local deferred-cleanup/timer/network/IRQ dependencies, a valid accounting target, bounded revocation latency, and generation-checked ring ownership.

SQPOLL-driven auto-nohz activation is also closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md: a ring-coupled kernelSqpoll lease whose bound ring is in SQPOLL running/sleeping mode with a live owner is admitted for tick suppression, with the SQPOLL ring-state re-check as the decisive rollback gate. The tick_suppression, auto_nohz, and sqpoll telemetry counters reflect real suppression.

Generic full-nohz for ordinary budgeted compute threads is now admitted by explicit SchedulingContext-targeted CpuIsolationLease preflight; production realtime island admission remains deferred independently of these closed tasks.

Activation requires scheduler proof:

  • at least one housekeeping CPU remains online;
  • unrelated timers, deferred cleanup, network polling, and debug/watchdog work are not pinned to the isolated CPU;
  • the active ring has exactly one SQ consumer;
  • the accounting target is valid and chargeable;
  • revocation latency fits the lease policy.
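The activation conditions listed above can be read as a fail-closed preflight: any single failed condition rejects activation. The sketch below is illustrative only. `Preflight`, `Reject`, `check_activation`, and the use of a 64-bit CPU bitmask are assumptions for the example and do not describe the kernel's real types.

```rust
/// Hypothetical rejection reasons, one per activation condition.
#[derive(Debug, PartialEq, Eq)]
enum Reject {
    NoHousekeepingCpu,
    UnrelatedWorkPinned,
    RingConsumerCount,
    AccountingTargetInvalid,
    RevocationLatencyTooHigh,
}

/// Hypothetical preflight inputs; CPU sets are 64-bit masks in this sketch.
#[derive(Clone, Copy, Debug)]
struct Preflight {
    online_cpus: u64,
    lease_cpus: u64,
    /// CPUs with unrelated timers, deferred cleanup, network polling,
    /// or debug/watchdog work pinned to them.
    pinned_unrelated_work: u64,
    sq_consumers: u32,
    accounting_target_chargeable: bool,
    revocation_latency_ns: u64,
    max_revocation_latency_ns: u64,
}

/// Returns Ok only when every condition holds; otherwise the first failure.
fn check_activation(p: &Preflight) -> Result<(), Reject> {
    // At least one online CPU must stay outside the lease for housekeeping.
    if (p.online_cpus & !p.lease_cpus) == 0 {
        return Err(Reject::NoHousekeepingCpu);
    }
    // Unrelated work must not be pinned to an isolated CPU.
    if (p.pinned_unrelated_work & p.lease_cpus) != 0 {
        return Err(Reject::UnrelatedWorkPinned);
    }
    // The active ring must have exactly one SQ consumer.
    if p.sq_consumers != 1 {
        return Err(Reject::RingConsumerCount);
    }
    if !p.accounting_target_chargeable {
        return Err(Reject::AccountingTargetInvalid);
    }
    if p.revocation_latency_ns > p.max_revocation_latency_ns {
        return Err(Reject::RevocationLatencyTooHigh);
    }
    Ok(())
}
```

Returning a typed reason rather than a bare boolean matches the telemetry requirement that reject decisions and rollback reasons be labeled.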

The scheduler idle path is now a per-CPU CPL0 (kernel-mode) idle thread; the user-mode idle process was removed in commit e3c0df01 (2026-05-14 UTC). There are two CPL0 idle paths: the cooperative boot/AP path, which executes hlt at CPL0 on the per-CPU kernel stack, and the steady-state idle-thread path reached from the four dispatch sites (schedule, capos_block_current_syscall, exit_current, exit_current_thread). Both are described in detail in Scheduling.

SQPOLL uses the ring-mode contract in Tickless and Realtime Scheduling. The scheduler proposal adds the CPU-ownership and policy-service side of that contract.

Stage 6: Realtime Islands

A RealtimeIsland is an admitted graph, not a single priority. It records scheduling contexts, memory reservations, device and IRQ reservations, rings/endpoints/notifications, any CPU isolation leases, admission evidence, and overrun/shutdown policy.

Use cases include local audio, realtime voice, robotics control, and selected provider/runtime loops. Admission must fail closed if the graph cannot fit the declared period/quantum and reservations.
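As one illustration of fail-closed admission, the sketch below applies a simple per-CPU utilization bound: the sum of budget/period over the island's contexts must not exceed the available capacity. This proposal does not specify the admission test, so the bound itself, the `ContextSpec` type, the `admit` function, and the parts-per-million scaling are all assumptions for the example. A real island would also have to check memory, device, and IRQ reservations.

```rust
/// Hypothetical budget/period pair for one scheduling context in the graph.
struct ContextSpec {
    budget_ns: u64,
    period_ns: u64,
}

/// Utilization is expressed in parts per million of one CPU.
const PPM: u128 = 1_000_000;

/// Admits the graph only if every spec is well formed and the summed
/// utilization fits within capacity_ppm. Any doubt rejects (fail closed).
fn admit(specs: &[ContextSpec], capacity_ppm: u128) -> bool {
    let mut total: u128 = 0;
    for s in specs {
        if s.period_ns == 0 || s.budget_ns == 0 || s.budget_ns > s.period_ns {
            return false;
        }
        let budget = s.budget_ns as u128;
        let period = s.period_ns as u128;
        // Round up so integer rounding never admits an over-committed graph.
        let util = (budget * PPM + period - 1) / period;
        total += util;
        if total > capacity_ppm {
            return false;
        }
    }
    true
}
```

Two contexts of 2 ms and 3 ms per 10 ms period sum to 50% of one CPU and are admitted against full capacity; 6 ms plus 5 ms per 10 ms period exceeds it and is rejected.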

Stage 7: User-Space Scheduler Policy

After kernel primitives are in place, a privileged scheduler policy service can own:

  • default resource profiles;
  • session/account/service CPU policy;
  • scheduling-context admission;
  • CPU lease grant/revoke;
  • runtime hints such as latency-sensitive, batch, driver, poller, or agent;
  • AutoNoHz placement for ordinary threads that appear capable of utilizing a full CPU core (see Policy-Service Userstories in tickless-realtime-scheduling-proposal);
  • operator-facing diagnostics and policy reload.

AutoNoHz placement is the policy-service surface that turns the “thread appears capable of utilizing a full CPU core” observation into a bounded CpuIsolationLease against a pre-authorized account or session CPU pool. The lease adds isolation; it does not mint CPU-time authority. The thread still consumes time through its existing SchedulingContext (or coarse ResourceLedger); the lease just removes tick and scheduler noise while that budget is being consumed. Bounds the policy service must enforce on every auto-issued lease – lifetime, revocation latency, accounting target, auto-claim pool capacity, and fairness preemption – are detailed in the tickless proposal.

The kernel still owns emergency fallback. If the policy service is dead, blocked, stale, or malicious, the kernel must continue to enforce safety, revoke leases as policy permits, and schedule a minimal recovery path.

Validation Gates

  • Per-CPU queue work must preserve run-smoke, run-spawn, run-thread-scale, park/ring/process-exit smokes, and SMP smokes.
  • A thread-scale milestone closeout must include repeated controlled capos-bench evidence and raw logs.
  • CPU accounting must include sanity tests that measured runtime increases monotonically while a thread runs and stops while it is blocked.
  • Fair policy changes must include adversarial tests: CPU hogs, short sleepers, direct IPC handoff, multi-process load, and same-process sibling load.
  • Scheduling-context work must include admission rejection, budget depletion, replenishment, endpoint donation/return, timeout notification, stale cap revocation tests, and any future pre-armed notification waiter coverage.
  • CPU leases must include revocation, process exit, session close, and housekeeping fallback tests.
  • Realtime island proofs must show preallocation, no allocation/blocking on admitted paths, deadline miss telemetry, and fail-closed overrun behavior.

Open Decisions

  • Whether the first best-effort fair policy should be weighted fair queueing or direct EEVDF. Resolved 2026-05-05 19:00 UTC: WFQ first; EEVDF deferred follow-on. See “Phase D first-policy decision” above.
  • Whether scheduling-context priority is a scalar, a criticality band, or both.
  • Whether SchedulingContext should be bindable to a process default, individual thread, endpoint call path, or all three in the first ABI.
  • Which scheduler telemetry is permanent ABI and which is benchmark-only.
  • How much policy-service state belongs in the boot manifest versus mutable operator configuration.
  • Whether the WFQ slice’s bucketed VecDeque per-CPU queue is the long-term representation or a stepping stone to an EEVDF BTreeMap-based eligibility set. EEVDF is an evaluated follow-on policy, not a committed migration; re-evaluate only when the explicit Phase D follow-on EEVDF migration backlog item is selected.

Phase F’s one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work placement, bounded SQPOLL ring mode, clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake progress have landed on top of the closed Phase E SchedulingContext gate. The first automatic nohz activation increment is also closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md, and SQPOLL-driven auto-nohz activation is closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md. Timeout-based auto-revoke and ordinary-thread generic full-nohz admission have also landed. The policy-service AutoNoHz capstone and generic SQPOLL nohz for arbitrary rings remain open. Phase F.5 (full-SMP 16/32-core scalability) is still in planning.