Research: Future Scheduler Architecture
This note records the prior art checked for future capOS scheduling work after the first SMP and per-thread ring milestones exposed that scheduler structure, not only timer programming, will decide whether capOS scales.
Local Grounding
Existing capOS documents already cover part of the answer:
- Scheduling: scheduler architecture including
Phase D WFQ policy, Phase E
SchedulingContext, Phase FCpuIsolationLeaseand CPL0 idle thread, LAPIC timer/IPI foundation, global run queue, per-thread rings, timer sleep waiters, park waiters, direct IPC handoff, and SMP state. - SMP Phase C: active multi-CPU execution and in-process thread-scaling proof work.
- SMP: accepted SMP direction.
- Ring v2 For Full SMP: per-thread completion ownership and SQPOLL preconditions.
- Tickless and Realtime Scheduling: tickless idle, SQPOLL nohz, deadlines, scheduling contexts, donation, and realtime islands.
- Stateful Task and Job Graphs: durable work graphs, assignment metadata, graph-run state, and the stop line that graph coordinators must not become authority-holding god objects.
- NO_HZ, SQPOLL, and Realtime Scheduling: Linux
NO_HZ, clocksource/clockevent, CPU isolation, SQPOLL,
SCHED_DEADLINE, PREEMPT_RT, and seL4 MCS grounding. - Out-of-kernel scheduling: kernel mechanism versus user-space policy split.
- Completion rings and threaded runtimes:
completion ownership,
io_uring, futex, and IOCP lessons. - Multimedia pipeline latency and Robotics realtime control: admitted realtime-island use cases.
External Sources Checked
- Linux kernel documentation, CFS Scheduler.
- Linux kernel documentation, EEVDF Scheduler.
- Linux kernel documentation, Deadline Task Scheduling.
- Linux kernel documentation, Extensible Scheduler Class.
- Linux kernel documentation, NO_HZ: Reducing Scheduling-Clock Ticks.
- Linux kernel documentation, CPU Isolation.
- Linux kernel documentation, Housekeeping.
- Linux kernel documentation, PREEMPT_RT theory of operation.
- FreeBSD manual, ULE scheduler.
- seL4 documentation, MCS Extensions tutorial.
- Anderson, Bershad, Lazowska, and Levy, Scheduler Activations.
- Ghosh et al., ghOSt: Fast & Flexible User-Space Delegation of Linux Scheduling.
- Ousterhout et al., Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads.
- Fried et al., Caladan: Mitigating Interference at Microsecond Timescales.
- Kaffes et al., Shinjuku: Preemptive Scheduling for microsecond-scale Tail Latency.
- Qin et al., Arachne: Core-Aware Thread Management.
Findings
Fair General-Purpose Scheduling
Linux CFS established the now-common model that ordinary tasks should be ordered by virtual runtime, not by a fixed time-slice list. Linux EEVDF keeps the fair-scheduler lineage but chooses the eligible task with the earliest virtual deadline, using request size and lag to improve latency and fairness.
The capOS consequence is not “import Linux CFS.” It is:
- ordinary best-effort work should use virtual-time accounting (Phase D WFQ is now the active policy; the earlier FIFO round-robin was the bootstrap);
- latency-sensitive best-effort work should have bounded, policy-visible request sizes or weights rather than hidden scheduler magic;
- per-CPU run queues are a prerequisite before any EEVDF-like policy matters at SMP scale.
EEVDF is the strongest candidate for the next capOS best-effort policy evolution after WFQ. It should follow WFQ rather than replace it immediately because it depends on accurate runtime charging, per-CPU runnable ownership, and migration accounting that are not yet in place.
Per-CPU Run Queues and Topology
Linux and FreeBSD both make per-CPU scheduler state the normal SMP unit. FreeBSD ULE additionally exposes CPU topology and affinity as first-class placement concerns. This matches the current capOS scaling evidence: one global scheduler lock and one global run queue make every CPU contend on the same state even after per-thread rings remove the process-wide CQ bottleneck.
The near-term capOS scheduling architecture should split:
- per-CPU current thread and run queue ownership;
- cross-CPU wakeup and migration paths;
- shared process/thread metadata protected by narrower locks;
- placement policy from dispatch mechanism;
- diagnostic counters for lock hold/spin time, migration, steals, and IPIs.
Realtime and Temporal Isolation
Linux SCHED_DEADLINE uses EDF plus Constant Bandwidth Server-style budget,
deadline, and period parameters. Its key lesson for capOS is admission:
deadline scheduling without bandwidth control is only a priority policy, not
a guarantee.
seL4 MCS is the more capability-native precedent. CPU time is represented by scheduling-context objects. Passive servers can run on caller-donated CPU time, avoiding priority inversion across synchronous IPC. This maps directly to capOS endpoint services and direct IPC handoff.
The capOS split should remain:
SQE.deadline_ns: request freshness and propagation metadata;SchedulingContext: spendable CPU-time authority;- donation: temporary transfer of CPU budget/deadline along a synchronous capability path;
RealtimeIsland: admitted bundle of scheduling contexts, memory/device reservations, communication paths, and overrun policy.
Tickless, Isolation, and Housekeeping
Linux NO_HZ and CPU isolation reinforce that tick suppression is not one feature. Idle tickless is a timer cleanup. Full-nohz is an isolation contract that also needs housekeeping CPUs, accounting, timer migration, deferred work placement, and revocation latency policy.
For capOS, this grounding shaped the implementation order: automatic nohz
activation for the narrow single-runnable-entity window and SQPOLL-driven
auto-nohz for ring-coupled leases are now implemented (Phase F), both tied
to the CpuIsolationLease with housekeeping, deferred-work placement,
clockevent deadline substrate, one-SQ-consumer ownership, and fail-closed
rollback prerequisites satisfied first. Generic full-nohz for explicitly
budgeted compute leases, timeout-based auto-revoke, and SQPOLL nohz for
explicitly leased caller-thread rings have since landed. Remaining future work
includes:
- full-nohz tied to policy-service issuance and durable monitoring telemetry;
- SQPOLL nohz beyond the current caller-thread ring-coupled lease shape;
- realtime island nohz after admission proves unrelated work, IRQs, deferred frees, and timers are excluded or bounded.
Pluggable and User-Space Policy
Linux sched_ext and ghOSt show that fast scheduler experimentation is useful, but they also preserve privileged dispatch and enforcement. sched_ext runs BPF inside the kernel scheduler framework with fallback; ghOSt delegates policy to user-space agents while retaining kernel mechanisms for safety and preemption.
For capOS, the safe architecture is:
- keep dispatch, budget enforcement, interrupt handling, idle, and fallback in the kernel;
- expose policy knobs through capabilities;
- let a privileged scheduler-policy service own admission, budget selection, CPU partitioning, isolation leases, and tuning;
- call the policy service on configuration changes, depletion/timeout faults, and coarse placement events, not on every context switch.
Dynamic policy loading is a later experiment. It should not become the first way to make basic SMP scheduling scale.
Datacenter Runtime Schedulers
Shenango, Caladan, Shinjuku, and Arachne target microsecond-scale service latency by managing cores, preempting long request handlers, and separating fast user-level scheduling from coarser kernel control. They are useful because capOS will host services, agent runtimes, and network stacks that want low tail latency.
The shared lessons are:
- core grants are different from CPU-time budgets;
- user-level worker schedulers need kernel-visible blocking and preemption boundaries;
- tail-latency policies need request-level telemetry, not only thread-level CPU shares;
- cross-core coordination must be cheap enough that it does not dominate the service latency it tries to reduce.
For capOS this argues for scheduler hints and policy capabilities above the kernel mechanism, not a datacenter-specific kernel scheduler as the default.
Stateful Work Graphs
The stateful task/job graph proposal is related at the workload layer. A graph node can carry assignment metadata such as priority, deadline, budget, queue, and lease, and a domain coordinator can decide which node attempt is runnable inside that graph. That is not the same authority as kernel CPU scheduling.
The scheduler consequence is a clean boundary:
- graph/node priority is domain policy until translated by an authorized scheduler policy service;
- graph budgets reference resource profiles or scheduling contexts, but do not mint CPU time by themselves;
- graph deadlines may create request deadlines or admission inputs, but do not bypass scheduler admission;
- build, init, agent, and operator graph coordinators should lease work and consume scheduler primitives rather than owning a global CPU run queue;
- scheduler telemetry should be attachable to graph runs as artifacts, so a failed or slow job can explain whether it waited on authority, CPU budget, dependency state, I/O, or policy.
Recommended capOS Direction
- (Done: Phase D/E/F) Finish the current thread-scale evidence before
larger policy changes. Phase D WFQ, Phase E
SchedulingContext, and Phase FCpuIsolationLease/ auto-nohz / SQPOLL-coupled nohz have landed. - Split scheduler state into per-CPU runnable ownership and bounded cross-CPU wake/migration. (Per-CPU queues remain future work; Phase F.5.)
- Add precise CPU accounting and scheduler attribution before changing the default policy. (Attribution guardrails landed in Phase A; full per-CPU accounting is Phase F.5 follow-on.)
- Move ordinary best-effort work toward an EEVDF-like virtual-deadline policy after accounting and per-CPU queues exist. (WFQ is current; EEVDF is a follow-on evaluation deferred until per-CPU queues exist.)
- Keep
SCHED_DEADLINE/EDF-CBS and seL4 MCS as the precedent for admitted realtime work, but express CPU authority as capOSSchedulingContextcapabilities. (SchedulingContextis implemented;RealtimeIslandadmission is Phase G future work.) - Keep user-space scheduler policy coarse-grained and capability-authorized; do not consult a user process on every timer interrupt or dispatch.
- Treat SQPOLL, busy polling, and full-nohz as CPU-isolation leases with housekeeping and revocation constraints. (Ring-coupled SQPOLL nohz and generic full-nohz for explicitly budgeted compute leases are implemented; policy-service issuance remains future work.)
- Keep runtime schedulers above per-thread rings, futex/park/notification primitives, timers, and explicit thread objects.
The resulting target is a layered scheduler:
- Kernel dispatch/enforcement: per-CPU queues, context switch, idle, accounting, budget enforcement, timeout faults, direct IPC donation, and cross-CPU wake/migration.
- Kernel policy primitives: weights, virtual deadlines, scheduling contexts, CPU masks, isolation leases, and realtime-island admission hooks.
- Userspace policy: profiles, admission, budget selection, service/runtime hints, placement, diagnostics, and policy reload.
- Userspace runtimes: work stealing, actor queues, async reactors, service request schedulers, and language-specific M:N scheduling.
Open Questions
The questions below have been answered by Phase D/E/F implementation; they are kept for record and context:
- Answered (Phase D): WFQ is the first virtual-time policy; EEVDF is deferred until per-CPU queues and runtime charging exist.
- Answered (Phase D): The thread-scale milestone did not require a per-CPU queue split; WFQ on a global queue with per-thread weight sufficed.
- Answered (Phase E): The initial
SchedulingContextABI usesSchedulingContextSpec(weight, latency class, budget, period, overrun policy) andSchedulingContextInfofor info-only read access;SchedulingContext.info()is method id 0 for stability. Donation/return through endpoints is implemented; realtime island admission is future work. - Answered (Phase F):
CpuIsolationLeaserevocation interacts with session logout and process exit through lease-generation staling, which is the load-bearing rollback trigger; service replacement and process exit cleanup go through the same generation-staling path.
Remaining open questions:
- What is the minimum per-CPU queue split that closes the full-SMP scalability milestone (Phase F.5) without prematurely designing the full fair scheduler?
- How should policy-service issuance select and renew generic full-nohz and SQPOLL nohz leases beyond the current explicit local proofs?
- Which scheduler telemetry belongs in the always-on kernel and which belongs
behind the benchmark-only
measurefeature? - What is the right
RealtimeIslandadmission shape for admitted scheduling contexts, memory/device reservations, and overrun policy (Phase G)?