Research: Future Scheduler Architecture
This note records the prior art reviewed for future capOS scheduling work, after the first SMP and per-thread ring milestones showed that scheduler structure, not only timer programming, will decide whether capOS scales.
Local Grounding
Existing capOS documents already cover part of the answer:
- Scheduling: current round-robin scheduler, LAPIC timer/IPI foundation, global run queue, per-thread rings, timer sleep waiters, park waiters, direct IPC handoff, and current SMP limitations.
- SMP Phase C: active multi-CPU execution and in-process thread-scaling proof work.
- SMP: accepted SMP direction.
- Ring v2 For Full SMP: per-thread completion ownership and SQPOLL preconditions.
- Tickless and Realtime Scheduling: tickless idle, SQPOLL nohz, deadlines, scheduling contexts, donation, and realtime islands.
- Stateful Task and Job Graphs: durable work graphs, assignment metadata, graph-run state, and the stop line that graph coordinators must not become authority-holding god objects.
- NO_HZ, SQPOLL, and Realtime Scheduling: Linux NO_HZ, clocksource/clockevent, CPU isolation, SQPOLL, SCHED_DEADLINE, PREEMPT_RT, and seL4 MCS grounding.
- Out-of-kernel scheduling: kernel mechanism versus user-space policy split.
- Completion rings and threaded runtimes: completion ownership, io_uring, futex, and IOCP lessons.
- Multimedia pipeline latency and Robotics realtime control: admitted realtime-island use cases.
External Sources Checked
- Linux kernel documentation, CFS Scheduler.
- Linux kernel documentation, EEVDF Scheduler.
- Linux kernel documentation, Deadline Task Scheduling.
- Linux kernel documentation, Extensible Scheduler Class.
- Linux kernel documentation, NO_HZ: Reducing Scheduling-Clock Ticks.
- Linux kernel documentation, CPU Isolation.
- Linux kernel documentation, Housekeeping.
- Linux kernel documentation, PREEMPT_RT theory of operation.
- FreeBSD manual, ULE scheduler.
- seL4 documentation, MCS Extensions tutorial.
- Anderson, Bershad, Lazowska, and Levy, Scheduler Activations.
- Ghosh et al., ghOSt: Fast & Flexible User-Space Delegation of Linux Scheduling.
- Ousterhout et al., Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads.
- Fried et al., Caladan: Mitigating Interference at Microsecond Timescales.
- Kaffes et al., Shinjuku: Preemptive Scheduling for microsecond-scale Tail Latency.
- Qin et al., Arachne: Core-Aware Thread Management.
Findings
Fair General-Purpose Scheduling
Linux CFS established the now-common model that ordinary tasks should be ordered by virtual runtime, not by a fixed time-slice list. Linux EEVDF keeps the fair-scheduler lineage but chooses the eligible task with the earliest virtual deadline, using request size and lag to improve latency and fairness.
The capOS consequence is not “import Linux CFS.” It is:
- the current FIFO round-robin run queue is only a bootstrap policy;
- ordinary best-effort work should eventually use virtual-time accounting;
- latency-sensitive best-effort work should have bounded, policy-visible request sizes or weights rather than hidden scheduler magic;
- per-CPU run queues are a prerequisite before any EEVDF-like policy matters at SMP scale.
EEVDF is the strongest candidate for the capOS default best-effort policy once CPU accounting exists. It should not be the first scheduler refactor, because it depends on accurate runtime charging, per-CPU runnable ownership, and migration accounting.
Per-CPU Run Queues and Topology
Linux and FreeBSD both make per-CPU scheduler state the normal SMP unit. FreeBSD ULE additionally exposes CPU topology and affinity as first-class placement concerns. This matches the current capOS scaling evidence: one global scheduler lock and one global run queue make every CPU contend on the same state even after per-thread rings remove the process-wide CQ bottleneck.
The near-term capOS scheduling architecture should split:
- per-CPU current thread and run queue ownership;
- cross-CPU wakeup and migration paths;
- shared process/thread metadata protected by narrower locks;
- placement policy from dispatch mechanism;
- diagnostic counters for lock hold/spin time, migration, steals, and IPIs.
Realtime and Temporal Isolation
Linux SCHED_DEADLINE uses EDF plus Constant Bandwidth Server-style budget,
deadline, and period parameters. Its key lesson for capOS is admission:
deadline scheduling without bandwidth control is only a priority policy, not
a guarantee.
seL4 MCS is the more capability-native precedent. CPU time is represented by scheduling-context objects. Passive servers can run on caller-donated CPU time, avoiding priority inversion across synchronous IPC. This maps directly to capOS endpoint services and direct IPC handoff.
The capOS split should remain:
- SQE.deadline_ns: request freshness and propagation metadata;
- SchedulingContext: spendable CPU-time authority;
- donation: temporary transfer of CPU budget/deadline along a synchronous capability path;
- RealtimeIsland: admitted bundle of scheduling contexts, memory/device reservations, communication paths, and overrun policy.
Tickless, Isolation, and Housekeeping
Linux NO_HZ and CPU isolation reinforce that tick suppression is not one feature. Idle tickless is a timer cleanup. Full-nohz is an isolation contract that also needs housekeeping CPUs, accounting, timer migration, deferred work placement, and revocation latency policy.
For capOS:
- tickless idle should stay the first timer milestone;
- full-nohz should be tied to a CpuIsolationLease;
- SQPOLL nohz should require one SQ consumer, per-thread rings, explicit housekeeping, and network polling placement;
- realtime islands may use nohz only after admission proves that unrelated work, IRQs, deferred frees, and timers are excluded or bounded.
Pluggable and User-Space Policy
Linux sched_ext and ghOSt show that fast scheduler experimentation is useful, but they also preserve privileged dispatch and enforcement. sched_ext runs BPF inside the kernel scheduler framework with fallback; ghOSt delegates policy to user-space agents while retaining kernel mechanisms for safety and preemption.
For capOS, the safe architecture is:
- keep dispatch, budget enforcement, interrupt handling, idle, and fallback in the kernel;
- expose policy knobs through capabilities;
- let a privileged scheduler-policy service own admission, budget selection, CPU partitioning, isolation leases, and tuning;
- call the policy service on configuration changes, depletion/timeout faults, and coarse placement events, not on every context switch.
Dynamic policy loading is a later experiment. It should not become the first way to make basic SMP scheduling scale.
Datacenter Runtime Schedulers
Shenango, Caladan, Shinjuku, and Arachne target microsecond-scale service latency by managing cores, preempting long request handlers, and separating fast user-level scheduling from coarser kernel control. They are useful because capOS will host services, agent runtimes, and network stacks that want low tail latency.
The shared lessons are:
- core grants are different from CPU-time budgets;
- user-level worker schedulers need kernel-visible blocking and preemption boundaries;
- tail-latency policies need request-level telemetry, not only thread-level CPU shares;
- cross-core coordination must be cheap enough that it does not dominate the service latency it tries to reduce.
For capOS this argues for scheduler hints and policy capabilities above the kernel mechanism, not a datacenter-specific kernel scheduler as the default.
Stateful Work Graphs
The stateful task/job graph proposal is related at the workload layer. A graph node can carry assignment metadata such as priority, deadline, budget, queue, and lease, and a domain coordinator can decide which node attempt is runnable inside that graph. That is not the same authority as kernel CPU scheduling.
The scheduler consequence is a clean boundary:
- graph/node priority is domain policy until translated by an authorized scheduler policy service;
- graph budgets reference resource profiles or scheduling contexts, but do not mint CPU time by themselves;
- graph deadlines may create request deadlines or admission inputs, but do not bypass scheduler admission;
- build, init, agent, and operator graph coordinators should lease work and consume scheduler primitives rather than owning a global CPU run queue;
- scheduler telemetry should be attachable to graph runs as artifacts, so a failed or slow job can explain whether it waited on authority, CPU budget, dependency state, I/O, or policy.
Recommended capOS Direction
- Finish the current thread-scale evidence before larger policy changes.
- Split scheduler state into per-CPU runnable ownership and bounded cross-CPU wake/migration.
- Add precise CPU accounting and scheduler attribution before changing the default policy.
- Move ordinary best-effort work toward an EEVDF-like virtual-deadline policy after accounting and per-CPU queues exist.
- Keep SCHED_DEADLINE/EDF-CBS and seL4 MCS as the precedent for admitted realtime work, but express CPU authority as capOS SchedulingContext capabilities.
- Keep user-space scheduler policy coarse-grained and capability-authorized; do not consult a user process on every timer interrupt or dispatch.
- Treat SQPOLL, busy polling, and full-nohz as CPU-isolation leases with housekeeping and revocation constraints.
- Keep runtime schedulers above per-thread rings, futex/park/notification primitives, timers, and explicit thread objects.
The resulting target is a layered scheduler:
- Kernel dispatch/enforcement: per-CPU queues, context switch, idle, accounting, budget enforcement, timeout faults, direct IPC donation, and cross-CPU wake/migration.
- Kernel policy primitives: weights, virtual deadlines, scheduling contexts, CPU masks, isolation leases, and realtime-island admission hooks.
- Userspace policy: profiles, admission, budget selection, service/runtime hints, placement, diagnostics, and policy reload.
- Userspace runtimes: work stealing, actor queues, async reactors, service request schedulers, and language-specific M:N scheduling.
Open Questions
- What minimum per-CPU queue split closes the current thread-scale milestone without prematurely designing the whole fair scheduler?
- Should the first virtual-time policy be simple weighted fair queueing before EEVDF, or should capOS move directly to virtual-deadline accounting once runtime charging exists?
- What is the smallest SchedulingContext ABI that supports endpoint donation without freezing an unsuitable realtime contract?
- How should CpuIsolationLease revocation interact with session logout, service replacement, and process exit cleanup?
- Which scheduler telemetry belongs in the always-on kernel, and which belongs behind the benchmark-only measure feature?