# Research: Future Scheduler Architecture

This note records the prior art reviewed for future capOS scheduling work,
after the first SMP and per-thread-ring milestones showed that scheduler
structure, not just timer programming, will decide whether capOS scales.

## Local Grounding

Existing capOS documents already cover part of the answer:

- [Scheduling](../architecture/scheduling.md): current round-robin scheduler,
  LAPIC timer/IPI foundation, global run queue, per-thread rings, timer sleep
  waiters, park waiters, direct IPC handoff, and current SMP limitations.
- [SMP Phase C](../backlog/smp-phase-c.md): active multi-CPU execution and
  in-process thread-scaling proof work.
- [SMP](../proposals/smp-proposal.md): accepted SMP direction.
- [Ring v2 For Full SMP](../proposals/ring-v2-smp-proposal.md): per-thread
  completion ownership and SQPOLL preconditions.
- [Tickless and Realtime Scheduling](../proposals/tickless-realtime-scheduling-proposal.md):
  tickless idle, SQPOLL nohz, deadlines, scheduling contexts, donation, and
  realtime islands.
- [Stateful Task and Job Graphs](../proposals/stateful-task-job-graphs-proposal.md):
  durable work graphs, assignment metadata, graph-run state, and the stop line
  that graph coordinators must not become authority-holding god objects.
- [NO_HZ, SQPOLL, and Realtime Scheduling](nohz-sqpoll-realtime.md): Linux
  NO_HZ, clocksource/clockevent, CPU isolation, SQPOLL, `SCHED_DEADLINE`,
  PREEMPT_RT, and seL4 MCS grounding.
- [Out-of-kernel scheduling](out-of-kernel-scheduling.md): kernel mechanism
  versus user-space policy split.
- [Completion rings and threaded runtimes](completion-ring-threading.md):
  completion ownership, `io_uring`, futex, and IOCP lessons.
- [Multimedia pipeline latency](multimedia-pipeline-latency.md) and
  [Robotics realtime control](robotics-realtime-control.md): admitted
  realtime-island use cases.

## External Sources Checked

- Linux kernel documentation,
  [CFS Scheduler](https://docs.kernel.org/scheduler/sched-design-CFS.html).
- Linux kernel documentation,
  [EEVDF Scheduler](https://www.kernel.org/doc/html/latest/scheduler/sched-eevdf.html).
- Linux kernel documentation,
  [Deadline Task Scheduling](https://www.kernel.org/doc/html/latest/scheduler/sched-deadline.html).
- Linux kernel documentation,
  [Extensible Scheduler Class](https://docs.kernel.org/scheduler/sched-ext.html).
- Linux kernel documentation,
  [NO_HZ: Reducing Scheduling-Clock Ticks](https://docs.kernel.org/timers/no_hz.html).
- Linux kernel documentation,
  [CPU Isolation](https://www.kernel.org/doc/html/next/admin-guide/cpu-isolation.html).
- Linux kernel documentation,
  [Housekeeping](https://www.kernel.org/doc/html/latest/core-api/housekeeping.html).
- Linux kernel documentation,
  [PREEMPT_RT theory of operation](https://docs.kernel.org/core-api/real-time/theory.html).
- FreeBSD manual,
  [ULE scheduler](https://man.freebsd.org/cgi/man.cgi?query=sched_ule&manpath=FreeBSD+11.0-RELEASE).
- seL4 documentation,
  [MCS Extensions tutorial](https://docs.sel4.systems/Tutorials/mcs.html).
- Anderson, Bershad, Lazowska, and Levy,
  [Scheduler Activations](https://homes.cs.washington.edu/~tom/pubs/sched_act.html).
- Humphries et al.,
  [ghOSt: Fast & Flexible User-Space Delegation of Linux Scheduling](https://dl.acm.org/doi/10.1145/3477132.3483542).
- Ousterhout et al.,
  [Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads](https://www.usenix.org/conference/nsdi19/presentation/ousterhout).
- Fried et al.,
  [Caladan: Mitigating Interference at Microsecond Timescales](https://www.usenix.org/conference/osdi20/presentation/fried).
- Kaffes et al.,
  [Shinjuku: Preemptive Scheduling for microsecond-scale Tail Latency](https://www.usenix.org/conference/nsdi19/presentation/kaffes).
- Qin et al.,
  [Arachne: Core-Aware Thread Management](https://www.usenix.org/conference/osdi18/presentation/qin).

## Findings

### Fair General-Purpose Scheduling

Linux CFS established the now-common model that ordinary tasks should be
ordered by virtual runtime, not by a fixed time-slice list. Linux EEVDF keeps
the fair-scheduler lineage but chooses the eligible task with the earliest
virtual deadline, using request size and lag to improve latency and fairness.

The capOS consequence is not "import Linux CFS." It is:

- the current FIFO round-robin run queue is only a bootstrap policy;
- ordinary best-effort work should eventually use virtual-time accounting;
- latency-sensitive best-effort work should have bounded, policy-visible
  request sizes or weights rather than hidden scheduler magic;
- per-CPU run queues are a prerequisite before any EEVDF-like policy matters
  at SMP scale.

EEVDF is the strongest candidate for the capOS default best-effort policy once
CPU accounting exists. It should not be the first scheduler refactor, because
it depends on accurate runtime charging, per-CPU runnable ownership, and
migration accounting.

### Per-CPU Run Queues and Topology

Linux and FreeBSD both make per-CPU scheduler state the normal SMP unit.
FreeBSD ULE additionally exposes CPU topology and affinity as first-class
placement concerns. This matches the current capOS scaling evidence: one
global scheduler lock and one global run queue make every CPU contend on the
same state even after per-thread rings remove the process-wide CQ bottleneck.

The near-term capOS scheduling architecture should split:

- per-CPU current thread and run queue ownership;
- cross-CPU wakeup and migration paths;
- shared process/thread metadata protected by narrower locks;
- placement policy from dispatch mechanism;
- diagnostic counters for lock hold/spin time, migration, steals, and IPIs.
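A minimal sketch of that split, with hypothetical type names: each CPU owns
its runnable queue behind its own lock, so local enqueue and dispatch never
touch global state, and the cross-CPU steal path is an explicit, countable
exception rather than contention on one shared queue.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Illustrative per-CPU run-queue split; names are assumptions, not capOS ABI.
struct CpuRunQueue {
    runnable: Mutex<VecDeque<u32>>, // thread ids owned by this CPU
}

struct Scheduler {
    cpus: Vec<CpuRunQueue>,
}

impl Scheduler {
    fn new(ncpus: usize) -> Self {
        Scheduler {
            cpus: (0..ncpus)
                .map(|_| CpuRunQueue { runnable: Mutex::new(VecDeque::new()) })
                .collect(),
        }
    }

    /// Local enqueue: takes only this CPU's lock, never a global one.
    fn enqueue(&self, cpu: usize, tid: u32) {
        self.cpus[cpu].runnable.lock().unwrap().push_back(tid);
    }

    /// Dispatch prefers the local queue; the steal path is the explicit,
    /// separately diagnosable migration case.
    fn pick(&self, cpu: usize) -> Option<u32> {
        if let Some(tid) = self.cpus[cpu].runnable.lock().unwrap().pop_front() {
            return Some(tid);
        }
        for (i, other) in self.cpus.iter().enumerate() {
            if i != cpu {
                if let Some(tid) = other.runnable.lock().unwrap().pop_back() {
                    return Some(tid); // migration: would bump a steal counter here
                }
            }
        }
        None
    }
}
```

The point of the sketch is the lock granularity: the common path holds one
per-CPU lock briefly, and every cross-CPU touch is a named event that the
diagnostic counters above can attribute.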

### Realtime and Temporal Isolation

Linux `SCHED_DEADLINE` uses EDF plus Constant Bandwidth Server-style budget,
deadline, and period parameters. Its key lesson for capOS is admission:
deadline scheduling without bandwidth control is only a priority policy, not
a guarantee.

seL4 MCS is the more capability-native precedent. CPU time is represented by
scheduling-context objects. Passive servers can run on caller-donated CPU
time, avoiding priority inversion across synchronous IPC. This maps directly
to capOS endpoint services and direct IPC handoff.

The capOS split should remain:

- `SQE.deadline_ns`: request freshness and propagation metadata;
- `SchedulingContext`: spendable CPU-time authority;
- donation: temporary transfer of CPU budget/deadline along a synchronous
  capability path;
- `RealtimeIsland`: admitted bundle of scheduling contexts, memory/device
  reservations, communication paths, and overrun policy.
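A minimal sketch of the donation shape, assuming hypothetical `Thread` fields:
a passive server thread holds no scheduling context of its own and runs on
whatever context the caller lends it for the duration of the synchronous call,
so its execution is charged to the caller's budget.

```rust
// Hypothetical shapes for the split above; field names are illustrative.
#[derive(Clone, Copy)]
struct SchedulingContext {
    budget_ns: u64, // CPU time spendable per period
    period_ns: u64, // replenishment interval
}

/// A passive server has no context of its own; a synchronous call lends the
/// caller's context for the duration of the request.
struct Thread {
    own_ctx: Option<SchedulingContext>,
    borrowed_ctx: Option<SchedulingContext>,
}

impl Thread {
    /// Borrowed time wins over owned time while a donation is active.
    fn effective_ctx(&self) -> Option<SchedulingContext> {
        self.borrowed_ctx.or(self.own_ctx)
    }
}

/// Donate the caller's effective context to a server for one IPC; the kernel
/// would charge the server's execution against the caller's budget.
fn donate(caller: &Thread, server: &mut Thread) {
    server.borrowed_ctx = caller.effective_ctx();
}
```

This is the seL4 MCS passive-server pattern restated in capOS terms: the
endpoint call path, not the server, is what carries spendable CPU authority.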

### Tickless, Isolation, and Housekeeping

Linux NO_HZ and CPU isolation reinforce that tick suppression is not one
feature. Idle tickless is a timer cleanup. Full-nohz is an isolation contract
that also needs housekeeping CPUs, accounting, timer migration, deferred work
placement, and revocation latency policy.

For capOS:

- tickless idle should stay the first timer milestone;
- full-nohz should be tied to a `CpuIsolationLease`;
- SQPOLL nohz should require one SQ consumer, per-thread rings, explicit
  housekeeping, and network polling placement;
- realtime islands may use nohz only after admission proves that unrelated
  work, IRQs, deferred frees, and timers are excluded or bounded.
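One way to picture the housekeeping constraint, using a hypothetical
`admit_lease` check: a full-nohz `CpuIsolationLease` is admissible only while
at least one CPU remains outside every active lease to absorb timers, deferred
frees, and IRQs. The function and its shape are assumptions for illustration.

```rust
// Hypothetical admission check for full-nohz leases; illustrative only.
struct CpuIsolationLease {
    isolated_cpus: Vec<usize>,
}

/// A new lease is admitted only if, counting all active leases plus the
/// request, at least one CPU survives as a housekeeping CPU.
fn admit_lease(
    ncpus: usize,
    active: &[CpuIsolationLease],
    request: &CpuIsolationLease,
) -> bool {
    let mut isolated: Vec<usize> = active
        .iter()
        .flat_map(|l| l.isolated_cpus.clone())
        .collect();
    isolated.extend(request.isolated_cpus.iter().copied());
    isolated.sort_unstable();
    isolated.dedup();
    isolated.len() < ncpus // at least one housekeeping CPU must remain
}
```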

### Pluggable and User-Space Policy

Linux sched_ext and ghOSt show that fast scheduler experimentation is useful,
but they also preserve privileged dispatch and enforcement. sched_ext runs BPF
inside the kernel scheduler framework with fallback; ghOSt delegates policy to
user-space agents while retaining kernel mechanisms for safety and preemption.

For capOS, the safe architecture is:

- keep dispatch, budget enforcement, interrupt handling, idle, and fallback in
  the kernel;
- expose policy knobs through capabilities;
- let a privileged scheduler-policy service own admission, budget selection,
  CPU partitioning, isolation leases, and tuning;
- call the policy service on configuration changes, depletion/timeout faults,
  and coarse placement events, not on every context switch.
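The coarse-grained upcall contract can be sketched as an event queue the
kernel appends to and the policy service drains in batches, so one wakeup
handles many events and dispatch itself never waits on user space. The event
names below are assumptions, not a defined capOS interface.

```rust
use std::collections::VecDeque;

// Illustrative upcall surface for a user-space scheduler-policy service.
enum PolicyEvent {
    ConfigChanged { profile_id: u32 },
    BudgetDepleted { thread_id: u32 },
    DeadlineMissed { thread_id: u32 },
    PlacementHint { thread_id: u32, preferred_cpu: usize },
}

/// The kernel pushes coarse events; the policy service drains them in batch.
/// The dispatch path only appends and never blocks on this queue.
struct PolicyQueue {
    events: VecDeque<PolicyEvent>,
}

impl PolicyQueue {
    fn push(&mut self, e: PolicyEvent) {
        self.events.push_back(e);
    }

    /// Drain everything at once so a single service wakeup covers a burst.
    fn drain(&mut self) -> Vec<PolicyEvent> {
        self.events.drain(..).collect()
    }
}
```

The contrast with per-context-switch consultation is the whole design point:
the queue grows during bursts, but the kernel never waits on a consumer.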

Dynamic policy loading is a later experiment. It should not become the first
way to make basic SMP scheduling scale.

### Datacenter Runtime Schedulers

Shenango, Caladan, Shinjuku, and Arachne target microsecond-scale service
latency by managing cores, preempting long request handlers, and separating
fast user-level scheduling from coarser kernel control. They are useful
because capOS will host services, agent runtimes, and network stacks that want
low tail latency.

The shared lessons are:

- core grants are different from CPU-time budgets;
- user-level worker schedulers need kernel-visible blocking and preemption
  boundaries;
- tail-latency policies need request-level telemetry, not only thread-level
  CPU shares;
- cross-core coordination must be cheap enough that it does not dominate the
  service latency it tries to reduce.

For capOS this argues for scheduler hints and policy capabilities above the
kernel mechanism, not a datacenter-specific kernel scheduler as the default.

### Stateful Work Graphs

The stateful task/job graph proposal is related at the workload layer. A graph
node can carry assignment metadata such as priority, deadline, budget, queue,
and lease, and a domain coordinator can decide which node attempt is runnable
inside that graph. That is not the same authority as kernel CPU scheduling.

The scheduler consequence is a clean boundary:

- graph/node priority is domain policy until translated by an authorized
  scheduler policy service;
- graph budgets reference resource profiles or scheduling contexts, but do not
  mint CPU time by themselves;
- graph deadlines may create request deadlines or admission inputs, but do not
  bypass scheduler admission;
- build, init, agent, and operator graph coordinators should lease work and
  consume scheduler primitives rather than owning a global CPU run queue;
- scheduler telemetry should be attachable to graph runs as artifacts, so a
  failed or slow job can explain whether it waited on authority, CPU budget,
  dependency state, I/O, or policy.
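The boundary can be sketched as a single authorized narrowing step: graph
metadata becomes at most a request deadline and an admission input, never raw
CPU authority, and an unauthorized coordinator gets nothing. All types and
names below are illustrative assumptions.

```rust
// Hypothetical translation boundary: graph metadata may request, never mint.
#[derive(Clone, Copy)]
struct NodeAssignment {
    deadline_ns: Option<u64>, // domain-level freshness hint
    priority: u8,             // domain policy, opaque to the kernel
}

/// What an authorized scheduler-policy service may derive from a node:
/// a request deadline plus an admission input, not CPU time itself.
struct AdmissionRequest {
    request_deadline_ns: Option<u64>,
    wants_context: bool,
}

fn translate(node: NodeAssignment, authorized: bool) -> Option<AdmissionRequest> {
    if !authorized {
        return None; // unauthorized coordinators get no scheduler authority
    }
    Some(AdmissionRequest {
        request_deadline_ns: node.deadline_ns,
        wants_context: node.priority > 0,
    })
}
```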

## Recommended capOS Direction

1. Finish the current thread-scale evidence before larger policy changes.
2. Split scheduler state into per-CPU runnable ownership and bounded
   cross-CPU wake/migration.
3. Add precise CPU accounting and scheduler attribution before changing the
   default policy.
4. Move ordinary best-effort work toward an EEVDF-like virtual-deadline policy
   after accounting and per-CPU queues exist.
5. Keep `SCHED_DEADLINE`/EDF-CBS and seL4 MCS as the precedent for admitted
   realtime work, but express CPU authority as capOS `SchedulingContext`
   capabilities.
6. Keep user-space scheduler policy coarse-grained and capability-authorized;
   do not consult a user process on every timer interrupt or dispatch.
7. Treat SQPOLL, busy polling, and full-nohz as CPU-isolation leases with
   housekeeping and revocation constraints.
8. Keep runtime schedulers above per-thread rings, futex/park/notification
   primitives, timers, and explicit thread objects.

The resulting target is a layered scheduler:

- Kernel dispatch/enforcement: per-CPU queues, context switch, idle,
  accounting, budget enforcement, timeout faults, direct IPC donation, and
  cross-CPU wake/migration.
- Kernel policy primitives: weights, virtual deadlines, scheduling contexts,
  CPU masks, isolation leases, and realtime-island admission hooks.
- Userspace policy: profiles, admission, budget selection, service/runtime
  hints, placement, diagnostics, and policy reload.
- Userspace runtimes: work stealing, actor queues, async reactors, service
  request schedulers, and language-specific M:N scheduling.

## Open Questions

- What minimum per-CPU queue split closes the current thread-scale milestone
  without prematurely designing the whole fair scheduler?
- Should the first virtual-time policy be simple weighted fair queueing before
  EEVDF, or should capOS move directly to virtual-deadline accounting once
  runtime charging exists?
- What is the smallest `SchedulingContext` ABI that supports endpoint donation
  without freezing an unsuitable realtime contract?
- How should `CpuIsolationLease` revocation interact with session logout,
  service replacement, and process exit cleanup?
- Which scheduler telemetry belongs in the always-on kernel and which belongs
  behind the benchmark-only `measure` feature?
