Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Research: Future Scheduler Architecture

This note records the prior art checked for future capOS scheduling work after the first SMP and per-thread ring milestones exposed that scheduler structure, not only timer programming, will decide whether capOS scales.

Local Grounding

Existing capOS documents already cover part of the answer:

External Sources Checked

Findings

Fair General-Purpose Scheduling

Linux CFS established the now-common model that ordinary tasks should be ordered by virtual runtime, not by a fixed time-slice list. Linux EEVDF keeps the fair-scheduler lineage but chooses the eligible task with the earliest virtual deadline, using request size and lag to improve latency and fairness.

The capOS consequence is not “import Linux CFS.” It is:

  • ordinary best-effort work should use virtual-time accounting (Phase D WFQ is now the active policy; the earlier FIFO round-robin was the bootstrap);
  • latency-sensitive best-effort work should have bounded, policy-visible request sizes or weights rather than hidden scheduler magic;
  • per-CPU run queues are a prerequisite before any EEVDF-like policy matters at SMP scale.

EEVDF is the strongest candidate for the next capOS best-effort policy evolution after WFQ. It should follow WFQ rather than replace it immediately because it depends on accurate runtime charging, per-CPU runnable ownership, and migration accounting that are not yet in place.

Per-CPU Run Queues and Topology

Linux and FreeBSD both make per-CPU scheduler state the normal SMP unit. FreeBSD ULE additionally exposes CPU topology and affinity as first-class placement concerns. This matches the current capOS scaling evidence: one global scheduler lock and one global run queue make every CPU contend on the same state even after per-thread rings remove the process-wide CQ bottleneck.

The near-term capOS scheduling architecture should split:

  • per-CPU current thread and run queue ownership;
  • cross-CPU wakeup and migration paths;
  • shared process/thread metadata protected by narrower locks;
  • placement policy from dispatch mechanism;
  • diagnostic counters for lock hold/spin time, migration, steals, and IPIs.

Realtime and Temporal Isolation

Linux SCHED_DEADLINE uses EDF plus Constant Bandwidth Server-style budget, deadline, and period parameters. Its key lesson for capOS is admission: deadline scheduling without bandwidth control is only a priority policy, not a guarantee.

seL4 MCS is the more capability-native precedent. CPU time is represented by scheduling-context objects. Passive servers can run on caller-donated CPU time, avoiding priority inversion across synchronous IPC. This maps directly to capOS endpoint services and direct IPC handoff.

The capOS split should remain:

  • SQE.deadline_ns: request freshness and propagation metadata;
  • SchedulingContext: spendable CPU-time authority;
  • donation: temporary transfer of CPU budget/deadline along a synchronous capability path;
  • RealtimeIsland: admitted bundle of scheduling contexts, memory/device reservations, communication paths, and overrun policy.

Tickless, Isolation, and Housekeeping

Linux NO_HZ and CPU isolation reinforce that tick suppression is not one feature. Idle tickless is a timer cleanup. Full-nohz is an isolation contract that also needs housekeeping CPUs, accounting, timer migration, deferred work placement, and revocation latency policy.

For capOS, this grounding shaped the implementation order: automatic nohz activation for the narrow single-runnable-entity window and SQPOLL-driven auto-nohz for ring-coupled leases are now implemented (Phase F), both tied to the CpuIsolationLease with housekeeping, deferred-work placement, clockevent deadline substrate, one-SQ-consumer ownership, and fail-closed rollback prerequisites satisfied first. Generic full-nohz for explicitly budgeted compute leases, timeout-based auto-revoke, and SQPOLL nohz for explicitly leased caller-thread rings have since landed. Remaining future work includes:

  • full-nohz tied to policy-service issuance and durable monitoring telemetry;
  • SQPOLL nohz beyond the current caller-thread ring-coupled lease shape;
  • realtime island nohz after admission proves unrelated work, IRQs, deferred frees, and timers are excluded or bounded.

Pluggable and User-Space Policy

Linux sched_ext and ghOSt show that fast scheduler experimentation is useful, but they also preserve privileged dispatch and enforcement. sched_ext runs BPF inside the kernel scheduler framework with fallback; ghOSt delegates policy to user-space agents while retaining kernel mechanisms for safety and preemption.

For capOS, the safe architecture is:

  • keep dispatch, budget enforcement, interrupt handling, idle, and fallback in the kernel;
  • expose policy knobs through capabilities;
  • let a privileged scheduler-policy service own admission, budget selection, CPU partitioning, isolation leases, and tuning;
  • call the policy service on configuration changes, depletion/timeout faults, and coarse placement events, not on every context switch.

Dynamic policy loading is a later experiment. It should not become the first way to make basic SMP scheduling scale.

Datacenter Runtime Schedulers

Shenango, Caladan, Shinjuku, and Arachne target microsecond-scale service latency by managing cores, preempting long request handlers, and separating fast user-level scheduling from coarser kernel control. They are useful because capOS will host services, agent runtimes, and network stacks that want low tail latency.

The shared lessons are:

  • core grants are different from CPU-time budgets;
  • user-level worker schedulers need kernel-visible blocking and preemption boundaries;
  • tail-latency policies need request-level telemetry, not only thread-level CPU shares;
  • cross-core coordination must be cheap enough that it does not dominate the service latency it tries to reduce.

For capOS this argues for scheduler hints and policy capabilities above the kernel mechanism, not a datacenter-specific kernel scheduler as the default.

Stateful Work Graphs

The stateful task/job graph proposal is related at the workload layer. A graph node can carry assignment metadata such as priority, deadline, budget, queue, and lease, and a domain coordinator can decide which node attempt is runnable inside that graph. That is not the same authority as kernel CPU scheduling.

The scheduler consequence is a clean boundary:

  • graph/node priority is domain policy until translated by an authorized scheduler policy service;
  • graph budgets reference resource profiles or scheduling contexts, but do not mint CPU time by themselves;
  • graph deadlines may create request deadlines or admission inputs, but do not bypass scheduler admission;
  • build, init, agent, and operator graph coordinators should lease work and consume scheduler primitives rather than owning a global CPU run queue;
  • scheduler telemetry should be attachable to graph runs as artifacts, so a failed or slow job can explain whether it waited on authority, CPU budget, dependency state, I/O, or policy.
  1. (Done: Phase D/E/F) Finish the current thread-scale evidence before larger policy changes. Phase D WFQ, Phase E SchedulingContext, and Phase F CpuIsolationLease / auto-nohz / SQPOLL-coupled nohz have landed.
  2. Split scheduler state into per-CPU runnable ownership and bounded cross-CPU wake/migration. (Per-CPU queues remain future work; Phase F.5.)
  3. Add precise CPU accounting and scheduler attribution before changing the default policy. (Attribution guardrails landed in Phase A; full per-CPU accounting is Phase F.5 follow-on.)
  4. Move ordinary best-effort work toward an EEVDF-like virtual-deadline policy after accounting and per-CPU queues exist. (WFQ is current; EEVDF is a follow-on evaluation deferred until per-CPU queues exist.)
  5. Keep SCHED_DEADLINE/EDF-CBS and seL4 MCS as the precedent for admitted realtime work, but express CPU authority as capOS SchedulingContext capabilities. (SchedulingContext is implemented; RealtimeIsland admission is Phase G future work.)
  6. Keep user-space scheduler policy coarse-grained and capability-authorized; do not consult a user process on every timer interrupt or dispatch.
  7. Treat SQPOLL, busy polling, and full-nohz as CPU-isolation leases with housekeeping and revocation constraints. (Ring-coupled SQPOLL nohz and generic full-nohz for explicitly budgeted compute leases are implemented; policy-service issuance remains future work.)
  8. Keep runtime schedulers above per-thread rings, futex/park/notification primitives, timers, and explicit thread objects.

The resulting target is a layered scheduler:

  • Kernel dispatch/enforcement: per-CPU queues, context switch, idle, accounting, budget enforcement, timeout faults, direct IPC donation, and cross-CPU wake/migration.
  • Kernel policy primitives: weights, virtual deadlines, scheduling contexts, CPU masks, isolation leases, and realtime-island admission hooks.
  • Userspace policy: profiles, admission, budget selection, service/runtime hints, placement, diagnostics, and policy reload.
  • Userspace runtimes: work stealing, actor queues, async reactors, service request schedulers, and language-specific M:N scheduling.

Open Questions

The questions below have been answered by Phase D/E/F implementation; they are kept for record and context:

  • Answered (Phase D): WFQ is the first virtual-time policy; EEVDF is deferred until per-CPU queues and runtime charging exist.
  • Answered (Phase D): The thread-scale milestone did not require a per-CPU queue split; WFQ on a global queue with per-thread weight sufficed.
  • Answered (Phase E): The initial SchedulingContext ABI uses SchedulingContextSpec (weight, latency class, budget, period, overrun policy) and SchedulingContextInfo for info-only read access; SchedulingContext.info() is method id 0 for stability. Donation/return through endpoints is implemented; realtime island admission is future work.
  • Answered (Phase F): CpuIsolationLease revocation interacts with session logout and process exit through lease-generation staling, which is the load-bearing rollback trigger; service replacement and process exit cleanup go through the same generation-staling path.

Remaining open questions:

  • What is the minimum per-CPU queue split that closes the full-SMP scalability milestone (Phase F.5) without prematurely designing the full fair scheduler?
  • How should policy-service issuance select and renew generic full-nohz and SQPOLL nohz leases beyond the current explicit local proofs?
  • Which scheduler telemetry belongs in the always-on kernel and which belongs behind the benchmark-only measure feature?
  • What is the right RealtimeIsland admission shape for admitted scheduling contexts, memory/device reservations, and overrun policy (Phase G)?