Proposal: Scheduler Evolution

capOS should evolve its scheduler in layers. The goal is not one clever algorithm; it is a capability-shaped CPU subsystem that scales ordinary work, admits realtime islands, allows service/runtime-specific policy, and preserves a small auditable kernel dispatch path.

This proposal complements, rather than replaces, Tickless and Realtime Scheduling. That proposal owns timer/tickless/SQPOLL-nohz details. This proposal owns the broader scheduler architecture and roadmap.

Design Grounding

Local grounding:

Goals

Keep protected dispatch, budget enforcement, interrupt handling, and idle in the kernel.
Replace the single global runnable queue with per-CPU runnable ownership and bounded cross-CPU wake/migration.
Add CPU accounting before adopting policy that depends on runtime charge.
Make ordinary best-effort scheduling fair by virtual time, with EEVDF-like virtual-deadline scheduling as the target after accounting exists.
Represent admitted CPU time as SchedulingContext capability authority.
Represent isolated CPU ownership as CpuIsolationLease authority.
Support user-space scheduler policy services for admission and tuning without putting user-space calls on every dispatch path.
Provide enough telemetry to distinguish scheduler cost, serial/MMIO logging, TLB/CR3 effects, QEMU/KVM artifacts, and workload contention.

Non-Goals

Do not import Linux CFS/EEVDF, FreeBSD ULE, or sched_ext as code.
Do not expose arbitrary user-supplied scheduler programs in the kernel in the near term.
Do not make a user-space process the mandatory next-thread dispatcher.
Do not claim hard realtime until admission, budget enforcement, IRQ/device behavior, kernel-path latency, and WCET evidence exist.
Do not make nohz/full-nohz a thread flag. It is a CPU lease plus scheduler proof.

Architecture

The target scheduler has four layers:

Kernel mechanism: per-CPU run queues, current-thread state, idle, context switch, cross-CPU wake/migration, timer/IPI handling, CPU accounting, budget enforcement, and timeout/depletion faults.
Kernel policy primitives: best-effort weights, virtual deadlines, scheduling contexts, CPU masks, isolation leases, direct IPC donation, and realtime-island hooks.
Privileged scheduler policy service: admission, budget/profile selection, CPU partitioning, isolation grants, service/runtime hints, policy reload, and operator diagnostics.
Application/runtime schedulers: work stealing, actors, async reactors, language M:N schedulers, request queues, and service-local priority and batching.

The hot path remains local and bounded: timer interrupt or wakeup, charge runtime, update runnable state, pick from a per-CPU queue or a bounded steal path, switch context. User-space policy participates at slower boundaries: profile changes, thread/process creation, budget depletion, realtime admission, lease grant/revoke, or explicit operator policy updates.

Stateful task/job graph coordinators sit above these layers. They may own graph node queues, leases, retry state, cancellation, and assignment metadata, but they do not own CPU dispatch. A graph node’s priority, deadline, budget, or queue field is workload policy until a capability-authorized scheduler policy service maps it to a weight, scheduling context, CPU lease, or request deadline.

Stage 0: Evidence Before Policy

Before changing the default policy, finish the active thread-scale attribution work:

first-pass scheduler candidate/outcome, reschedule-IPI, serial-byte, and scheduler-lock counters, now present behind CAPOS_THREAD_SCALE_GUEST_MEASURE=1 and confirmed on capos-bench;
timer interrupt counters per case, now present in the same guest-measure path;
CR3/TLB event counts per case, now present in the same guest-measure path;
guest-symbol or guest-PC samples;
cacheline/workload A/B baseline.
logging-suppression A/B evidence for per-context-switch diagnostic output.

This protects the design from treating QEMU/KVM, serial MMIO, or benchmark cache contention as a scheduler algorithm problem.

Stage 1: Per-CPU Runnable Ownership

Split the scheduler’s runnable state first. The accepted initial shape has per-CPU run queues with a runnable ThreadRef deque or priority buckets, current-thread state, a local reschedule flag, and local counters. Shared scheduler state keeps process/thread metadata, sleeping/deadline waiters, blocked waiters, migration records, and the global policy epoch.

Rules:

A runnable ThreadRef is owned by exactly one CPU queue at a time.
Cross-CPU wake enqueues to the target CPU or a policy-selected CPU and sends a bounded reschedule IPI when needed.
Migration removes from one owner before publishing to another.
Idle CPUs steal only through bounded policy, not by scanning every process.
Process exit and thread exit keep cleanup bounded and must not allocate in interrupt, cancellation, or emergency paths.

This stage may still use round-robin within each CPU queue. The objective is SMP structure and evidence, not perfect fairness.

Stage 2: CPU Accounting

Add a monotonic runtime charge model. ThreadCpuAccount records runtime, last-start time, virtual runtime, context switches, preemptions, and voluntary blocks. SchedEntity records weight, latency class, eligible time, and virtual deadline.

Accounting must be stable enough to support fair scheduling, quotas, and future scheduling contexts. It must account context switches, blocking syscalls, endpoint direct handoff, timer preemption, thread exit, and idle.

Where exact cycle attribution is not yet credible, the implementation should label the metric as diagnostic rather than enforcing policy from it.

Stage 3: Best-Effort Fair Policy

Once per-CPU queues and accounting exist, capOS should move ordinary best-effort scheduling from FIFO round-robin toward virtual-time fairness.

The target policy is EEVDF-like:

runnable entities accrue lag against their fair share;
eligible entities are ordered by virtual deadline;
weights affect virtual runtime/deadline progression;
latency-sensitive best-effort entities can request smaller slices within policy limits;
migration preserves accounting so moving CPUs does not reset fairness.

Initial implementation may use a simpler weighted fair queue if that gets runtime accounting and migration correct sooner. The proposal should not freeze an exact tree/heap representation until the per-CPU queue and accounting measurements exist.

Stage 4: Scheduling Contexts

CPU-time authority becomes a capability. SchedulingContext records budget, period, relative deadline, priority or criticality, CPU mask, remaining budget, replenishment state, timeout endpoint, and overrun policy.

Kernel responsibilities:

enforce budget and replenishment;
prevent a thread without eligible CPU authority from running;
charge runtime to exactly one authority target;
donate and return scheduling contexts across synchronous endpoint paths;
post timeout/depletion notifications without allocating in emergency paths.

Policy-service responsibilities:

admit or reject scheduling contexts;
choose budget/period/priority;
bind contexts to threads/services;
revoke or adjust contexts safely;
record operator-visible decisions.

SQE.deadline_ns remains request metadata. It may influence drop, freshness, propagation, and telemetry, but it does not grant CPU budget.

Stage 5: CPU Isolation Leases and SQPOLL

CpuIsolationLease grants placement and exclusivity, not CPU time. It records the owner process/session/service, CPU set, mode, housekeeping exclusions, accounting target, maximum revocation latency, and revoke endpoint.

Activation requires scheduler proof:

at least one housekeeping CPU remains online;
unrelated timers, deferred cleanup, network polling, and debug/watchdog work are not pinned to the isolated CPU;
the active ring has exactly one SQ consumer;
the accounting target is valid and chargeable;
revocation latency fits the lease policy.

SQPOLL uses the ring-mode contract in Tickless and Realtime Scheduling. The scheduler proposal adds the CPU-ownership and policy-service side of that contract.

Stage 6: Realtime Islands

A RealtimeIsland is an admitted graph, not a single priority. It records scheduling contexts, memory reservations, device and IRQ reservations, rings/endpoints/notifications, any CPU isolation leases, admission evidence, and overrun/shutdown policy.

Use cases include local audio, realtime voice, robotics control, and selected provider/runtime loops. Admission must fail closed if the graph cannot fit the declared period/quantum and reservations.

Stage 7: User-Space Scheduler Policy

After kernel primitives are in place, a privileged scheduler policy service can own:

default resource profiles;
session/account/service CPU policy;
scheduling-context admission;
CPU lease grant/revoke;
runtime hints such as latency-sensitive, batch, driver, poller, or agent;
operator-facing diagnostics and policy reload.

The kernel still owns emergency fallback. If the policy service is dead, blocked, stale, or malicious, the kernel must continue to enforce safety, revoke leases as policy permits, and schedule a minimal recovery path.

Validation Gates

Per-CPU queue work must preserve run-smoke, run-spawn, run-thread-scale, park/ring/process-exit smokes, and SMP smokes.
A thread-scale milestone closeout must include repeated controlled capos-bench evidence and raw logs.
CPU accounting must include sanity tests that measured runtime increases monotonically while a thread runs and stops while it is blocked.
Fair policy changes must include adversarial tests: CPU hogs, short sleepers, direct IPC handoff, multi-process load, and same-process sibling load.
Scheduling-context work must include admission rejection, budget depletion, replenishment, endpoint donation/return, timeout notification, and stale cap revocation tests.
CPU leases must include revocation, process exit, session close, and housekeeping fallback tests.
Realtime island proofs must show preallocation, no allocation/blocking on admitted paths, deadline miss telemetry, and fail-closed overrun behavior.

Open Decisions

Whether the first best-effort fair policy should be weighted fair queueing or direct EEVDF.
Whether scheduling-context priority is a scalar, a criticality band, or both.
Whether SchedulingContext should be bindable to a process default, individual thread, endpoint call path, or all three in the first ABI.
Which scheduler telemetry is permanent ABI and which is benchmark-only.
How much policy-service state belongs in the boot manifest versus mutable operator configuration.

Keyboard shortcuts

capOS Documentation