Scheduler Evolution Backlog

This backlog decomposes the future scheduler architecture from Scheduler Evolution. It does not replace the selected In-Process Threading Scalability milestone; use it once work moves from the current thread-scale triage into scheduler architecture.

Design Grounding Checklist

Before implementation slices, read:

  • docs/architecture/scheduling.md
  • docs/backlog/smp-phase-c.md
  • docs/proposals/smp-proposal.md
  • docs/proposals/ring-v2-smp-proposal.md
  • docs/proposals/tickless-realtime-scheduling-proposal.md
  • docs/proposals/stateful-task-job-graphs-proposal.md
  • docs/proposals/scheduler-evolution-proposal.md
  • docs/research/future-scheduler-architecture.md
  • docs/research/nohz-sqpoll-realtime.md
  • docs/research/out-of-kernel-scheduling.md
  • docs/research/completion-ring-threading.md

For realtime or isolation slices, also read:

  • docs/research/multimedia-pipeline-latency.md
  • docs/research/robotics-realtime-control.md
  • docs/research/x2apic-and-virtualization.md

Phase A: Attribution and Guardrails

  • Finish thread-scale attribution. First-pass scheduler candidate/outcome, reschedule-IPI, serial-byte, and scheduler-lock counters exist behind CAPOS_THREAD_SCALE_GUEST_MEASURE=1 and have been confirmed on capos-bench; timer-interrupt and CR3/TLB counters exist in the same guest-measure path. Remaining slices: guest-symbol or guest-PC samples, workload/cacheline A/B evidence, and logging-suppression A/B evidence (a counter-guard sketch follows this list).
  • Add a benchmark-kernel mode that suppresses per-context-switch logging during measured cases so serial MMIO cannot masquerade as scheduler cost.
  • Decide which counters become permanent observability and which stay behind the guest-measure flag.
  • Record controlled capos-bench evidence before and after each scheduler structure change.
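
The sketch below is illustration only: it assumes the CAPOS_THREAD_SCALE_GUEST_MEASURE=1 build flag surfaces as a Cargo feature (here called guest_measure), and the type and counter names are invented rather than the kernel's actual symbols. It shows the guard shape that keeps measured counters out of default builds entirely.

```rust
// Sketch only: `guest_measure` is an assumed feature name standing in for
// the CAPOS_THREAD_SCALE_GUEST_MEASURE=1 build flag.
#[cfg(feature = "guest_measure")]
use core::sync::atomic::{AtomicU64, Ordering};

/// One attribution counter. In non-measure builds it is a zero-sized type and
/// `bump` compiles to nothing, so the hot path pays no cost by default.
#[cfg(feature = "guest_measure")]
pub struct MeasureCounter(AtomicU64);
#[cfg(not(feature = "guest_measure"))]
pub struct MeasureCounter;

#[cfg(feature = "guest_measure")]
impl MeasureCounter {
    pub const fn new() -> Self { Self(AtomicU64::new(0)) }
    #[inline(always)]
    pub fn bump(&self) { self.0.fetch_add(1, Ordering::Relaxed); }
    pub fn read(&self) -> u64 { self.0.load(Ordering::Relaxed) }
}

#[cfg(not(feature = "guest_measure"))]
impl MeasureCounter {
    pub const fn new() -> Self { Self }
    #[inline(always)]
    pub fn bump(&self) {}
    pub fn read(&self) -> u64 { 0 }
}

// Example charge point: one static per attributed event class.
pub static RESCHED_IPI_SENT: MeasureCounter = MeasureCounter::new();
```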

Phase B: Per-CPU Runnable Ownership

  • Define PerCpuRunQueue ownership invariants: one runnable owner per live generation-checked ThreadRef, no duplicate runnable placement, explicit migration state, and a bounded steal path (see the state sketch after this list).
  • Split current-thread and runnable ownership from shared process/thread metadata without widening emergency-path allocation.
  • Add cross-CPU wake policy for endpoint, timer, park, process wait, and thread join completions.
  • Add bounded reschedule IPI behavior for idle-to-runnable transitions.
  • Preserve direct IPC handoff as a scheduling preference without bypassing per-CPU ownership or generation checks.
  • Prove process/thread exit cleanup cannot leave a stale runnable entry on any CPU queue.
  • Rerun make run-thread-scale, make run-smp2-smokes, the ordinary smoke suite, and the spawn/thread, park, ring, and process-exit focused proofs.
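
A minimal Rust sketch of the ownership invariants above, assuming the generation-checked ThreadRef this backlog already describes. All type, field, and function names are illustrative; the real queue would sit behind the per-CPU scheduler lock rather than take `&mut` references.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
pub struct ThreadRef { pub slot: u32, pub generation: u32 }

/// Where a thread's runnable ownership currently lives. Exactly one CPU (or
/// the in-flight migration state) may own a runnable thread at a time.
pub enum RunnableState {
    /// Owned by this CPU's run queue; eligible for dispatch there only.
    OwnedBy { cpu: u16 },
    /// Moving between CPUs; neither queue may dispatch it until the
    /// destination completes the handoff.
    Migrating { from: u16, to: u16 },
    /// Blocked or exited; present on no run queue.
    NotRunnable,
}

pub struct PerCpuRunQueue {
    pub cpu: u16,
    /// Generation-checked refs; each is re-validated against the thread
    /// table before dispatch, so stale generations fail closed.
    pub runnable: Vec<ThreadRef>,
}

impl PerCpuRunQueue {
    /// Bounded steal: examine at most `budget` candidates so the steal path
    /// cannot scan a victim queue unboundedly. `still_valid` stands in for
    /// the generation re-check against the thread table.
    pub fn steal_from(
        &mut self,
        victim: &mut PerCpuRunQueue,
        budget: usize,
        still_valid: impl Fn(ThreadRef) -> bool,
    ) -> Option<ThreadRef> {
        for _ in 0..budget {
            let t = victim.runnable.pop()?;
            if still_valid(t) {
                // Real kernel: flip the thread's RunnableState to
                // Migrating { from: victim.cpu, to: self.cpu } before enqueue.
                self.runnable.push(t);
                return Some(t);
            }
            // Stale generation: drop the entry; ownership fails closed.
        }
        None
    }
}
```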

Phase C: CPU Accounting

  • Add monotonic runtime charge points at context switch, preemption, blocking syscall, direct IPC handoff, unblock, and thread exit (a ledger sketch follows this list).
  • Track per-thread runtime, virtual runtime seed, context switches, preemptions, voluntary blocks, and migrations.
  • Add process/session/service aggregation only after the per-thread record has a single ledger of record.
  • Add tests or QEMU diagnostics proving runtime increases while running and stops while blocked.
  • Keep runtime accounting independent of tickless idle by using the monotonic clocksource layer.
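
A minimal per-thread ledger sketch for the charge points above, assuming a monotonic nanosecond clocksource feeds `now_ns`; all names are illustrative.

```rust
#[derive(Default)]
pub struct ThreadAccounting {
    pub runtime_ns: u64,       // total on-CPU time
    pub vruntime_seed_ns: u64, // virtual runtime seed when made runnable
    pub ctx_switches: u64,
    pub preemptions: u64,
    pub voluntary_blocks: u64,
    pub migrations: u64,
    last_dispatch_ns: Option<u64>, // open interval while on-CPU
}

impl ThreadAccounting {
    /// Charge point on dispatch: remember when this thread went on-CPU.
    pub fn on_dispatch(&mut self, now_ns: u64) {
        self.last_dispatch_ns = Some(now_ns);
        self.ctx_switches += 1;
    }

    /// Charge point on any off-CPU transition (preemption, blocking syscall,
    /// direct IPC handoff, exit): close the interval against the clock.
    pub fn on_descheduled(&mut self, now_ns: u64, preempted: bool) {
        if let Some(start) = self.last_dispatch_ns.take() {
            self.runtime_ns += now_ns.saturating_sub(start);
        }
        if preempted { self.preemptions += 1 } else { self.voluntary_blocks += 1 }
    }
}
```

Runtime only advances between a dispatch charge and the matching descheduling charge, so a blocked thread's runtime is frozen, which is exactly what the diagnostic in the fourth bullet should observe.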

Phase D: Best-Effort Fair Scheduling

  • Choose an initial weighted-fair or EEVDF-like policy based on accounting and queue data (a virtual-runtime example follows this list).
  • Add scheduler entity weights and latency class metadata through a capability-authorized policy path, not ambient process fields.
  • Preserve fairness across CPU migration.
  • Test CPU hogs, short sleepers, direct IPC server/client pairs, multi-process load, and same-process sibling load.
  • Define overload behavior when runnable entities exceed the selected CPU set or when migration cannot keep up.
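
For orientation, a weighted virtual-runtime charge in the CFS/EEVDF family fits in a few lines. This is a sketch, not the selected policy; the base weight of 1024 is borrowed from Linux purely for illustration.

```rust
const BASE_WEIGHT: u64 = 1024; // illustrative base, as in Linux CFS

/// Advance a thread's virtual runtime by `delta_ns` of real on-CPU time.
/// Heavier weights accumulate vruntime more slowly, so a "pick the smallest
/// vruntime" dispatch rule gives them proportionally more CPU.
fn charge_vruntime(vruntime_ns: u64, delta_ns: u64, weight: u64) -> u64 {
    vruntime_ns + delta_ns * BASE_WEIGHT / weight.max(1)
}

fn main() {
    // A weight-2048 thread accrues vruntime half as fast as a weight-1024 one.
    assert_eq!(charge_vruntime(0, 1_000_000, 2048), 500_000);
    assert_eq!(charge_vruntime(0, 1_000_000, 1024), 1_000_000);
}
```

One common way to preserve fairness across migration is to normalize a migrating thread's vruntime against the destination queue's minimum, so it neither starves nor dominates on arrival; whether that fits here is a design decision for this phase.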

Phase E: SchedulingContext Capability

  • Define the first SchedulingContext object shape: budget, period, relative deadline, CPU mask, replenishment state, timeout endpoint, and overrun policy (a layout sketch follows this list).
  • Add capability creation/bind/revoke rules and generation identity.
  • Enforce budget and replenishment in the kernel dispatcher.
  • Add endpoint donation/return semantics for synchronous calls and passive services.
  • Add timeout/depletion notifications with preallocated emergency-path storage.
  • Prove stale scheduling contexts fail closed after revoke, process exit, and session logout.
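
A sketch of one possible SchedulingContext layout and its dispatcher-side budget charge, following the field list in the first bullet; field names, types, and the single-step replenishment are assumptions for illustration.

```rust
#[derive(Clone, Copy)]
pub struct EndpointRef { pub slot: u32, pub generation: u32 }

pub enum OverrunPolicy {
    /// Stop dispatching until the next replenishment.
    ThrottleUntilReplenish,
    /// Keep running at best-effort weight after depletion.
    DemoteToBestEffort,
    /// Post a depletion notification to the timeout endpoint.
    Notify,
}

pub struct SchedulingContext {
    pub budget_ns: u64,            // CPU time granted per period
    pub period_ns: u64,
    pub relative_deadline_ns: u64, // deadline measured from period start
    pub cpu_mask: u64,             // CPUs this context may run on
    pub replenish_at_ns: u64,      // next budget refill (replenishment state)
    pub remaining_ns: u64,         // budget left in the current period
    pub timeout_endpoint: Option<EndpointRef>,
    pub overrun: OverrunPolicy,
    pub generation: u32,           // capability generation identity
}

impl SchedulingContext {
    /// Dispatcher charge after the context ran for `ran_ns`. Single-step
    /// replenishment is a simplification for the sketch.
    pub fn charge(&mut self, ran_ns: u64, now_ns: u64) -> bool {
        if now_ns >= self.replenish_at_ns {
            self.remaining_ns = self.budget_ns;
            self.replenish_at_ns += self.period_ns;
        }
        self.remaining_ns = self.remaining_ns.saturating_sub(ran_ns);
        self.remaining_ns > 0 // false => depleted, apply `overrun`
    }
}
```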

Phase F: CPU Isolation Lease and SQPOLL

  • Define CpuIsolationLease authority separately from CPU-time budget (a lease sketch follows this list).
  • Add scheduler activation proof for housekeeping, deferred cleanup, timers, networking, IRQ affinity, accounting target, and revocation latency.
  • Integrate SQPOLL ring mode only after one-SQ-consumer ring ownership is enforced.
  • Add lease revocation on explicit revoke, process exit, service replacement, and session close.
  • Add nohz activation/deactivation telemetry.
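
A small sketch of lease state kept separate from any CPU-time budget, per the first bullet; every name here is an assumption. The housekeeping_cpu field only illustrates that revocation must know what to restore.

```rust
#[derive(Clone, Copy)]
pub struct ProcessRef { pub slot: u32, pub generation: u32 }

pub enum RevokeReason { Explicit, ProcessExit, ServiceReplaced, SessionClosed }

/// Authority to monopolize a CPU; carries no CPU-time budget of its own.
pub struct CpuIsolationLease {
    pub cpu: u16,
    pub holder: ProcessRef,
    /// Where housekeeping (timers, deferred cleanup, IRQs) moved while the
    /// lease is active, so revocation knows what to restore.
    pub housekeeping_cpu: u16,
    pub revoke_reason: Option<RevokeReason>,
}

impl CpuIsolationLease {
    /// Revocation is unconditional: record why, then the scheduler restores
    /// housekeeping to `housekeeping_cpu` and re-enables ticks on `cpu`.
    /// Revocation latency is one of the telemetry outputs.
    pub fn revoke(&mut self, reason: RevokeReason) {
        self.revoke_reason = Some(reason);
    }
}
```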

Phase G: Realtime Islands

  • Define RealtimeIsland admission inputs: scheduling contexts, memory reservations, device/IRQ reservations, communication paths, CPU leases, and overrun policy.
  • Add a small local-audio or synthetic periodic-control proof before robotics or provider workloads (see the loop sketch after this list).
  • Prove no allocation, blocking endpoint call, paging, or logging on the admitted realtime path.
  • Record deadline misses and overrun handling as observable output.
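
A user-level sketch of the synthetic periodic-control proof, using std timing purely for illustration; the admitted realtime version would run on the isolated path with no allocation, blocking call, or logging inside the loop, and would surface misses through the island's observable output.

```rust
use std::time::{Duration, Instant};

fn main() {
    let period = Duration::from_millis(1);
    let deadline = Duration::from_micros(500); // relative deadline in the period
    let mut misses = 0u32;
    let mut next = Instant::now();

    for _ in 0..1_000 {
        next += period; // absolute release time of the next iteration
        let released = Instant::now();
        // Bounded synthetic work stands in for the control computation.
        std::hint::black_box((0..100u64).sum::<u64>());
        if released.elapsed() > deadline {
            misses += 1; // recorded, not hidden: misses are the proof's output
        }
        while Instant::now() < next {} // spin to the next release, no blocking
    }
    println!("deadline misses: {misses}/1000");
}
```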

Phase H: Policy Service

  • Define a privileged scheduler policy service interface for admission, budget/profile updates, CPU lease grant/revoke, and diagnostics (an interface sketch follows this list).
  • Keep kernel fallback scheduling independent of policy-service liveness.
  • Add manifest/config hooks for default profiles without making policy changes require kernel rebuilds.
  • Add operator diagnostics that explain why a thread or island was denied, throttled, migrated, or revoked.
  • Define how stateful task/job graph assignment metadata maps into scheduler policy inputs: graph priority to weight/latency class, graph deadline to request freshness or admission input, graph budget to SchedulingContext reference, and graph queue to policy-service placement. The graph coordinator must not mint CPU authority by itself.
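
A sketch of the policy-service surface as a Rust trait; trait, type, and method names are assumptions. Because the kernel keeps a fallback scheduler, every method here is advisory configuration rather than a liveness dependency.

```rust
#[derive(Clone, Copy)]
pub struct ThreadRef { pub slot: u32, pub generation: u32 }
#[derive(Clone, Copy)]
pub struct ProcessRef { pub slot: u32, pub generation: u32 }
#[derive(Clone, Copy)]
pub struct LeaseRef { pub id: u64 }

/// Bundle of admission inputs (scheduling context, reservations,
/// communication paths); fields elided in this sketch.
pub struct AdmissionRequest {}

pub struct Profile { pub weight: u64, pub latency_class: u8 }

pub enum DenyReason { OverCommitted, NoAuthority, InvalidGeneration }

pub enum PolicyEvent {
    Denied(DenyReason),
    Throttled,
    Migrated { to: u16 },
    Revoked,
}

pub trait SchedulerPolicyService {
    /// Admission: a denial carries a reason the diagnostics can explain.
    fn admit(&mut self, req: AdmissionRequest) -> Result<(), DenyReason>;
    fn update_profile(&mut self, thread: ThreadRef, profile: Profile)
        -> Result<(), DenyReason>;
    fn grant_cpu_lease(&mut self, cpu: u16, holder: ProcessRef)
        -> Result<LeaseRef, DenyReason>;
    fn revoke_cpu_lease(&mut self, lease: LeaseRef);
    /// Operator diagnostics: why was this entity denied, throttled,
    /// migrated, or revoked?
    fn explain(&self, subject: ThreadRef) -> Vec<PolicyEvent>;
}
```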