Scheduler Evolution Backlog
This backlog decomposes the future scheduler architecture from Scheduler Evolution. It does not replace the selected In-Process Threading Scalability milestone; use it when work moves from current thread-scale triage into scheduler architecture.
Design Grounding Checklist
Before implementation slices, read:
- docs/architecture/scheduling.md
- docs/backlog/smp-phase-c.md
- docs/proposals/smp-proposal.md
- docs/proposals/ring-v2-smp-proposal.md
- docs/proposals/tickless-realtime-scheduling-proposal.md
- docs/proposals/stateful-task-job-graphs-proposal.md
- docs/proposals/scheduler-evolution-proposal.md
- docs/research/future-scheduler-architecture.md
- docs/research/nohz-sqpoll-realtime.md
- docs/research/out-of-kernel-scheduling.md
- docs/research/completion-ring-threading.md
For realtime or isolation slices, also read:
- docs/research/multimedia-pipeline-latency.md
- docs/research/robotics-realtime-control.md
- docs/research/x2apic-and-virtualization.md
Phase A: Attribution and Guardrails
- Finish thread-scale attribution. First-pass scheduler candidate/outcome, reschedule-IPI, serial-byte, and scheduler-lock counters now exist behind `CAPOS_THREAD_SCALE_GUEST_MEASURE=1` and have been confirmed on `capos-bench`, and timer interrupt counters now exist in the same guest-measure path. First-pass CR3/TLB counters now exist in the same guest-measure path; remaining slices are guest-symbol or guest-PC samples, workload/cacheline A/B evidence, and logging-suppression A/B evidence.
- Add a benchmark-kernel mode that suppresses per-context-switch logging during measured cases so serial MMIO cannot masquerade as scheduler cost.
- Decide which counters are permanent observability and which stay behind `measure`.
- Record controlled `capos-bench` evidence before and after each scheduler structure change.
Phase B: Per-CPU Runnable Ownership
- Define `PerCpuRunQueue` ownership invariants: one runnable owner per live generation-checked `ThreadRef`, no duplicate runnable placement, explicit migration state, and bounded steal path.
- Split current-thread and runnable ownership from shared process/thread metadata without widening emergency-path allocation.
- Add cross-CPU wake policy for endpoint, timer, park, process wait, and thread join completions.
- Add bounded reschedule IPI behavior for idle-to-runnable transitions.
- Preserve direct IPC handoff as a scheduling preference without bypassing per-CPU ownership or generation checks.
- Prove process/thread exit cleanup cannot leave a stale runnable entry on any CPU queue.
- Rerun `make run-thread-scale`, `make run-smp2-smokes`, ordinary smoke, spawn/thread, park, ring, and process-exit focused proofs.
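The ownership invariants in this phase can be explored in a small executable model before touching kernel code. The sketch below is only an illustration in Python: `PerCpuRunQueues`, its methods, the `steal_limit` parameter, and the slot/generation layout of `ThreadRef` are all assumptions made for the example, not the kernel's actual types.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadRef:
    """Hypothetical generation-checked handle: slot index plus generation."""
    slot: int
    generation: int


class PerCpuRunQueues:
    """Toy model: a runnable ThreadRef is owned by at most one CPU queue."""

    def __init__(self, cpus: int, steal_limit: int = 1):
        self.queues = [deque() for _ in range(cpus)]
        self.owner = {}       # ThreadRef -> index of the CPU queue holding it
        self.last_gen = {}    # slot -> last generation ever issued
        self.live = set()     # ThreadRefs that have not exited
        self.steal_limit = steal_limit

    def spawn(self, slot: int) -> ThreadRef:
        # Generations only grow, so a reused slot never matches an old handle.
        gen = self.last_gen.get(slot, 0) + 1
        self.last_gen[slot] = gen
        t = ThreadRef(slot, gen)
        self.live.add(t)
        return t

    def enqueue(self, cpu: int, t: ThreadRef) -> bool:
        # Reject stale handles and duplicate runnable placement.
        if t not in self.live or t in self.owner:
            return False
        self.queues[cpu].append(t)
        self.owner[t] = cpu
        return True

    def pick(self, cpu: int):
        # The picked thread leaves the runnable set and becomes current.
        if not self.queues[cpu]:
            return None
        t = self.queues[cpu].popleft()
        del self.owner[t]
        return t

    def steal(self, thief: int, victim: int) -> int:
        # Bounded steal: move at most steal_limit entries from the tail.
        moved = 0
        while moved < self.steal_limit and self.queues[victim]:
            t = self.queues[victim].pop()
            self.queues[thief].append(t)
            self.owner[t] = thief
            moved += 1
        return moved

    def exit_thread(self, t: ThreadRef) -> None:
        # Exit cleanup: drop any runnable entry, then retire the handle.
        cpu = self.owner.pop(t, None)
        if cpu is not None:
            self.queues[cpu].remove(t)
        self.live.discard(t)
```

A model like this makes the "no stale runnable entry after exit" proof obligation concrete: after `exit_thread`, no queue contains the handle and a later `enqueue` with it is refused.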
Phase C: CPU Accounting
- Add monotonic runtime charge points at context switch, preemption, blocking syscall, direct IPC handoff, unblock, and thread exit.
- Track per-thread runtime, virtual runtime seed, context switches, preemptions, voluntary blocks, and migrations.
- Add process/session/service aggregation only after the per-thread record has a single ledger of record.
- Add tests or QEMU diagnostics proving runtime increases while running and stops while blocked.
- Keep runtime accounting independent of tickless idle by using the monotonic clocksource layer.
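As a rough illustration of the charge points above, the following Python sketch models a per-thread ledger that accrues runtime only between switch-in and the next charge point. The class name, field names, and nanosecond units are assumptions for the example; timestamps are passed in explicitly to stand in for the monotonic clocksource layer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThreadAccounting:
    """Hypothetical per-thread ledger; times are monotonic nanoseconds."""
    runtime_ns: int = 0
    context_switches: int = 0
    preemptions: int = 0
    voluntary_blocks: int = 0
    migrations: int = 0
    running_since: Optional[int] = None  # None while the thread is off-CPU
    last_cpu: Optional[int] = None

    def switch_in(self, now_ns: int, cpu: int) -> None:
        # A switch-in on a different CPU than last time counts as a migration.
        if self.last_cpu is not None and self.last_cpu != cpu:
            self.migrations += 1
        self.last_cpu = cpu
        self.running_since = now_ns
        self.context_switches += 1

    def _charge(self, now_ns: int) -> None:
        # Charge the elapsed on-CPU interval exactly once.
        if self.running_since is not None:
            self.runtime_ns += now_ns - self.running_since
            self.running_since = None

    def preempt(self, now_ns: int) -> None:
        self._charge(now_ns)
        self.preemptions += 1

    def block(self, now_ns: int) -> None:
        self._charge(now_ns)
        self.voluntary_blocks += 1
```

Because the ledger reads only the timestamps it is handed, nothing in it depends on a periodic tick, and time that passes while the thread is blocked is never charged.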
Phase D: Best-Effort Fair Scheduling
- Choose initial weighted-fair or EEVDF-like policy based on accounting and queue data.
- Add scheduler entity weights and latency class metadata through a capability-authorized policy path, not ambient process fields.
- Preserve fairness across CPU migration.
- Test CPU hogs, short sleepers, direct IPC server/client pairs, multi-process load, and same-process sibling load.
- Define overload behavior when runnable entities exceed the selected CPU set or when migration cannot keep up.
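For orientation, a weighted-fair policy can be modelled in a few lines. The Python sketch below is a plain weighted virtual-runtime picker, not EEVDF and not a proposed capOS design; `BASE_WEIGHT`, the `Entity` fields, and the name-based tie-break are assumptions chosen to keep the example deterministic.

```python
from dataclasses import dataclass

BASE_WEIGHT = 1024  # assumed reference weight for a default entity


@dataclass
class Entity:
    name: str
    weight: int
    vruntime: int = 0


def charge(e: Entity, ran_ns: int) -> None:
    # Heavier entities accumulate virtual time more slowly.
    e.vruntime += ran_ns * BASE_WEIGHT // e.weight


def pick_next(runnable):
    # Smallest virtual runtime wins; ties are broken by name.
    return min(runnable, key=lambda e: (e.vruntime, e.name))


def simulate(entities, slice_ns: int, slices: int):
    """Run fixed-length slices and return real nanoseconds run per entity."""
    ran = {e.name: 0 for e in entities}
    for _ in range(slices):
        e = pick_next(entities)
        charge(e, slice_ns)
        ran[e.name] += slice_ns
    return ran
```

With two always-runnable entities of weight 2048 and 1024, the simulation gives the heavier one twice the CPU time. Sleepers, IPC pairs, and migration need more state than this model carries, which is why the test matrix above exists.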
Phase E: SchedulingContext Capability
- Define the first `SchedulingContext` object shape: budget, period, relative deadline, CPU mask, replenishment state, timeout endpoint, and overrun policy.
- Add capability creation/bind/revoke rules and generation identity.
- Enforce budget and replenishment in the kernel dispatcher.
- Add endpoint donation/return semantics for synchronous calls and passive services.
- Add timeout/depletion notifications with preallocated emergency-path storage.
- Prove stale scheduling contexts fail closed after revoke, process exit, and session logout.
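The budget, replenishment, and fail-closed items above can be sketched as a small state machine. This Python model is a simplification under stated assumptions: it covers only budget, period, and generation (no relative deadline, CPU mask, timeout endpoint, or donation), the context is created at time 0, and all names and units are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SchedulingContext:
    """Hypothetical shape; field names and nanosecond units are assumptions."""
    budget_ns: int
    period_ns: int
    generation: int = 1
    revoked: bool = False
    remaining_ns: int = field(init=False, default=0)
    next_replenish_ns: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # Start with a full budget; the first refill is one period after t=0.
        self.remaining_ns = self.budget_ns
        self.next_replenish_ns = self.period_ns

    def replenish(self, now_ns: int) -> None:
        # Refill at period boundaries; skipped periods do not accumulate.
        if now_ns >= self.next_replenish_ns:
            periods = (now_ns - self.next_replenish_ns) // self.period_ns + 1
            self.next_replenish_ns += periods * self.period_ns
            self.remaining_ns = self.budget_ns

    def consume(self, now_ns: int, handle_generation: int, want_ns: int) -> int:
        """Return nanoseconds granted; 0 means depleted, revoked, or stale."""
        if self.revoked or handle_generation != self.generation:
            return 0  # fail closed
        self.replenish(now_ns)
        granted = min(want_ns, self.remaining_ns)
        self.remaining_ns -= granted
        return granted

    def revoke(self) -> None:
        # Bump the generation so handles minted before the revoke go stale.
        self.revoked = True
        self.generation += 1
```

A dispatcher built on this model would treat a grant of 0 as "do not run", which is the fail-closed behavior the last item asks to be proven after revoke, process exit, and session logout.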
Phase F: CPU Isolation Lease and SQPOLL
- Define `CpuIsolationLease` authority separately from CPU-time budget.
- Add scheduler activation proof for housekeeping, deferred cleanup, timers, networking, IRQ affinity, accounting target, and revocation latency.
- Integrate SQPOLL ring mode only after one-SQ-consumer ring ownership is enforced.
- Add lease revocation on explicit revoke, process exit, service replacement, and session close.
- Add nohz activation/deactivation telemetry.
Phase G: Realtime Islands
- Define `RealtimeIsland` admission inputs: scheduling contexts, memory reservations, device/IRQ reservations, communication paths, CPU leases, and overrun policy.
- Add a small local-audio or synthetic periodic-control proof before robotics or provider workloads.
- Prove no allocation, blocking endpoint call, paging, or logging on the admitted realtime path.
- Record deadline misses and overrun handling as observable output.
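One way to picture admission over the inputs listed above is a checker that returns every reason for denial rather than a bare yes/no. The Python sketch below is hypothetical throughout: the `IslandRequest` fields, the reason strings, and the utilization test are assumptions, and the utilization test is only a necessary condition, not a schedulability proof.

```python
from dataclasses import dataclass, field
from fractions import Fraction


@dataclass
class IslandRequest:
    """Hypothetical admission request; every field name is an assumption."""
    scheduling_contexts: list = field(default_factory=list)  # (budget_ns, period_ns)
    memory_reserved: bool = False
    irqs_reserved: bool = False
    paths_preallocated: bool = False
    cpu_leases: int = 0
    overrun_policy: str = ""


def admit(req: IslandRequest):
    """Return (admitted, reasons); reasons is empty when admitted."""
    reasons = []
    if not req.scheduling_contexts:
        reasons.append("no scheduling contexts")
    if not req.memory_reserved:
        reasons.append("memory not reserved")
    if not req.irqs_reserved:
        reasons.append("device/IRQ not reserved")
    if not req.paths_preallocated:
        reasons.append("communication paths not preallocated")
    if req.cpu_leases < 1:
        reasons.append("no CPU lease")
    if not req.overrun_policy:
        reasons.append("no overrun policy")
    # Assumed necessary (not sufficient) check: total budget/period must fit
    # within the leased CPUs. Fractions avoid floating-point rounding.
    utilization = sum(Fraction(b, p) for b, p in req.scheduling_contexts)
    if utilization > req.cpu_leases:
        reasons.append(
            f"utilization {utilization} exceeds {req.cpu_leases} leased CPUs"
        )
    return (not reasons, reasons)
```

Collecting reasons instead of stopping at the first failure keeps denials explainable, so a rejected island can be reported with the full list of unmet inputs.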
Phase H: Policy Service
- Define a privileged scheduler policy service interface for admission, budget/profile updates, CPU lease grant/revoke, and diagnostics.
- Keep kernel fallback scheduling independent of policy-service liveness.
- Add manifest/config hooks for default profiles without making policy changes require kernel rebuilds.
- Add operator diagnostics that explain why a thread or island was denied, throttled, migrated, or revoked.
- Define how stateful task/job graph assignment metadata maps into scheduler policy inputs: graph priority to weight/latency class, graph deadline to request freshness or admission input, graph budget to `SchedulingContext` reference, and graph queue to policy-service placement. The graph coordinator must not mint CPU authority by itself.
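The mapping in the last item can be sketched as a pure translation function. Everything named here is an assumption for illustration: the priority labels, the weight and latency-class tables, and the `GraphAssignment` and `PolicyInput` types are not defined anywhere in this backlog. The one rule taken from the text is that the translation may only pass through a reference to a `SchedulingContext` that was already granted; it never creates one.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed priority names and values; the backlog does not define them.
PRIORITY_TO_WEIGHT = {"low": 512, "normal": 1024, "high": 2048}
PRIORITY_TO_LATENCY_CLASS = {
    "low": "batch",
    "normal": "default",
    "high": "interactive",
}


@dataclass(frozen=True)
class GraphAssignment:
    priority: str
    deadline_ns: Optional[int]
    budget_ref: Optional[int]  # id of an existing SchedulingContext capability
    queue: str


@dataclass(frozen=True)
class PolicyInput:
    weight: int
    latency_class: str
    freshness_deadline_ns: Optional[int]
    scheduling_context_ref: Optional[int]
    placement_queue: str


def to_policy_input(a: GraphAssignment, granted_refs: frozenset) -> PolicyInput:
    """Translate graph metadata; never create a SchedulingContext here."""
    if a.budget_ref is not None and a.budget_ref not in granted_refs:
        raise PermissionError("graph names a SchedulingContext it was not granted")
    return PolicyInput(
        weight=PRIORITY_TO_WEIGHT[a.priority],
        latency_class=PRIORITY_TO_LATENCY_CLASS[a.priority],
        freshness_deadline_ns=a.deadline_ns,
        scheduling_context_ref=a.budget_ref,
        placement_queue=a.queue,
    )
```

Keeping the function free of side effects means the graph coordinator can only describe what it wants; the policy service still decides whether the referenced budget is actually bound.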