Scheduler Evolution Backlog

This backlog decomposes the future scheduler architecture from Scheduler Evolution. It does not replace the selected In-Process Threading Scalability milestone; use it once work moves from the current thread-scale triage into scheduler architecture.

Design Grounding Checklist

Before implementation slices, read:

  • docs/architecture/scheduling.md
  • docs/backlog/smp-phase-c.md
  • docs/proposals/smp-proposal.md
  • docs/proposals/ring-v2-smp-proposal.md
  • docs/proposals/tickless-realtime-scheduling-proposal.md
  • docs/proposals/stateful-task-job-graphs-proposal.md
  • docs/proposals/scheduler-evolution-proposal.md
  • docs/research/future-scheduler-architecture.md
  • docs/research/nohz-sqpoll-realtime.md
  • docs/research/out-of-kernel-scheduling.md
  • docs/research/completion-ring-threading.md

For realtime or isolation slices, also read:

  • docs/research/multimedia-pipeline-latency.md
  • docs/research/robotics-realtime-control.md
  • docs/research/x2apic-and-virtualization.md

Phase A: Attribution and Guardrails

  • Finish thread-scale attribution. First-pass scheduler candidate/outcome, reschedule-IPI, serial-byte, and scheduler-lock counters exist behind CAPOS_THREAD_SCALE_GUEST_MEASURE=1 and have been confirmed on capos-bench; timer-interrupt and CR3/TLB counters exist in the same guest-measure path. Remaining slices: guest-symbol or guest-PC samples, workload/cacheline A/B evidence, and logging-suppression A/B evidence (a counter-guard sketch follows this list).
  • Add a benchmark-kernel mode that suppresses per-context-switch logging during measured cases so serial MMIO cannot masquerade as scheduler cost.
  • Decide which counters become permanent observability and which stay behind the guest-measure flag.
  • Record controlled capos-bench evidence before and after each scheduler structure change.
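
The sketch below is illustration only: it assumes the CAPOS_THREAD_SCALE_GUEST_MEASURE=1 build flag surfaces as a Cargo feature (here called guest_measure), and the type and counter names are invented rather than the kernel's actual symbols. It shows the guard shape that keeps measured counters out of default builds entirely.

```rust
// Sketch only: `guest_measure` is an assumed feature name standing in for
// the CAPOS_THREAD_SCALE_GUEST_MEASURE=1 build flag.
#[cfg(feature = "guest_measure")]
use core::sync::atomic::{AtomicU64, Ordering};

/// One attribution counter. In non-measure builds it is a zero-sized type and
/// `bump` compiles to nothing, so the hot path pays no cost by default.
#[cfg(feature = "guest_measure")]
pub struct MeasureCounter(AtomicU64);
#[cfg(not(feature = "guest_measure"))]
pub struct MeasureCounter;

#[cfg(feature = "guest_measure")]
impl MeasureCounter {
    pub const fn new() -> Self { Self(AtomicU64::new(0)) }
    #[inline(always)]
    pub fn bump(&self) { self.0.fetch_add(1, Ordering::Relaxed); }
    pub fn read(&self) -> u64 { self.0.load(Ordering::Relaxed) }
}

#[cfg(not(feature = "guest_measure"))]
impl MeasureCounter {
    pub const fn new() -> Self { Self }
    #[inline(always)]
    pub fn bump(&self) {}
    pub fn read(&self) -> u64 { 0 }
}

// Example charge point: one static per attributed event class.
pub static RESCHED_IPI_SENT: MeasureCounter = MeasureCounter::new();
```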

Phase B: Per-CPU Runnable Ownership

  • Define PerCpuRunQueue ownership invariants: one runnable owner per live generation-checked ThreadRef, no duplicate runnable placement, explicit migration state, and a bounded steal path (see the state sketch after this list).
  • Split current-thread and runnable ownership from shared process/thread metadata without widening emergency-path allocation.
  • Add cross-CPU wake policy for endpoint, timer, park, process wait, and thread join completions.
  • Add bounded reschedule IPI behavior for idle-to-runnable transitions.
  • Preserve direct IPC handoff as a scheduling preference without bypassing per-CPU ownership or generation checks.
  • Prove process/thread exit cleanup cannot leave a stale runnable entry on any CPU queue.
  • Rerun make run-thread-scale, make run-smp2-smokes, the ordinary smoke suite, and the spawn/thread, park, ring, and process-exit focused proofs.
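
A minimal Rust sketch of the ownership invariants above, assuming the generation-checked ThreadRef this backlog already describes. All type, field, and function names are illustrative; the real queue would sit behind the per-CPU scheduler lock rather than take `&mut` references.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
pub struct ThreadRef { pub slot: u32, pub generation: u32 }

/// Where a thread's runnable ownership currently lives. Exactly one CPU (or
/// the in-flight migration state) may own a runnable thread at a time.
pub enum RunnableState {
    /// Owned by this CPU's run queue; eligible for dispatch there only.
    OwnedBy { cpu: u16 },
    /// Moving between CPUs; neither queue may dispatch it until the
    /// destination completes the handoff.
    Migrating { from: u16, to: u16 },
    /// Blocked or exited; present on no run queue.
    NotRunnable,
}

pub struct PerCpuRunQueue {
    pub cpu: u16,
    /// Generation-checked refs; each is re-validated against the thread
    /// table before dispatch, so stale generations fail closed.
    pub runnable: Vec<ThreadRef>,
}

impl PerCpuRunQueue {
    /// Bounded steal: examine at most `budget` candidates so the steal path
    /// cannot scan a victim queue unboundedly. `still_valid` stands in for
    /// the generation re-check against the thread table.
    pub fn steal_from(
        &mut self,
        victim: &mut PerCpuRunQueue,
        budget: usize,
        still_valid: impl Fn(ThreadRef) -> bool,
    ) -> Option<ThreadRef> {
        for _ in 0..budget {
            let t = victim.runnable.pop()?;
            if still_valid(t) {
                // Real kernel: flip the thread's RunnableState to
                // Migrating { from: victim.cpu, to: self.cpu } before enqueue.
                self.runnable.push(t);
                return Some(t);
            }
            // Stale generation: drop the entry; ownership fails closed.
        }
        None
    }
}
```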

Phase C: CPU Accounting

  • Add monotonic runtime charge points at context switch, preemption, blocking syscall, direct IPC handoff, unblock, and thread exit (a ledger sketch follows this list).
  • Track per-thread runtime, virtual runtime seed, context switches, preemptions, voluntary blocks, and migrations.
  • Add process/session/service aggregation only after the per-thread record has a single ledger of record.
  • Add tests or QEMU diagnostics proving runtime increases while running and stops while blocked.
  • Keep runtime accounting independent of tickless idle by using the monotonic clocksource layer.
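
A minimal per-thread ledger sketch for the charge points above, assuming a monotonic nanosecond clocksource feeds `now_ns`; all names are illustrative.

```rust
#[derive(Default)]
pub struct ThreadAccounting {
    pub runtime_ns: u64,       // total on-CPU time
    pub vruntime_seed_ns: u64, // virtual runtime seed when made runnable
    pub ctx_switches: u64,
    pub preemptions: u64,
    pub voluntary_blocks: u64,
    pub migrations: u64,
    last_dispatch_ns: Option<u64>, // open interval while on-CPU
}

impl ThreadAccounting {
    /// Charge point on dispatch: remember when this thread went on-CPU.
    pub fn on_dispatch(&mut self, now_ns: u64) {
        self.last_dispatch_ns = Some(now_ns);
        self.ctx_switches += 1;
    }

    /// Charge point on any off-CPU transition (preemption, blocking syscall,
    /// direct IPC handoff, exit): close the interval against the clock.
    pub fn on_descheduled(&mut self, now_ns: u64, preempted: bool) {
        if let Some(start) = self.last_dispatch_ns.take() {
            self.runtime_ns += now_ns.saturating_sub(start);
        }
        if preempted { self.preemptions += 1 } else { self.voluntary_blocks += 1 }
    }
}
```

Runtime only advances between a dispatch charge and the matching descheduling charge, so a blocked thread's runtime is frozen, which is exactly what the diagnostic in the fourth bullet should observe.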

Phase D: Best-Effort Fair Scheduling

  • Choose an initial weighted-fair or EEVDF-like policy based on accounting and queue data (a virtual-runtime example follows this list).
  • Add scheduler entity weights and latency class metadata through a capability-authorized policy path, not ambient process fields.
  • Preserve fairness across CPU migration.
  • Test CPU hogs, short sleepers, direct IPC server/client pairs, multi-process load, and same-process sibling load.
  • Define overload behavior when runnable entities exceed the selected CPU set or when migration cannot keep up.
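
For orientation, a weighted virtual-runtime charge in the CFS/EEVDF family fits in a few lines. This is a sketch, not the selected policy; the base weight of 1024 is borrowed from Linux purely for illustration.

```rust
const BASE_WEIGHT: u64 = 1024; // illustrative base, as in Linux CFS

/// Advance a thread's virtual runtime by `delta_ns` of real on-CPU time.
/// Heavier weights accumulate vruntime more slowly, so a "pick the smallest
/// vruntime" dispatch rule gives them proportionally more CPU.
fn charge_vruntime(vruntime_ns: u64, delta_ns: u64, weight: u64) -> u64 {
    vruntime_ns + delta_ns * BASE_WEIGHT / weight.max(1)
}

fn main() {
    // A weight-2048 thread accrues vruntime half as fast as a weight-1024 one.
    assert_eq!(charge_vruntime(0, 1_000_000, 2048), 500_000);
    assert_eq!(charge_vruntime(0, 1_000_000, 1024), 1_000_000);
}
```

One common way to preserve fairness across migration is to normalize a migrating thread's vruntime against the destination queue's minimum, so it neither starves nor dominates on arrival; whether that fits here is a design decision for this phase.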

Phase E: SchedulingContext Capability

  • Define the first SchedulingContext object shape: budget, period, relative deadline, CPU mask, replenishment state, timeout endpoint, and overrun policy (a layout sketch follows this list).
  • Add capability creation/bind/revoke rules and generation identity.
  • Enforce budget and replenishment in the kernel dispatcher.
  • Add endpoint donation/return semantics for synchronous calls and passive services.
  • Add timeout/depletion notifications with preallocated emergency-path storage.
  • Prove stale scheduling contexts fail closed after revoke, process exit, and session logout.
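
A sketch of one possible SchedulingContext layout and its dispatcher-side budget charge, following the field list in the first bullet; field names, types, and the single-step replenishment are assumptions for illustration.

```rust
#[derive(Clone, Copy)]
pub struct EndpointRef { pub slot: u32, pub generation: u32 }

pub enum OverrunPolicy {
    /// Stop dispatching until the next replenishment.
    ThrottleUntilReplenish,
    /// Keep running at best-effort weight after depletion.
    DemoteToBestEffort,
    /// Post a depletion notification to the timeout endpoint.
    Notify,
}

pub struct SchedulingContext {
    pub budget_ns: u64,            // CPU time granted per period
    pub period_ns: u64,
    pub relative_deadline_ns: u64, // deadline measured from period start
    pub cpu_mask: u64,             // CPUs this context may run on
    pub replenish_at_ns: u64,      // next budget refill (replenishment state)
    pub remaining_ns: u64,         // budget left in the current period
    pub timeout_endpoint: Option<EndpointRef>,
    pub overrun: OverrunPolicy,
    pub generation: u32,           // capability generation identity
}

impl SchedulingContext {
    /// Dispatcher charge after the context ran for `ran_ns`. Single-step
    /// replenishment is a simplification for the sketch.
    pub fn charge(&mut self, ran_ns: u64, now_ns: u64) -> bool {
        if now_ns >= self.replenish_at_ns {
            self.remaining_ns = self.budget_ns;
            self.replenish_at_ns += self.period_ns;
        }
        self.remaining_ns = self.remaining_ns.saturating_sub(ran_ns);
        self.remaining_ns > 0 // false => depleted, apply `overrun`
    }
}
```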

Phase F: CPU Isolation Lease and SQPOLL

  • Define CpuIsolationLease authority separately from CPU-time budget (a lease sketch follows this list).
  • Add scheduler activation proof for housekeeping, deferred cleanup, timers, networking, IRQ affinity, accounting target, and revocation latency.
  • Integrate SQPOLL ring mode only after one-SQ-consumer ring ownership is enforced.
  • Add lease revocation on explicit revoke, process exit, service replacement, and session close.
  • Add nohz activation/deactivation telemetry.
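
A small sketch of lease state kept separate from any CPU-time budget, per the first bullet; every name here is an assumption. The housekeeping_cpu field only illustrates that revocation must know what to restore.

```rust
#[derive(Clone, Copy)]
pub struct ProcessRef { pub slot: u32, pub generation: u32 }

pub enum RevokeReason { Explicit, ProcessExit, ServiceReplaced, SessionClosed }

/// Authority to monopolize a CPU; carries no CPU-time budget of its own.
pub struct CpuIsolationLease {
    pub cpu: u16,
    pub holder: ProcessRef,
    /// Where housekeeping (timers, deferred cleanup, IRQs) moved while the
    /// lease is active, so revocation knows what to restore.
    pub housekeeping_cpu: u16,
    pub revoke_reason: Option<RevokeReason>,
}

impl CpuIsolationLease {
    /// Revocation is unconditional: record why, then the scheduler restores
    /// housekeeping to `housekeeping_cpu` and re-enables ticks on `cpu`.
    /// Revocation latency is one of the telemetry outputs.
    pub fn revoke(&mut self, reason: RevokeReason) {
        self.revoke_reason = Some(reason);
    }
}
```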

Phase G: Realtime Islands

  • Define RealtimeIsland admission inputs: scheduling contexts, memory reservations, device/IRQ reservations, communication paths, CPU leases, and overrun policy.
  • Add a small local-audio or synthetic periodic-control proof before robotics or provider workloads (see the loop sketch after this list).
  • Prove no allocation, blocking endpoint call, paging, or logging on the admitted realtime path.
  • Record deadline misses and overrun handling as observable output.
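
A user-level sketch of the synthetic periodic-control proof, using std timing purely for illustration; the admitted realtime version would run on the isolated path with no allocation, blocking call, or logging inside the loop, and would surface misses through the island's observable output.

```rust
use std::time::{Duration, Instant};

fn main() {
    let period = Duration::from_millis(1);
    let deadline = Duration::from_micros(500); // relative deadline in the period
    let mut misses = 0u32;
    let mut next = Instant::now();

    for _ in 0..1_000 {
        next += period; // absolute release time of the next iteration
        let released = Instant::now();
        // Bounded synthetic work stands in for the control computation.
        std::hint::black_box((0..100u64).sum::<u64>());
        if released.elapsed() > deadline {
            misses += 1; // recorded, not hidden: misses are the proof's output
        }
        while Instant::now() < next {} // spin to the next release, no blocking
    }
    println!("deadline misses: {misses}/1000");
}
```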

Phase H: Policy Service

  • Define a privileged scheduler policy service interface for admission, budget/profile updates, CPU lease grant/revoke, and diagnostics (an interface sketch follows this list).
  • Keep kernel fallback scheduling independent of policy-service liveness.
  • Add manifest/config hooks for default profiles without making policy changes require kernel rebuilds.
  • Add operator diagnostics that explain why a thread or island was denied, throttled, migrated, or revoked.
  • Define how stateful task/job graph assignment metadata maps into scheduler policy inputs: graph priority to weight/latency class, graph deadline to request freshness or admission input, graph budget to SchedulingContext reference, and graph queue to policy-service placement. The graph coordinator must not mint CPU authority by itself.
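
A sketch of the policy-service surface as a Rust trait; trait, type, and method names are assumptions. Because the kernel keeps a fallback scheduler, every method here is advisory configuration rather than a liveness dependency.

```rust
#[derive(Clone, Copy)]
pub struct ThreadRef { pub slot: u32, pub generation: u32 }
#[derive(Clone, Copy)]
pub struct ProcessRef { pub slot: u32, pub generation: u32 }
#[derive(Clone, Copy)]
pub struct LeaseRef { pub id: u64 }

/// Bundle of admission inputs (scheduling context, reservations,
/// communication paths); fields elided in this sketch.
pub struct AdmissionRequest {}

pub struct Profile { pub weight: u64, pub latency_class: u8 }

pub enum DenyReason { OverCommitted, NoAuthority, InvalidGeneration }

pub enum PolicyEvent {
    Denied(DenyReason),
    Throttled,
    Migrated { to: u16 },
    Revoked,
}

pub trait SchedulerPolicyService {
    /// Admission: a denial carries a reason the diagnostics can explain.
    fn admit(&mut self, req: AdmissionRequest) -> Result<(), DenyReason>;
    fn update_profile(&mut self, thread: ThreadRef, profile: Profile)
        -> Result<(), DenyReason>;
    fn grant_cpu_lease(&mut self, cpu: u16, holder: ProcessRef)
        -> Result<LeaseRef, DenyReason>;
    fn revoke_cpu_lease(&mut self, lease: LeaseRef);
    /// Operator diagnostics: why was this entity denied, throttled,
    /// migrated, or revoked?
    fn explain(&self, subject: ThreadRef) -> Vec<PolicyEvent>;
}
```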