# Scheduler Evolution Backlog

This backlog decomposes the future scheduler architecture from
[Scheduler Evolution](../proposals/scheduler-evolution-proposal.md). It does
not replace the selected **In-Process Threading Scalability** milestone; use it
when work moves from current thread-scale triage into scheduler architecture.

## Design Grounding Checklist

Before implementation slices, read:

- `docs/architecture/scheduling.md`
- `docs/backlog/smp-phase-c.md`
- `docs/proposals/smp-proposal.md`
- `docs/proposals/ring-v2-smp-proposal.md`
- `docs/proposals/tickless-realtime-scheduling-proposal.md`
- `docs/proposals/stateful-task-job-graphs-proposal.md`
- `docs/proposals/scheduler-evolution-proposal.md`
- `docs/research/future-scheduler-architecture.md`
- `docs/research/nohz-sqpoll-realtime.md`
- `docs/research/out-of-kernel-scheduling.md`
- `docs/research/completion-ring-threading.md`

For realtime or isolation slices, also read:

- `docs/research/multimedia-pipeline-latency.md`
- `docs/research/robotics-realtime-control.md`
- `docs/research/x2apic-and-virtualization.md`

## Phase A: Attribution and Guardrails

- [ ] Finish thread-scale attribution. First-pass scheduler
      candidate/outcome, reschedule-IPI, serial-byte, and scheduler-lock
      counters exist behind `CAPOS_THREAD_SCALE_GUEST_MEASURE=1` and have been
      confirmed on `capos-bench`; timer-interrupt and first-pass CR3/TLB
      counters exist in the same guest-measure path. Remaining slices:
      guest-symbol or guest-PC samples, workload/cacheline A/B evidence, and
      logging-suppression A/B evidence.
- [ ] Add a benchmark-kernel mode that suppresses per-context-switch logging
      during measured cases so serial MMIO cannot masquerade as scheduler cost.
- [ ] Decide which counters graduate to permanent observability and which stay
      behind `measure`.
- [ ] Record controlled `capos-bench` evidence before and after each scheduler
      structure change.
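
As a minimal sketch of the guardrail above, counters in the guest-measure path can compile down to a gated no-op so measured cases stay cheap and permanent builds carry no hidden cost. All names here (`MeasureCounter`, `bump`) are illustrative assumptions, not existing kernel API; the `enabled` flag stands in for the `CAPOS_THREAD_SCALE_GUEST_MEASURE=1` gate.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// A counter that only records when the measure gate is on, so instrumented
/// paths cost one branch in ordinary builds. Names are illustrative.
pub struct MeasureCounter {
    value: AtomicU64,
    enabled: bool, // stands in for the CAPOS_THREAD_SCALE_GUEST_MEASURE gate
}

impl MeasureCounter {
    pub const fn new(enabled: bool) -> Self {
        Self { value: AtomicU64::new(0), enabled }
    }

    /// Record one event; a no-op when measurement is disabled.
    pub fn bump(&self) {
        if self.enabled {
            self.value.fetch_add(1, Ordering::Relaxed);
        }
    }

    pub fn read(&self) -> u64 {
        self.value.load(Ordering::Relaxed)
    }
}
```

The same shape works for deciding which counters become permanent: permanent ones simply lose the gate.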

## Phase B: Per-CPU Runnable Ownership

- [ ] Define `PerCpuRunQueue` ownership invariants: one runnable owner per
      live generation-checked `ThreadRef`, no duplicate runnable placement,
      explicit migration state, and bounded steal path.
- [ ] Split current-thread and runnable ownership from shared process/thread
      metadata without widening emergency-path allocation.
- [ ] Add cross-CPU wake policy for endpoint, timer, park, process wait, and
      thread join completions.
- [ ] Add bounded reschedule IPI behavior for idle-to-runnable transitions.
- [ ] Preserve direct IPC handoff as a scheduling preference without bypassing
      per-CPU ownership or generation checks.
- [ ] Prove process/thread exit cleanup cannot leave a stale runnable entry on
      any CPU queue.
- [ ] Rerun `make run-thread-scale`, `make run-smp2-smokes`, ordinary smoke,
      spawn/thread, park, ring, and process-exit focused proofs.
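
The ownership invariants above can be sketched as an explicit placement state machine, so duplicate runnable placement and stale migration become rejected transitions rather than silent queue corruption. `Placement` and `try_enqueue` are assumptions for illustration, not the real `PerCpuRunQueue` interface.

```rust
/// Where a runnable thread lives; exactly one owner at a time. Illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Placement {
    /// Not on any run queue (blocked, exited, or currently running).
    Unqueued,
    /// Runnable and owned by exactly one CPU's queue.
    Queued { cpu: usize },
    /// In flight between queues; neither CPU may dispatch it yet.
    Migrating { from: usize, to: usize },
}

/// Enqueue onto `cpu` only from a state that cannot create a second owner;
/// any other transition is returned as an error with the conflicting state.
pub fn try_enqueue(p: Placement, cpu: usize) -> Result<Placement, Placement> {
    match p {
        Placement::Unqueued => Ok(Placement::Queued { cpu }),
        Placement::Migrating { to, .. } if to == cpu => Ok(Placement::Queued { cpu }),
        other => Err(other), // already owned: would violate the single-owner invariant
    }
}
```

The exit-cleanup proof in this phase reduces to showing every exit path drives the state to `Unqueued` before the `ThreadRef` generation is retired.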

## Phase C: CPU Accounting

- [ ] Add monotonic runtime charge points at context switch, preemption,
      blocking syscall, direct IPC handoff, unblock, and thread exit.
- [ ] Track per-thread runtime, virtual runtime seed, context switches,
      preemptions, voluntary blocks, and migrations.
- [ ] Add process/session/service aggregation only after per-thread runtime
      has a single authoritative ledger.
- [ ] Add tests or QEMU diagnostics proving runtime increases while running and
      stops while blocked.
- [ ] Keep runtime accounting independent of tickless idle by using the
      monotonic clocksource layer.
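
The charge points above amount to a small per-thread ledger charged on dispatch and descheduling against the monotonic clock. This is a sketch under assumed names (`ThreadLedger`, `on_dispatch`, `on_descheduled`), not the kernel's accounting structure.

```rust
/// Per-thread runtime ledger; charged only while the thread is on-CPU,
/// so runtime provably stops advancing while blocked. Names illustrative.
#[derive(Default)]
pub struct ThreadLedger {
    pub runtime_ns: u64,
    pub context_switches: u64,
    pub voluntary_blocks: u64,
    last_charged_ns: Option<u64>, // Some(..) only while on-CPU
}

impl ThreadLedger {
    /// Thread placed on a CPU at monotonic time `now_ns`.
    pub fn on_dispatch(&mut self, now_ns: u64) {
        self.last_charged_ns = Some(now_ns);
        self.context_switches += 1;
    }

    /// Thread leaves the CPU (preemption, blocking syscall, exit):
    /// charge the elapsed delta exactly once.
    pub fn on_descheduled(&mut self, now_ns: u64, voluntary: bool) {
        if let Some(start) = self.last_charged_ns.take() {
            self.runtime_ns += now_ns.saturating_sub(start);
        }
        if voluntary {
            self.voluntary_blocks += 1;
        }
    }
}
```

Because `last_charged_ns` is `take()`n, a double descheduling cannot double-charge, which is the property the QEMU diagnostics in this phase would assert.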

## Phase D: Best-Effort Fair Scheduling

- [ ] Choose initial weighted-fair or EEVDF-like policy based on accounting and
      queue data.
- [ ] Add scheduler entity weights and latency class metadata through a
      capability-authorized policy path, not ambient process fields.
- [ ] Preserve fairness across CPU migration.
- [ ] Test CPU hogs, short sleepers, direct IPC server/client pairs,
      multi-process load, and same-process sibling load.
- [ ] Define overload behavior when runnable entities exceed the selected CPU
      set or when migration cannot keep up.
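
For orientation, the weighted-fair core of either candidate policy is a virtual-runtime update: actual runtime is scaled inversely by weight, and the dispatcher picks the smallest virtual runtime. The `WEIGHT_UNIT` constant and both function names are illustrative assumptions, not a committed design.

```rust
/// Nominal weight; heavier entities accrue virtual runtime more slowly,
/// so they are picked more often. Value is illustrative.
const WEIGHT_UNIT: u64 = 1024;

/// Virtual-runtime charge for `actual_ns` of CPU at the given weight.
pub fn vruntime_delta(actual_ns: u64, weight: u64) -> u64 {
    actual_ns * WEIGHT_UNIT / weight
}

/// Pick the index of the runnable entity with the least virtual runtime.
pub fn pick_next(vruntimes: &[u64]) -> Option<usize> {
    vruntimes
        .iter()
        .enumerate()
        .min_by_key(|&(_, v)| *v)
        .map(|(i, _)| i)
}
```

Preserving fairness across migration then reduces to carrying (and re-seeding) virtual runtime when a thread changes queues, rather than resetting it to zero.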

## Phase E: SchedulingContext Capability

- [ ] Define the first `SchedulingContext` object shape: budget, period,
      relative deadline, CPU mask, replenishment state, timeout endpoint, and
      overrun policy.
- [ ] Add capability creation/bind/revoke rules and generation identity.
- [ ] Enforce budget and replenishment in the kernel dispatcher.
- [ ] Add endpoint donation/return semantics for synchronous calls and passive
      services.
- [ ] Add timeout/depletion notifications with preallocated emergency-path
      storage.
- [ ] Prove stale scheduling contexts fail closed after revoke, process exit,
      and session logout.
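
The object shape in the first item can be sketched directly; every field below comes from that list, while the concrete types (endpoint handle, generation width) and method names are placeholders, not a settled ABI.

```rust
/// What happens when a context exhausts its budget inside a period.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OverrunPolicy {
    Throttle,   // stop dispatching until replenishment
    NotifyOnly, // keep running, signal the timeout endpoint
}

/// Sketch of the first SchedulingContext shape. Field types are placeholders.
pub struct SchedulingContext {
    pub budget_ns: u64,                // CPU time per period
    pub period_ns: u64,
    pub relative_deadline_ns: u64,
    pub cpu_mask: u64,                 // one bit per permitted CPU
    pub consumed_ns: u64,              // replenishment state for this period
    pub timeout_endpoint: Option<u32>, // placeholder endpoint handle
    pub overrun: OverrunPolicy,
    pub generation: u32,               // fails closed when stale after revoke
}

impl SchedulingContext {
    /// Charge `delta_ns` of runtime; returns false when the budget is
    /// depleted and the policy requires throttling.
    pub fn charge(&mut self, delta_ns: u64) -> bool {
        self.consumed_ns += delta_ns;
        self.consumed_ns < self.budget_ns || self.overrun == OverrunPolicy::NotifyOnly
    }

    /// Period boundary: refill the budget.
    pub fn replenish(&mut self) {
        self.consumed_ns = 0;
    }
}
```

Donation for synchronous calls would pass a reference to this object to the callee, never a copy, so revocation and generation checks still bite mid-call.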

## Phase F: CPU Isolation Lease and SQPOLL

- [ ] Define `CpuIsolationLease` authority separately from CPU-time budget.
- [ ] Add scheduler activation proof for housekeeping, deferred cleanup,
      timers, networking, IRQ affinity, accounting target, and revocation
      latency.
- [ ] Integrate SQPOLL ring mode only after one-SQ-consumer ring ownership is
      enforced.
- [ ] Add lease revocation on explicit revoke, process exit, service
      replacement, and session close.
- [ ] Add nohz activation/deactivation telemetry.
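
The first item's separation of authority from budget can be illustrated as follows: holding a lease means housekeeping is moved off the CPU, not that the holder may spend time there, which still comes from a scheduling context. `CpuIsolationLease`'s fields and `permits` are assumed names for this sketch.

```rust
/// Lease lifecycle; every revocation trigger listed in this phase collapses
/// to the same Revoked state. Names illustrative.
#[derive(Debug, PartialEq, Eq)]
pub enum LeaseState {
    Granted { cpu: usize },
    Revoked, // explicit revoke, process exit, replacement, or session close
}

pub struct CpuIsolationLease {
    pub state: LeaseState,
    pub generation: u32,
}

impl CpuIsolationLease {
    /// Fails closed on any revocation, wrong CPU, or stale generation.
    pub fn permits(&self, cpu: usize, expected_generation: u32) -> bool {
        self.generation == expected_generation
            && matches!(self.state, LeaseState::Granted { cpu: c } if c == cpu)
    }

    pub fn revoke(&mut self) {
        self.state = LeaseState::Revoked;
    }
}
```

SQPOLL integration would check `permits` on the polling CPU before spinning, which is why one-SQ-consumer ring ownership has to land first.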

## Phase G: Realtime Islands

- [ ] Define `RealtimeIsland` admission inputs: scheduling contexts, memory
      reservations, device/IRQ reservations, communication paths, CPU leases,
      and overrun policy.
- [ ] Add a small local-audio or synthetic periodic-control proof before
      robotics or provider workloads.
- [ ] Prove no allocation, blocking endpoint call, paging, or logging on the
      admitted realtime path.
- [ ] Record deadline misses and overrun handling as observable output.
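
Admission over the inputs listed above is all-or-nothing, and the denial reason should be reportable (Phase H's operator diagnostics). This sketch reduces each reservation to a boolean for illustration; a real admission test would also verify schedulability of the scheduling contexts.

```rust
/// Admission inputs, simplified to pass/fail flags. Names illustrative.
pub struct IslandRequest {
    pub contexts_admitted: bool, // scheduling contexts pass schedulability
    pub memory_reserved: bool,   // memory pinned, no paging on the hot path
    pub devices_reserved: bool,  // device/IRQ reservations granted
    pub paths_wired: bool,       // communication paths preallocated
    pub leases_held: bool,       // CPU isolation leases granted
}

/// Admit only when every reservation holds; report the first missing input
/// so the denial is explainable to an operator.
pub fn admit(req: &IslandRequest) -> Result<(), &'static str> {
    if !req.contexts_admitted { return Err("scheduling contexts"); }
    if !req.memory_reserved { return Err("memory reservations"); }
    if !req.devices_reserved { return Err("device/IRQ reservations"); }
    if !req.paths_wired { return Err("communication paths"); }
    if !req.leases_held { return Err("cpu leases"); }
    Ok(())
}
```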

## Phase H: Policy Service

- [ ] Define a privileged scheduler policy service interface for admission,
      budget/profile updates, CPU lease grant/revoke, and diagnostics.
- [ ] Keep kernel fallback scheduling independent of policy-service liveness.
- [ ] Add manifest/config hooks for default profiles without making policy
      changes require kernel rebuilds.
- [ ] Add operator diagnostics that explain why a thread or island was denied,
      throttled, migrated, or revoked.
- [ ] Define how stateful task/job graph assignment metadata maps into
      scheduler policy inputs: graph priority to weight/latency class, graph
      deadline to request freshness or admission input, graph budget to
      `SchedulingContext` reference, and graph queue to policy-service
      placement. The graph coordinator must not mint CPU authority by itself.
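
The mapping in the last item can be sketched as a pure translation: graph metadata becomes policy inputs, and any CPU authority is passed through as a reference the coordinator was already given, never created here. Every name below is an assumption about a future interface, and the priority-to-weight curve is a placeholder for policy data.

```rust
/// Graph-side metadata as the coordinator sees it. Names illustrative.
pub struct GraphMeta {
    pub priority: u8,
    pub deadline_ns: Option<u64>,
    pub budget_context: Option<u32>, // handle to an existing SchedulingContext
}

/// What the policy service consumes. Names illustrative.
pub struct PolicyInput {
    pub weight: u64,
    pub admission_deadline_ns: Option<u64>,
    pub context_ref: Option<u32>, // reference only; no new authority minted
}

pub fn map_graph(meta: &GraphMeta) -> PolicyInput {
    PolicyInput {
        // Placeholder priority-to-weight curve; the real curve is policy data,
        // not kernel code.
        weight: 1024u64 << meta.priority.min(4),
        admission_deadline_ns: meta.deadline_ns,
        context_ref: meta.budget_context, // pass through, never minted here
    }
}
```

Keeping this a pure function with no capability-creating side effects is what enforces "the graph coordinator must not mint CPU authority by itself."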
