# SMP Phase C Backlog

This backlog records detailed context for the selected SMP Phase C AP
scheduler-owner proof and for the remaining follow-on work: full concurrent
SMP scheduling and in-process thread scaling.

## Visible Goal

Move from a single scheduler owner to multiple CPUs that can run independent
scheduler-owned kernel/user work concurrently, and prove that capability-owned
processes can improve wall-clock performance on a deterministic CPU-bound
workload under QEMU/KVM.

This backlog tracks two distinct visible milestones:

1. **Multi-Process SMP Concurrency**: `make run-smp-process-scale` should boot
   a focused manifest, run a deterministic SMP scaling demo across independent
   worker processes, print verified workload output, and report comparable
   1/2/4-process timing. The proof is complete only when repeated KVM-backed
   `-smp 1` and `-smp 2` runs show near-linear speedup for the selected
   workload, while the ordinary manifest, ring, thread, park, and process-exit
   smokes still pass under `-smp 2`.
2. **In-Process Threading Scalability**: `make run-thread-scale` should run a
   deterministic workload across sibling threads inside one process, verify the
   result, and report comparable 1/2/4-thread timing. This milestone depends on
   per-thread capability rings or equivalent completion routing so same-process
   threads are not serialized by the current process-wide `cap_enter` waiter.

Full concurrent SMP scheduling remains the underlying kernel goal for the
multi-process milestone. It means more than one CPU can own scheduler work
simultaneously, including per-CPU runnable ownership, cross-CPU
idle-to-runnable handoff, reschedule IPIs, safe current-thread tracking, and
reviewed lock/residency rules. The multi-process scaling demo is the first
user-visible acceptance test for that kernel capability.
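
A minimal sketch, in illustrative Rust with assumed names, of the per-CPU
runnable ownership and reschedule-IPI handoff this paragraph describes; the
real design is still gated below, and none of these types exist yet.

```rust
// Illustrative only: assumed names, not the kernel's real scheduler types.
use std::collections::VecDeque;

struct PerCpuRunQueue {
    cpu: u32,
    runnable: VecDeque<u64>, // thread ids owned by this CPU alone
    idle: bool,              // true while this CPU sits in its idle loop
}

/// Cross-CPU idle-to-runnable handoff: enqueue on the target CPU's own queue,
/// then nudge it with a reschedule IPI if it is idle so it repolls runnables.
fn enqueue_remote(target: &mut PerCpuRunQueue, thread: u64, send_resched_ipi: impl Fn(u32)) {
    target.runnable.push_back(thread);
    if target.idle {
        send_resched_ipi(target.cpu);
    }
}

fn main() {
    let mut cpu1 = PerCpuRunQueue { cpu: 1, runnable: VecDeque::new(), idle: true };
    enqueue_remote(&mut cpu1, 42, |cpu| println!("resched IPI -> cpu {cpu}"));
}
```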

## Completed Gates

- [x] Ground the multi-CPU scheduling slice in the SMP proposal, scheduler and
      threading docs, and relevant `docs/research/` files.
- [x] Migrate syscall entry/exit to the GS-base/`swapgs` per-CPU path,
      including non-`sysretq` scheduler/exit paths.
- [x] Add LAPIC timer, EOI, and IPI support for per-CPU ticks and cross-CPU
      coordination. The active backend is PIT-calibrated xAPIC MMIO with
      PIT/PIC fallback; x2APIC remains a later backend.
- [x] Add TLB shootdown before any user address space can run on more than one
      CPU over its lifetime.
- [x] Extend scheduler state from BSP-only ownership to per-CPU current-thread
      tracking with AP idle/runnable handoff. The first AP scheduler proof uses
      one AP as scheduler owner while the BSP stays in kernel idle, preserving
      the process-wide ring invariant.
- [x] Add QEMU proof that AP cpu=1 executes scheduler-owned work and the
      existing manifest/ring/thread/park smokes still pass under `-smp 2`.

## Multi-Process SMP Concurrency Gates

- [ ] Split the current one-owner scheduler latch into per-CPU scheduler run
      queues or equivalent ownership that can keep more than one CPU executing
      scheduler-owned work at the same time.
- [ ] Add reschedule IPIs for idle-to-runnable handoff across scheduler owners.
- [ ] Prove concurrent scheduler-owned work on more than one CPU with
      independent worker processes first. This avoids process-wide capability
      ring races while still proving real multi-core execution.
- [ ] Add an SMP scaling demo binary and focused manifest. The preferred first
      workload is a pure-integer, deterministic, embarrassingly parallel
      algorithm such as segmented prime counting over generated ranges. It
      should partition work statically by worker index, avoid hot-path syscalls
      and serial output, produce a checksum or count that the parent verifies,
      and print one compact result line per run (see the workload sketch after
      this list).
- [ ] Add a host harness for `make run-smp-process-scale` that runs the same
      workload under `-smp 1`, `-smp 2`, and optionally `-smp 4`, captures raw
      logs, and reports worker count, CPU count, ticks or cycles, output
      checksum, and speedup. A single noisy QEMU run is not enough evidence for
      a scaling claim; keep raw repeated-run artifacts for review (see the
      acceptance-check sketch after this list).
- [ ] Treat near-linear 1-to-2 CPU speedup as the first publishable target.
      Use a threshold high enough to reject accidental concurrency illusions
      but low enough to tolerate QEMU/KVM variance, for example at least 1.6x
      median speedup over repeated runs. Record the exact threshold in the
      harness when this milestone is selected for implementation.
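
As a shape reference for the demo gate above, the following is a minimal
sketch of the worker kernel, in illustrative Rust: trial division stands in
for the segmented sieve named in the gate, the argument plumbing is a
placeholder, and every name is an assumption rather than the project's real
API.

```rust
// Hypothetical sketch: deterministic, embarrassingly parallel prime counting.
// Uses only core Rust; the real demo would use the project's spawn/report API.

/// Count primes in [lo, hi) by trial division; deterministic, allocation-free.
fn count_primes(lo: u64, hi: u64) -> u64 {
    let mut count = 0;
    for n in lo.max(2)..hi {
        let mut is_prime = true;
        let mut d = 2;
        while d * d <= n {
            if n % d == 0 {
                is_prime = false;
                break;
            }
            d += 1;
        }
        if is_prime {
            count += 1;
        }
    }
    count
}

/// Static partition: worker `idx` of `workers` owns one contiguous slice, so
/// the timed section shares no lock, mutable buffer, or capability ring.
fn worker_range(idx: u64, workers: u64, total: u64) -> (u64, u64) {
    let chunk = total / workers;
    let lo = idx * chunk;
    let hi = if idx + 1 == workers { total } else { lo + chunk };
    (lo, hi)
}

fn main() {
    // Placeholder arguments; the real demo would read these from the manifest
    // or from spawn arguments.
    let (idx, workers, total) = (0u64, 4u64, 2_000_000u64);
    let (lo, hi) = worker_range(idx, workers, total);
    let count = count_primes(lo, hi);
    // One compact result line per run; the parent sums worker counts and
    // checks the total against a known answer.
    println!("smp-scale worker={idx} range={lo}..{hi} primes={count}");
}
```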
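
The pass/fail arithmetic for the harness gate can be equally small. This
sketch assumes the log parsing already produced one ticks figure per repeated
run; the sample values and the 1.6x constant are placeholders until the real
threshold is recorded in the harness.

```rust
/// Median of a small, non-empty sample; averages the middle pair for even n.
fn median(samples: &mut [f64]) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = samples.len();
    if n % 2 == 1 {
        samples[n / 2]
    } else {
        (samples[n / 2 - 1] + samples[n / 2]) / 2.0
    }
}

fn main() {
    // Placeholder timings; the real harness reads these from raw run artifacts.
    let mut smp1_ticks = vec![1000.0, 1010.0, 990.0, 1005.0, 998.0];
    let mut smp2_ticks = vec![560.0, 545.0, 590.0, 550.0, 570.0];

    // Comparing medians over repeated runs rejects one lucky or noisy run as
    // evidence for a scaling claim.
    let speedup = median(&mut smp1_ticks) / median(&mut smp2_ticks);
    const THRESHOLD: f64 = 1.6; // example value from this gate, not yet final
    println!("median 1->2 CPU speedup: {speedup:.2}x (threshold {THRESHOLD}x)");
    assert!(speedup >= THRESHOLD, "scaling claim not supported by repeated runs");
}
```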

## In-Process Threading Scalability Gates

- [ ] Move capability-ring waiting/completion routing to the per-thread
      `ThreadRef` model before claiming same-process sibling threads scale
      independently on different CPUs.
- [ ] Ensure thread creation, FS/TLS setup, thread exit, join, park waits,
      and process exit remain generation-checked and safe when sibling threads
      can be resident on different CPUs (see the generation-check sketch after
      this list).
- [ ] Add an in-process thread scaling demo that uses the same class of
      deterministic CPU-bound workload as the multi-process proof, but splits
      work across sibling threads in one process. It should verify the same
      checksum or count and print one compact result line per run.
- [ ] Add a host harness for `make run-thread-scale` that runs 1/2/4-thread
      cases under matching QEMU CPU counts, captures raw logs, and rejects
      results where the runtime serializes the timed section through one
      process-wide ring owner.
- [ ] Treat same-process speedup as a separate claim from multi-process SMP
      concurrency. Passing `make run-smp-process-scale` must not imply this
      milestone is complete.
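
A minimal sketch, under assumed type names, of the generation-check pattern
these gates depend on: a `ThreadRef` stays routable only while its slot still
carries the generation it was issued under, so completions aimed at an exited
thread whose slot was reused are detectably stale.

```rust
// Illustrative only: field names and table layout are assumptions, not the
// kernel's real types.

/// Pairs a slot index with the generation it was issued under.
#[derive(Clone, Copy)]
struct ThreadRef {
    slot: usize,
    generation: u64,
}

struct ThreadSlot {
    generation: u64, // bumped on thread exit, invalidating old ThreadRefs
    alive: bool,
}

struct ThreadTable {
    slots: Vec<ThreadSlot>,
}

impl ThreadTable {
    /// Resolve a ThreadRef only if the referenced thread is still the same
    /// incarnation that issued it; stale refs resolve to None.
    fn resolve(&self, r: ThreadRef) -> Option<&ThreadSlot> {
        let slot = self.slots.get(r.slot)?;
        (slot.alive && slot.generation == r.generation).then_some(slot)
    }
}

fn main() {
    let table = ThreadTable {
        slots: vec![ThreadSlot { generation: 7, alive: true }],
    };
    assert!(table.resolve(ThreadRef { slot: 0, generation: 7 }).is_some());
    assert!(table.resolve(ThreadRef { slot: 0, generation: 6 }).is_none());
}
```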

## Task Selection

Choose a task that exercises scheduler and CPU parallelism in isolation rather
than measuring a subsystem bottleneck. Both milestones should use workload
shapes with these properties:

- CPU-bound and deterministic, with no network, disk, terminal, or heap-heavy
  hot path.
- Naturally partitionable into independent chunks so workers do not share a
  lock, mutable buffer, or capability ring while the timed section runs.
- Verifiable by a compact checksum, count, or known-answer oracle.
- Long enough to dominate boot, process spawn, timer granularity, and serial
  logging overhead.
- Runnable as independent worker processes for the multi-process milestone,
  and runnable as sibling threads only after per-thread completion routing is
  implemented for the in-process milestone.

Avoid basing the first SMP scaling claim on IPC throughput, capability-ring
dispatch, park wake storms, console logging, or allocator stress. Those are
valid later benchmarks, but they measure shared kernel bottlenecks as much as
CPU scheduling. Same-process thread scaling is explicitly a separate later
milestone because the current process-wide completion ring admits only one
blocked `cap_enter` waiter per process.

## Grounding Files

- `docs/proposals/smp-proposal.md`
- `docs/proposals/ring-v2-smp-proposal.md`
- `docs/architecture/scheduling.md`
- `docs/architecture/threading.md`
- `docs/research/completion-ring-threading.md`
- `docs/research/out-of-kernel-scheduling.md`
- `docs/research/sel4.md`
- `docs/research/zircon.md`
- `docs/research/x2apic-and-virtualization.md`

## Notes

Initial multi-CPU scheduling may keep the current process ring while the
runtime serializes process-ring consumption. Full SMP where sibling threads
from one process wait independently on different CPUs should not keep the
process-wide CQ as the kernel ABI endpoint. The target transport model is
per-thread capability rings: `cap_enter(min_complete, timeout_ns)` waits on the
current thread's CQ, kernel waiters route completions by generation-checked
`ThreadRef`, and SQPOLL becomes a per-ring mode with one kernel SQ consumer.
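
A hedged sketch of that target shape, with assumed types and function names
rather than the real kernel ABI: each thread owns its completion queue, the
kernel routes completions through a generation check, and `cap_enter` drains
only the calling thread's queue.

```rust
// Illustrative only: not the kernel's real ring layout or wait machinery.
use std::collections::VecDeque;

struct Completion {
    user_data: u64,
    result: i64,
}

/// Per-thread CQ, so sibling threads blocked in cap_enter on different CPUs
/// never contend for one process-wide waiter slot.
struct ThreadCq {
    generation: u64,
    entries: VecDeque<Completion>,
}

/// Kernel-side routing: deliver a completion to the issuing thread's CQ,
/// dropping it when the generation check shows the thread is gone.
fn route_completion(cqs: &mut [ThreadCq], slot: usize, gen: u64, c: Completion) -> bool {
    match cqs.get_mut(slot) {
        Some(cq) if cq.generation == gen => {
            cq.entries.push_back(c);
            true // a real kernel would also wake the CQ owner if it is parked
        }
        _ => false, // stale ThreadRef: the slot was reused after thread exit
    }
}

/// Caller-side shape of cap_enter(min_complete, timeout_ns): drain the current
/// thread's CQ once min_complete entries are available (the park/wake path and
/// the timeout handling are elided here).
fn cap_enter(cq: &mut ThreadCq, min_complete: usize, _timeout_ns: u64) -> Vec<Completion> {
    if cq.entries.len() < min_complete {
        return Vec::new(); // a real implementation parks the thread here
    }
    cq.entries.drain(..).collect()
}

fn main() {
    let mut cq = ThreadCq { generation: 3, entries: VecDeque::new() };
    let c = Completion { user_data: 1, result: 0 };
    assert!(route_completion(std::slice::from_mut(&mut cq), 0, 3, c));
    assert_eq!(cap_enter(&mut cq, 1, 0).len(), 1);
}
```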

SharedParkSpace park-words still need MemoryObject mapping provenance or object
pins before shared-key derivation lands.

2026-04-25 11:36 UTC: commit `d88bca7` recorded the First AP Scheduler proof.
AP cpu=1 can run scheduler-owned user contexts under `-smp 2`, and a one-way
scheduler-owner latch prevents the BSP and AP from both entering
scheduler-owned user work while the process-wide ring remains the active
transport.
