SMP Phase C Backlog
Detailed context for the selected SMP Phase C AP scheduler-owner proof and the remaining full-concurrent-SMP and in-process thread-scaling follow-on work.
Visible Goal
Move from a single scheduler owner to multiple CPUs that can run independent scheduler-owned kernel/user work concurrently, and prove that capability-owned processes can improve wall-clock performance on a deterministic CPU-bound workload under QEMU/KVM.
This backlog tracks two distinct visible milestones:
- Multi-Process SMP Concurrency:
  make run-smp-process-scale should boot a focused manifest, run a deterministic SMP scaling demo across independent worker processes, print verified workload output, and report comparable 1/2/4-process timing. The proof is complete only when repeated KVM-backed -smp 1 and -smp 2 runs show near-linear speedup for the selected workload, while the ordinary manifest, ring, thread, park, and process-exit smokes still pass under -smp 2.
- In-Process Threading Scalability:
  make run-thread-scale should run a deterministic workload across sibling threads inside one process, verify the result, and report comparable 1/2/4-thread timing. This milestone depends on per-thread capability rings or equivalent completion routing so same-process threads are not serialized by the current process-wide cap_enter waiter.
Full concurrent SMP scheduling remains the underlying kernel goal for the multi-process milestone. It means more than one CPU can own scheduler work simultaneously, including per-CPU runnable ownership, cross-CPU idle-to-runnable handoff, reschedule IPIs, safe current-thread tracking, and reviewed lock/residency rules. The multi-process scaling demo is the first user-visible acceptance test for that kernel capability.
Completed Gates
- Ground the multi-CPU scheduling slice in the SMP proposal, scheduler and threading docs, and relevant docs/research/ files.
- Migrate syscall entry/exit to the GS-base/swapgs per-CPU path, including non-sysretq scheduler/exit paths.
- Add LAPIC timer, EOI, and IPI support for per-CPU ticks and cross-CPU coordination. The active backend is PIT-calibrated xAPIC MMIO with PIT/PIC fallback; x2APIC remains a later backend.
- Add TLB shootdown before any user address space can run on more than one CPU over its lifetime.
- Extend scheduler state from BSP-only ownership to per-CPU current-thread tracking with AP idle/runnable handoff. The first AP scheduler proof uses one AP as scheduler owner while the BSP stays in kernel idle, preserving the process-wide ring invariant.
- Add QEMU proof that AP cpu=1 executes scheduler-owned work and the existing manifest/ring/thread/park smokes still pass under -smp 2.
Multi-Process SMP Concurrency Gates
- Split the current one-owner scheduler latch into per-CPU scheduler run queues or equivalent ownership that can keep more than one CPU executing scheduler-owned work at the same time.
- Add reschedule IPIs for idle-to-runnable handoff across scheduler owners.
- Prove concurrent scheduler-owned work on more than one CPU with independent worker processes first. This avoids process-wide capability ring races while still proving real multi-core execution.
- Add an SMP scaling demo binary and focused manifest. The preferred first workload is a pure integer, deterministic, embarrassingly parallel algorithm such as segmented prime counting over generated ranges. It should partition work statically by worker index, avoid hot-path syscalls and serial output, produce a checksum or count that the parent verifies, and print one compact result line per run.
- Add a host harness for make run-smp-process-scale that runs the same workload under -smp 1, -smp 2, and optionally -smp 4, captures raw logs, and reports worker count, CPU count, ticks or cycles, output checksum, and speedup. A single noisy QEMU run is not enough evidence for a scaling claim; keep raw repeated-run artifacts for review.
- Treat near-linear 1-to-2 CPU speedup as the first publishable target. Use a threshold high enough to reject accidental concurrency illusions but low enough for QEMU/KVM variance, for example at least 1.6x median speedup over repeated runs. Record the exact threshold in the harness when this milestone is selected for implementation.
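The acceptance rule the harness gates describe can be sketched in a few lines. This is a hypothetical host-side helper, not the actual harness: the function name is invented, and the 1.6x default simply mirrors the example threshold given above.

```python
from statistics import median

def accept_speedup(t1_runs: list[float], t2_runs: list[float],
                   threshold: float = 1.6) -> bool:
    """Accept the 1-to-2 CPU scaling claim only when the median speedup
    over repeated runs clears the threshold. Single runs are rejected,
    since one noisy QEMU run is not evidence for a scaling claim."""
    if len(t1_runs) < 3 or len(t2_runs) < 3:
        return False
    speedup = median(t1_runs) / median(t2_runs)
    return speedup >= threshold
```

Comparing medians rather than means keeps one outlier KVM run from flipping the verdict in either direction; the raw per-run timings should still be kept as review artifacts.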
In-Process Threading Scalability Gates
- Move capability-ring waiting/completion routing to the per-thread ThreadRef model before claiming same-process sibling threads scale independently on different CPUs.
- Ensure thread creation, FS/TLS setup, thread exit, join, park waits, and process exit remain generation-checked and safe when sibling threads can be resident on different CPUs.
- Add an in-process thread scaling demo that uses the same class of deterministic CPU-bound workload as the multi-process proof, but splits work across sibling threads in one process. It should verify the same checksum or count and print one compact result line per run.
- Add a host harness for make run-thread-scale that runs 1/2/4-thread cases under matching QEMU CPU counts, captures raw logs, and rejects results where the runtime serializes the timed section through one process-wide ring owner.
- Treat same-process speedup as a separate claim from multi-process SMP concurrency. Passing make run-smp-process-scale must not imply this milestone is complete.
Task Selection
Choose a task that isolates scheduler and CPU parallelism rather than a subsystem bottleneck. Both milestones should use workload shapes with these properties:
- CPU-bound and deterministic, with no network, disk, terminal, or heap-heavy hot path.
- Naturally partitionable into independent chunks so workers do not share a lock, mutable buffer, or capability ring while the timed section runs.
- Verifiable by a compact checksum, count, or known-answer oracle.
- Long enough to dominate boot, process spawn, timer granularity, and serial logging overhead.
- Runnable as independent worker processes for the multi-process milestone, and runnable as sibling threads only after per-thread completion routing is implemented for the in-process milestone.
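A minimal host-side model of a workload with these properties, using the segmented prime counting candidate named above. All names here (count_primes_in_range, worker_share, run) are illustrative, not the demo binary's actual interface; the point is the static partition by worker index and the worker-count-independent checksum.

```python
def is_prime(n: int) -> bool:
    # Pure integer trial division: deterministic, CPU-bound, allocation-free.
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def count_primes_in_range(lo: int, hi: int) -> int:
    # One worker's timed section: no syscalls, no shared mutable state.
    return sum(1 for n in range(lo, hi) if is_prime(n))

def worker_share(limit: int, workers: int, index: int) -> tuple[int, int]:
    # Static partition by worker index: contiguous, non-overlapping ranges.
    chunk = limit // workers
    lo = index * chunk
    hi = limit if index == workers - 1 else lo + chunk
    return lo, hi

def run(limit: int, workers: int) -> int:
    # The parent verifies one compact result: the total prime count,
    # which must be identical for 1, 2, or 4 workers.
    return sum(count_primes_in_range(*worker_share(limit, workers, i))
               for i in range(workers))
```

Because the total count is invariant across worker counts, the parent can verify correctness with a single comparison while timing only the per-worker counting loops.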
Avoid using IPC throughput, capability-ring dispatch, park wake storms,
console logging, or allocator stress as the first SMP scaling claim. Those are
valid later benchmarks, but they measure shared kernel bottlenecks as much as
CPU scheduling. Same-process thread scaling is explicitly a separate later
milestone because the current process-wide completion ring admits only one
blocked cap_enter waiter per process.
Grounding Files
- docs/proposals/smp-proposal.md
- docs/proposals/ring-v2-smp-proposal.md
- docs/architecture/scheduling.md
- docs/architecture/threading.md
- docs/research/completion-ring-threading.md
- docs/research/out-of-kernel-scheduling.md
- docs/research/sel4.md
- docs/research/zircon.md
- docs/research/x2apic-and-virtualization.md
Notes
Initial multi-CPU scheduling may keep the current process ring while the
runtime serializes process-ring consumption. Full SMP where sibling threads
from one process wait independently on different CPUs should not keep the
process-wide CQ as the kernel ABI endpoint. The target transport model is
per-thread capability rings: cap_enter(min_complete, timeout_ns) waits on the
current thread’s CQ, kernel waiters route completions by generation-checked
ThreadRef, and SQPOLL becomes a per-ring mode with one kernel SQ consumer.
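The generation-checked ThreadRef routing invariant described above can be modeled in a few lines. This is a toy host-side sketch of the invariant only, with invented names; it is not the kernel's actual data structures or locking.

```python
from collections import deque

class ThreadRef:
    """Slot index plus generation. A stale ref (the slot was reused and
    its generation bumped) must never receive another thread's completions."""
    def __init__(self, slot: int, generation: int):
        self.slot = slot
        self.generation = generation

class PerThreadRings:
    """Per-thread completion queues keyed by slot, guarded by generation."""
    def __init__(self, slots: int):
        self.generations = [0] * slots
        self.cqs = [deque() for _ in range(slots)]

    def retire(self, ref: ThreadRef) -> None:
        # Thread exit bumps the generation, invalidating outstanding refs
        # and discarding completions that can no longer be delivered.
        self.generations[ref.slot] += 1
        self.cqs[ref.slot].clear()

    def route_completion(self, ref: ThreadRef, cqe) -> bool:
        # Generation check: drop completions aimed at a retired thread
        # instead of delivering them to whoever reused the slot.
        if self.generations[ref.slot] != ref.generation:
            return False
        self.cqs[ref.slot].append(cqe)
        return True
```

The check in route_completion is the property that lets sibling threads wait on their own CQs from different CPUs without a process-wide serialization point.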
SharedParkSpace park-words still need MemoryObject mapping provenance or object pins before shared-key derivation lands.
2026-04-25 11:36 UTC: commit d88bca7 recorded the First AP Scheduler proof.
AP cpu=1 can run scheduler-owned user contexts under -smp 2, and a one-way
scheduler-owner latch prevents the BSP and AP from both entering
scheduler-owned user work while the process-wide ring remains the active
transport.