SMP Phase C Backlog

Detailed context for the selected SMP Phase C AP scheduler-owner proof, plus the remaining follow-on work on full concurrent SMP and in-process thread scaling.

Visible Goal

Move from a single scheduler owner to multiple CPUs that can run independent scheduler-owned kernel/user work concurrently, and prove that capability-owned processes can improve wall-clock performance on a deterministic CPU-bound workload under QEMU/KVM.

This backlog tracks two distinct visible milestones:

  1. Multi-Process SMP Concurrency: make run-smp-process-scale should boot a focused manifest, run a deterministic SMP scaling demo across independent worker processes, print verified workload output, and report comparable 1/2/4-process timing. The proof is complete only when repeated KVM-backed -smp 1 and -smp 2 runs show near-linear speedup for the selected workload, while the ordinary manifest, ring, thread, park, and process-exit smokes still pass under -smp 2.
  2. In-Process Threading Scalability: make run-thread-scale should run a deterministic workload across sibling threads inside one process, verify the result, and report comparable 1/2/4-thread timing. This milestone depends on per-thread capability rings or equivalent completion routing so same-process threads are not serialized by the current process-wide cap_enter waiter.
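The "near-linear speedup over repeated runs" criterion from milestone 1 could be encoded in the harness roughly as below. This is a sketch under assumptions: the function names, the median-of-repeated-runs comparison, and the 1.6x threshold (borrowed from the example later in this backlog) are illustrative, not the harness's actual API.

```rust
/// Upper median of a set of per-run wall-clock samples (e.g. ticks).
/// Repeated runs are required because a single QEMU run is too noisy.
fn median(samples: &mut Vec<u64>) -> u64 {
    samples.sort_unstable();
    samples[samples.len() / 2]
}

/// Compare median -smp 1 time against median -smp 2 time and accept only
/// if the speedup clears the recorded threshold (e.g. 1.6x).
fn near_linear_speedup(mut t1: Vec<u64>, mut t2: Vec<u64>, threshold: f64) -> bool {
    let m1 = median(&mut t1) as f64;
    let m2 = median(&mut t2) as f64;
    m2 > 0.0 && m1 / m2 >= threshold
}

fn main() {
    // Illustrative timings: the -smp 2 runs take roughly half the -smp 1 time.
    let t1 = vec![2050, 2010, 1990, 2100, 2000];
    let t2 = vec![1040, 1010, 1100, 1000, 1020];
    assert!(near_linear_speedup(t1, t2, 1.6));
    println!("speedup threshold met");
}
```

Taking the median rather than the mean rejects one-off outlier runs, which matters under KVM scheduling jitter.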

Full concurrent SMP scheduling remains the underlying kernel goal for the multi-process milestone. It means more than one CPU can own scheduler work simultaneously, including per-CPU runnable ownership, cross-CPU idle-to-runnable handoff, reschedule IPIs, safe current-thread tracking, and reviewed lock/residency rules. The multi-process scaling demo is the first user-visible acceptance test for that kernel capability.

Completed Gates

  • Ground the multi-CPU scheduling slice in the SMP proposal, scheduler and threading docs, and relevant docs/research/ files.
  • Migrate syscall entry/exit to the GS-base/swapgs per-CPU path, including non-sysretq scheduler/exit paths.
  • Add LAPIC timer, EOI, and IPI support for per-CPU ticks and cross-CPU coordination. The active backend is PIT-calibrated xAPIC MMIO with PIT/PIC fallback; x2APIC remains a later backend.
  • Add TLB shootdown before any user address space can run on more than one CPU over its lifetime.
  • Extend scheduler state from BSP-only ownership to per-CPU current-thread tracking with AP idle/runnable handoff. The first AP scheduler proof uses one AP as scheduler owner while the BSP stays in kernel idle, preserving the process-wide ring invariant.
  • Add QEMU proof that AP cpu=1 executes scheduler-owned work and the existing manifest/ring/thread/park smokes still pass under -smp 2.

Multi-Process SMP Concurrency Gates

  • Split the current one-owner scheduler latch into per-CPU scheduler run queues or equivalent ownership that can keep more than one CPU executing scheduler-owned work at the same time.
  • Add reschedule IPIs for idle-to-runnable handoff across scheduler owners.
  • Prove concurrent scheduler-owned work on more than one CPU with independent worker processes first. This avoids process-wide capability ring races while still proving real multi-core execution.
  • Add an SMP scaling demo binary and focused manifest. The preferred first workload is a pure integer, deterministic, embarrassingly parallel algorithm such as segmented prime counting over generated ranges. It should partition work statically by worker index, avoid hot-path syscalls and serial output, produce a checksum or count that the parent verifies, and print one compact result line per run.
  • Add a host harness for make run-smp-process-scale that runs the same workload under -smp 1, -smp 2, and optionally -smp 4, captures raw logs, and reports worker count, CPU count, ticks or cycles, output checksum, and speedup. A single noisy QEMU run is not enough evidence for a scaling claim; keep raw repeated-run artifacts for review.
  • Treat near-linear 1-to-2 CPU speedup as the first publishable target. Use a threshold high enough to reject accidental concurrency illusions but low enough for QEMU/KVM variance, for example at least 1.6x median speedup over repeated runs. Record the exact threshold in the harness when this milestone is selected for implementation.
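The preferred workload shape above can be sketched as follows. This is a hedged host-side sketch, not the demo binary: `count_primes` and `worker_range` are hypothetical names, and trial division stands in for whatever segmented sieve the demo actually uses. It shows the two properties the gate calls out: static partitioning by worker index and a parent-verifiable count.

```rust
/// Deterministic, CPU-bound kernel: count primes in [lo, hi) by trial
/// division. Pure integer work, no syscalls or allocation in the hot path.
fn count_primes(lo: u64, hi: u64) -> u64 {
    (lo.max(2)..hi)
        .filter(|&n| (2..=((n as f64).sqrt() as u64)).all(|d| n % d != 0))
        .count() as u64
}

/// Static partition: worker `idx` of `workers` owns one contiguous slice of
/// [0, total), so workers share no lock, buffer, or ring while timed.
fn worker_range(idx: u64, workers: u64, total: u64) -> (u64, u64) {
    let chunk = total / workers;
    let lo = idx * chunk;
    let hi = if idx + 1 == workers { total } else { lo + chunk };
    (lo, hi)
}

fn main() {
    let total = 10_000;
    // The parent verifies that per-worker counts sum to the 1-worker answer.
    let expected = count_primes(0, total);
    for workers in [1, 2, 4] {
        let sum: u64 = (0..workers)
            .map(|i| {
                let (lo, hi) = worker_range(i, workers, total);
                count_primes(lo, hi)
            })
            .sum();
        assert_eq!(sum, expected);
    }
    // One compact result line per run, as the gate requires.
    println!("primes below {total}: {expected}");
}
```

Note the verification is a known-answer check: any dropped or duplicated worker range changes the sum, so accidental serialization or lost work is caught without inspecting per-worker logs.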

In-Process Threading Scalability Gates

  • Move capability-ring waiting/completion routing to the per-thread ThreadRef model before claiming same-process sibling threads scale independently on different CPUs.
  • Ensure thread creation, FS/TLS setup, thread exit, join, park waits, and process exit remain generation-checked and safe when sibling threads can be resident on different CPUs.
  • Add an in-process thread scaling demo that uses the same class of deterministic CPU-bound workload as the multi-process proof, but splits work across sibling threads in one process. It should verify the same checksum or count and print one compact result line per run.
  • Add a host harness for make run-thread-scale that runs 1/2/4-thread cases under matching QEMU CPU counts, captures raw logs, and rejects results where the runtime serializes the timed section through one process-wide ring owner.
  • Treat same-process speedup as a separate claim from multi-process SMP concurrency. Passing make run-smp-process-scale must not imply this milestone is complete.
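The in-process demo described above would reuse the same deterministic kernel but split it across sibling threads. The sketch below uses host `std::thread` purely to show the intended shape; whether the target runtime exposes an equivalent spawn/join surface, and whether threads actually run concurrently, depends on the per-thread completion routing this milestone gates on.

```rust
use std::thread;

/// Same deterministic CPU-bound kernel as the multi-process demo:
/// count primes in [lo, hi) by trial division.
fn count_primes(lo: u64, hi: u64) -> u64 {
    (lo.max(2)..hi)
        .filter(|&n| (2..=((n as f64).sqrt() as u64)).all(|d| n % d != 0))
        .count() as u64
}

/// Split [0, total) statically across `threads` sibling threads in one
/// process and combine the per-thread counts at join time.
fn threaded_count(threads: u64, total: u64) -> u64 {
    let chunk = total / threads;
    let handles: Vec<_> = (0..threads)
        .map(|i| {
            let lo = i * chunk;
            let hi = if i + 1 == threads { total } else { lo + chunk };
            thread::spawn(move || count_primes(lo, hi))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let total = 10_000;
    let expected = count_primes(0, total);
    // Verify the same checksum for the 1/2/4-thread cases.
    for t in [1, 2, 4] {
        assert_eq!(threaded_count(t, total), expected);
    }
    println!("thread-scale checksum ok: {expected}");
}
```

The checksum equality across thread counts proves correctness but not concurrency; timing under matching QEMU CPU counts, as the harness gate requires, is what distinguishes real parallelism from a serialized timed section.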

Task Selection

Choose a task that isolates scheduler and CPU parallelism rather than a subsystem bottleneck. Both milestones should use workload shapes with these properties:

  • CPU-bound and deterministic, with no network, disk, terminal, or heap-heavy hot path.
  • Naturally partitionable into independent chunks so workers do not share a lock, mutable buffer, or capability ring while the timed section runs.
  • Verifiable by a compact checksum, count, or known-answer oracle.
  • Long enough to dominate boot, process spawn, timer granularity, and serial logging overhead.
  • Runnable as independent worker processes for the multi-process milestone, and runnable as sibling threads only after per-thread completion routing is implemented for the in-process milestone.

Avoid using IPC throughput, capability-ring dispatch, park wake storms, console logging, or allocator stress as the first SMP scaling claim. Those are valid later benchmarks, but they measure shared kernel bottlenecks as much as CPU scheduling. Same-process thread scaling is explicitly a separate later milestone because the current process-wide completion ring admits only one blocked cap_enter waiter per process.

Grounding Files

  • docs/proposals/smp-proposal.md
  • docs/proposals/ring-v2-smp-proposal.md
  • docs/architecture/scheduling.md
  • docs/architecture/threading.md
  • docs/research/completion-ring-threading.md
  • docs/research/out-of-kernel-scheduling.md
  • docs/research/sel4.md
  • docs/research/zircon.md
  • docs/research/x2apic-and-virtualization.md

Notes

Initial multi-CPU scheduling may keep the current process ring while the runtime serializes process-ring consumption. Full SMP where sibling threads from one process wait independently on different CPUs should not keep the process-wide CQ as the kernel ABI endpoint. The target transport model is per-thread capability rings: cap_enter(min_complete, timeout_ns) waits on the current thread’s CQ, kernel waiters route completions by generation-checked ThreadRef, and SQPOLL becomes a per-ring mode with one kernel SQ consumer.
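The generation-checked ThreadRef routing described above could look roughly like this. Everything here is an assumption for illustration: the `ThreadRef` layout, the `Router` type, and the Vec-backed queue stand in for the kernel's real slot table and mapped per-thread rings; only the slot-plus-generation check itself reflects the stated model.

```rust
use std::collections::HashMap;

/// Hypothetical generation-checked thread handle: a slot index plus a
/// generation counter, so a recycled slot cannot receive stale completions.
#[derive(Clone, Copy)]
struct ThreadRef {
    slot: u32,
    gen: u32,
}

/// Stand-in for a per-thread completion queue; in the kernel this would be
/// a mapped ring that cap_enter waits on, not a Vec.
struct ThreadCq {
    gen: u32,
    completions: Vec<u64>,
}

struct Router {
    cqs: HashMap<u32, ThreadCq>,
}

impl Router {
    /// Route a completion to the waiter's own CQ. A generation mismatch
    /// means the thread exited and the slot was recycled, so the
    /// completion must not be delivered to the new occupant.
    fn route(&mut self, waiter: ThreadRef, completion: u64) -> bool {
        match self.cqs.get_mut(&waiter.slot) {
            Some(cq) if cq.gen == waiter.gen => {
                cq.completions.push(completion);
                true
            }
            _ => false,
        }
    }
}

fn main() {
    let mut r = Router { cqs: HashMap::new() };
    r.cqs.insert(7, ThreadCq { gen: 1, completions: Vec::new() });
    let live = ThreadRef { slot: 7, gen: 1 };
    let stale = ThreadRef { slot: 7, gen: 0 }; // exited thread, recycled slot
    assert!(r.route(live, 42));
    assert!(!r.route(stale, 43)); // stale generation is rejected
    println!("routing sketch ok");
}
```

Per-thread CQs are what remove the one-blocked-waiter-per-process limit: two sibling threads can each block in cap_enter on their own ring, so completions for one never wake or serialize the other.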

SharedParkSpace park-words still need MemoryObject mapping provenance or object pins before shared-key derivation lands.

2026-04-25 11:36 UTC: commit d88bca7 recorded the First AP Scheduler proof. AP cpu=1 can run scheduler-owned user contexts under -smp 2, and a one-way scheduler-owner latch prevents the BSP and AP from both entering scheduler-owned user work while the process-wide ring remains the active transport.