# Proposal: Symmetric Multi-Processing (SMP)

How capOS goes from single-CPU execution to utilizing all available processors.

## Grounding and Cross-Links

The SMP substrate is one half of capOS's multicore story; scheduler policy
above it is the other half, and they advance through coupled gates. Read this
proposal together with:

- [Scheduler Evolution](scheduler-evolution-proposal.md) -- Phase D (per-CPU
  WFQ, bounded stealing) and Phase E (`SchedulingContext` bind/revoke, budget,
  donation/return, depletion notification) are closed; Phase F has landed the
  one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work
  placement, bounded SQPOLL ring mode, the clockevent/deadline substrate, and
  bounded non-periodic SQPOLL producer-wake progress; the first
  automatic nohz activation increment is closed via
  [`docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md`](../tasks/done/2026-05-16/scheduler-phase-f-auto-nohz-activation.md),
  and SQPOLL-driven auto-nohz activation is closed via
  [`docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md`](../tasks/done/2026-05-16/scheduler-phase-f-auto-nohz-sqpoll.md);
  timeout-based auto-revoke and ordinary-thread generic full-nohz admission are
  also landed; generic SQPOLL nohz for arbitrary rings and policy-service
  AutoNoHz issuance remain future work;
  **Phase F.5 (full-SMP 16/32-core scalability planning) is the named gate for
  the milestone described below in [Full-SMP Scalability
  Milestone](#full-smp-scalability-milestone)** and remains planning, not
  closed.
- [In-Process Threading Contract](../architecture/threading.md) -- thread-owned
  execution state, generation-checked `ThreadRef` queues and wake records,
  per-thread ring mappings, and the recorded same-process 1-to-2 / diagnostic
  1-to-4 evidence rows that this proposal's scalability work must keep
  honoring.
- [Design Risks Register, Q9 -- CPU accounting and scheduling
  contexts](../design-risks-register.md#q9--cpu-accounting-and-scheduling-contexts)
  -- partial-status answer that covers per-CPU WFQ, Phase E
  `SchedulingContext`, and the cross-service donation / nohz activation /
  isolation lease / cross-principal fairness work still open.
- [Ring v2 For Full SMP](ring-v2-smp-proposal.md) -- per-thread ring
  endpoints and `cap_enter`-on-thread-CQ are the dispatch contract this
  proposal's scheduler-ownership milestones rely on.
- [SMP Phase C backlog](../backlog/smp-phase-c.md) -- decomposed task list for
  the in-progress Phase C work tracked below.

The migrated task
[`kernel-upper-half-pml4-propagation-hardening`](../tasks/done/2026-06-07/kernel-upper-half-pml4-propagation-hardening.md)
carries the Phase C residual for kernel upper-half page-table mutation after AP
startup. The retained finding is closed for the current kernel
MMIO/firmware helper path: `paging::init()` pre-seeds the helper's upper-half
PML4 slot, `AddressSpace::new_user` clones upper-half entries from the
synchronized kernel root under the kernel page-table lock, and
`map_kernel_physical_range` rejects any attempt to create a previously absent
kernel-half PML4 slot after a user address space has been created. User-side
`AddressSpace::{map,unmap,protect}` remains shootdown-aware against resident
CPU masks; kernel upper-half edits inside pre-existing slots use the
kernel-wide shootdown path. Future helper windows or allocator-growth paths
that would require a new upper-half PML4 slot must pre-seed that slot before
user address-space creation or add synchronized active propagation into live
address spaces.
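
The pre-seed rule above can be illustrated with a small model. This is a minimal sketch, not the kernel's `map_kernel_physical_range`: `KernelRoot`, `MapError`, `pre_seed`, and `check_kernel_map` are hypothetical names, and the model tracks only slot presence, not real page tables.

```rust
/// PML4 slot that translates `virt` (bits 47:39 of the virtual address).
fn pml4_index(virt: u64) -> usize {
    ((virt >> 39) & 0x1ff) as usize
}

/// Hypothetical model of the kernel root page table: which PML4 slots are
/// present, and whether any user address space has cloned the upper half.
struct KernelRoot {
    present: [bool; 512],
    user_space_created: bool,
}

#[derive(Debug, PartialEq, Eq)]
enum MapError {
    /// Creating this slot now would leave live address spaces without it.
    NewUpperHalfSlotAfterUserSpace { slot: usize },
}

impl KernelRoot {
    fn new() -> Self {
        Self { present: [false; 512], user_space_created: false }
    }

    /// Boot-time pre-seeding, done before any user address space exists.
    fn pre_seed(&mut self, virt: u64) {
        self.present[pml4_index(virt)] = true;
    }

    /// Admission check for a kernel mapping at `virt`. Edits inside an
    /// existing slot are allowed; a new upper-half slot is rejected once
    /// user address spaces have been created.
    fn check_kernel_map(&mut self, virt: u64) -> Result<usize, MapError> {
        let slot = pml4_index(virt);
        if slot >= 256 && !self.present[slot] {
            if self.user_space_created {
                return Err(MapError::NewUpperHalfSlotAfterUserSpace { slot });
            }
            self.present[slot] = true;
        }
        Ok(slot)
    }
}
```

In this model, a helper window whose slot was pre-seeded keeps working after the first user address space is created, while a mapping that would need a fresh upper-half slot fails before any page-table state changes.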

This document has three phases: a **per-CPU foundation** (prerequisite
plumbing), **AP startup** (bringing secondary CPUs online), and **SMP
correctness** (making shared state safe under concurrency).

Current status: Phase A's BSP per-CPU foundation and Phase B AP startup are
complete. Phase C has completed syscall GS migration, LAPIC/IPI, TLB
shootdown, the first AP scheduler-owner handoff, temporary scheduler ownership
on CPUs 0-3, per-CPU WFQ runnable queues under the shared scheduler lock,
bounded stealing, and bounded idle-to-runnable wake targeting for queued and
direct-IPC wakeups. The current scheduler is no longer the temporary
single-global-runnable-queue shape from the 2026-05-02 collapse. Remaining
SMP risks are the shared scheduler lock, temporary pinning replacement,
scheduler-driven AP idle policy, broader workload classes, and
higher-thread-count evidence. The next SMP product-level milestone should be
full-SMP scalability evidence on a real 16/32-core environment, with QEMU kept
for boot and regression coverage rather than as the primary performance source.

Implementation checkpoint: the BSP now has a concrete `PerCpu` object with
stable syscall-stack offsets, and syscall entry uses `KernelGsBase`/`swapgs`
to reach the per-CPU kernel RSP and saved user RSP slots. The scheduler mirrors
its current `ThreadRef` into the BSP record.

Second checkpoint: runtime stack switches now flow through
`percpu::set_kernel_entry_stack`, which updates the BSP `PerCpu.kernel_rsp`
slot and the BSP TSS.RSP0 together. Scheduler and interrupt paths no longer
coordinate those two updates by calling separate GDT and syscall helpers.

Third checkpoint: `kernel/src/arch/x86_64/smp.rs` now issues the Limine
`MpRequest`, enumerates non-BSP CPUs, allocates AP-local `PerCpu` records and
kernel/IST stack storage, and records dense capOS CPU ids separately from Limine
processor and LAPIC ids.

Fourth checkpoint: APs now start through `MpInfo::bootstrap()` and reach a
parked kernel idle loop. The BSP passes an AP record pointer through Limine
`extra_argument`, waits for a bounded online count, and remains the only CPU
that schedules userspace. Each AP loads AP-owned GDT/TSS state, the shared IDT,
`KernelGsBase`, and syscall MSRs, reports online, disables interrupts, and
parks in `hlt`. Review tightened this checkpoint so APs first switch from
Limine handoff state to the capOS kernel PML4 and AP-owned kernel stack before
any online signal.

Fifth checkpoint: syscall entry/exit now runs with kernel GS active between
entry and return. Normal returns swap back before `sysretq`, and blocking or
exiting syscall paths that leave through scheduler `iretq` restore use a
dedicated trampoline to swap GS back before restoring the next user context.

Sixth checkpoint: the BSP now enables xAPIC MMIO, maps the LAPIC page through
the kernel MMIO allocator, calibrates the LAPIC timer initial count against PIT
channel 2, runs scheduler ticks through LAPIC timer vector 48 with LAPIC EOI,
installs the LAPIC spurious vector, and masks the legacy PIC once LAPIC ticks
are active. Parked APs initialize local APIC state before reporting online. IDT
vector 49 and a bounded vector-49-only fixed IPI send primitive back TLB
shootdown and bounded idle-to-runnable reschedule requests.

Seventh checkpoint: user page-table `map`, `unmap`, and `protect` now flush the
local CPU and then route through a serialized vector-49 TLB shootdown helper
using each `AddressSpace`'s resident CPU mask. The helper records pending
full-TLB flush generations and sends vector-49 IPIs to online resident CPUs
other than the caller, then returns a completion token that callers wait on
after dropping ring dispatch locks. Scheduler CR3 handoff points mark the selected
address space resident on the current CPU.
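
The generation-and-token pattern in this checkpoint can be sketched in userspace Rust. This is an illustration under stated assumptions, not the kernel helper: `TlbState`, `CompletionToken`, `request_flush`, `drain`, and `is_complete` are hypothetical names, the IPI send and the actual TLB flush are left as comments, and a 64-CPU mask is assumed.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const MAX_CPUS: usize = 64;

/// Hypothetical per-CPU shootdown bookkeeping.
struct TlbState {
    /// Latest full-flush generation requested of each CPU.
    requested: [AtomicU64; MAX_CPUS],
    /// Latest generation each CPU has finished flushing.
    completed: [AtomicU64; MAX_CPUS],
}

/// What a requester waits on after dropping its locks.
struct CompletionToken {
    targets: u64,
    generations: [u64; MAX_CPUS],
}

/// Online resident CPUs other than the caller.
fn remote_targets(resident_mask: u64, online_mask: u64, caller_cpu: usize) -> u64 {
    resident_mask & online_mask & !(1u64 << caller_cpu)
}

impl TlbState {
    fn new() -> Self {
        Self {
            requested: std::array::from_fn(|_| AtomicU64::new(0)),
            completed: std::array::from_fn(|_| AtomicU64::new(0)),
        }
    }

    /// Record a pending flush generation for every target CPU.
    /// A real kernel would send the shootdown IPI after this loop.
    fn request_flush(&self, targets: u64) -> CompletionToken {
        let mut generations = [0u64; MAX_CPUS];
        for cpu in 0..MAX_CPUS {
            if targets & (1u64 << cpu) != 0 {
                generations[cpu] = self.requested[cpu].fetch_add(1, Ordering::AcqRel) + 1;
            }
        }
        CompletionToken { targets, generations }
    }

    /// Runs on the target CPU (IPI handler or a return-path hook).
    fn drain(&self, cpu: usize) {
        let pending = self.requested[cpu].load(Ordering::Acquire);
        // A full TLB flush on this CPU would happen here.
        self.completed[cpu].fetch_max(pending, Ordering::AcqRel);
    }

    /// True once every target has flushed at least the recorded generation.
    fn is_complete(&self, token: &CompletionToken) -> bool {
        (0..MAX_CPUS)
            .filter(|&cpu| token.targets & (1u64 << cpu) != 0)
            .all(|cpu| self.completed[cpu].load(Ordering::Acquire) >= token.generations[cpu])
    }
}
```

Because the token records a generation per target rather than a simple flag, a target that drains late (for example at syscall entry instead of in the IPI handler) still satisfies every token issued before that drain.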

Eighth checkpoint: scheduler current-thread state is split into per-CPU slots,
AP `PerCpu` records are registered for current-thread and kernel-entry stack
updates, AP TSS.RSP0 is updated during context switches, and AP cpu=1 can enter
the scheduler from the AP idle loop when its LAPIC timer is available. The
first AP proof intentionally keeps one scheduler owner: when AP cpu=1 is online
with a programmed timer, the BSP remains in kernel idle so the process-wide
capability ring is not executed concurrently. The scheduler idle path is now a
per-CPU CPL0 (kernel-mode) idle thread; the user-mode idle process was removed
in commit e3c0df01 (2026-05-14 UTC). "Kernel idle" throughout this proposal
refers to that per-CPU CPL0 idle thread, not a user-mode idle process.

**Depends on:** Stage 5 (Scheduling) -- needs a working timer, context switch,
and run queue on the BSP before adding more CPUs.

**Phase B completion:** AP startup is implemented and reviewed. The private
process-buffer `validate_user_buffer` TOCTOU blocker is closed for single
locked copy/read paths, and Phase A now
has the BSP running through concrete per-CPU syscall-stack/current-thread
state. TLB shootdown, the first AP scheduler-owner handoff, temporary scheduler
ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and
bounded idle-to-runnable wake targeting are implemented; shared scheduler lock
contention, temporary pinning replacement, scheduler-driven AP idle policy,
broader workload classes, higher-thread-count evidence, and shared
SharedParkSpace park key derivation remain later Stage 7 work. Shared
keys still need MemoryObject mapping provenance or object pins before they can
keep backing stable beyond one address-space-locked access.

---

## Full-SMP Scalability Milestone

The current SMP evidence reaches four physical-core workers and one
eight-logical-CPU SMT run under QEMU/KVM. That was enough to expose scheduler
structure problems, but it is not the shape that should define whether capOS
really uses modern multicore machines. The next SMP milestone should answer a
more concrete question: can ordinary capOS workloads keep useful throughput and
bounded scheduler overhead as the machine scales to 16 and 32 physical cores?

Preferred evidence environment:

- direct capOS boot on a dedicated bare-metal or cloud bare-metal/perf-runner
  machine with at least 16 physical cores, and a 32-core row when hardware is
  available;
- recorded CPU topology, SMT state, APIC mode, timer source, frequency policy,
  memory size, firmware/device model, source commit, toolchain, and kernel
  configuration;
- Linux native baselines on the same machine for comparable CPU workloads;
- QEMU/KVM rows only for boot/regression continuity or for explicitly labeled
  virtualized comparisons.

Workload coverage should move beyond one fixed checksum row:

- static map/reduce checksum over equal byte ranges;
- uneven dynamic task pool with deterministic task ids and result hash;
- barrier-heavy phase loop that exposes wakeup and cross-CPU coordination cost;
- same-process thread workload and independent-process workload;
- IPC/service-bound worker workload that includes capability calls outside the
  timed compute loop.

Each workload should report 1, 2, 4, 8, 16, and 32-worker rows when the
hardware supports those counts, with SMT rows separated from physical-core
rows. Each row should include both work-window time and total time, run count,
warmup policy, median, variance, and verifier output. The report should show
speedup and efficiency curves instead of reducing the result to one boolean
threshold.
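
As a reference for the report arithmetic, the sketch below computes a per-row median and the derived speedup and efficiency values. The function names are illustrative, and the definitions used (speedup as baseline median over row median, efficiency as speedup divided by worker count) are the conventional ones, assumed here rather than taken from the benchmark proposal.

```rust
/// Median of the recorded run times for one row, in nanoseconds.
fn median_ns(samples: &[u64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid] as f64
    } else {
        (sorted[mid - 1] as f64 + sorted[mid] as f64) / 2.0
    })
}

/// Speedup of a row relative to the 1-worker baseline row.
fn speedup(baseline_median_ns: f64, row_median_ns: f64) -> f64 {
    baseline_median_ns / row_median_ns
}

/// Parallel efficiency: 1.0 means perfect linear scaling.
fn efficiency(speedup: f64, workers: u32) -> f64 {
    speedup / workers as f64
}
```

For example, a 1-worker median of 8000 ns and a 4-worker median of 2500 ns give a speedup of 3.2 and an efficiency of 0.8; plotting these per worker count yields the curves the report should show.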

Implementation work expected before this milestone:

- replace the temporary scheduler CPU mask and static four-owner assumptions
  with discovered CPU topology and dynamic per-CPU scheduler structures;
- decide xAPIC versus x2APIC backend selection for larger APIC-id spaces;
- split or otherwise shrink the shared scheduler-lock critical sections that
  still serialize queue selection, wakeups, blocking, and cleanup;
- make placement topology-aware enough to distinguish physical cores, SMT
  siblings, and later NUMA/cache groups;
- keep TLB shootdown, timer, reschedule-IPI, cleanup, and accounting costs
  observable per CPU and per workload phase;
- keep per-thread ring ownership and SQ-consumer ownership generation-checked
  as CPU count rises.

This milestone belongs with scheduler evolution and benchmark planning rather
than a new standalone proposal: the SMP proposal defines the CPU substrate,
[Scheduler Evolution Phase F.5](scheduler-evolution-proposal.md) defines
dispatch and policy work for full-SMP 16/32-core scalability, the benchmark
proposal defines artifact shape, and the HPC parallel-pattern proposal defines
the workload matrix. Q9 in the
[design risks register](../design-risks-register.md#q9--cpu-accounting-and-scheduling-contexts)
is the matching open-question entry: base CPU accounting and scheduling-context
authority through Phase E are implemented, while cross-service donation, full
nohz activation, CPU isolation leases, and cross-principal fairness are the
named follow-ons that this milestone's evidence will be evaluated against.

## Current State

APs can boot into kernel idle loops, and CPUs 0-3 can temporarily own
scheduler/user work when their LAPIC timers are available. Specific
assumptions that Phase C must still remove:

| Component | File | Assumption |
|---|---|---|
| Syscall stack switching | `kernel/src/arch/x86_64/syscall.rs`, `kernel/src/arch/x86_64/percpu.rs` | Syscall entry/exit uses `KernelGsBase`/`swapgs` and GS-relative `PerCpu` stack fields on the running CPU |
| AP GDT, TSS, kernel stacks | `kernel/src/arch/x86_64/gdt.rs`, `kernel/src/arch/x86_64/smp.rs` | AP-local descriptor tables and stacks exist, and AP TSS.RSP0 updates during AP scheduler context switches |
| IDT | `kernel/src/arch/x86_64/idt.rs` | Single static IDT (shareable -- IDT can be the same across CPUs) |
| SYSCALL MSRs | `kernel/src/arch/x86_64/syscall.rs`, `kernel/src/arch/x86_64/smp.rs` | STAR/LSTAR/SFMASK/EFER are initialized on BSP and parked APs; BSP and AP startup both publish `KernelGsBase` |
| Current thread and run queues | `kernel/src/sched.rs`, `kernel/src/arch/x86_64/percpu.rs` | `SCHEDULER` owns per-CPU current slots, per-CPU WFQ runnable queues ordered by `virtual_finish_ns`, bounded stealing from sibling queues, and wake placement through `WakePolicy::QueueCpu`; queued and direct-IPC wakeups iterate eligible idle scheduler CPUs and wake the first that accepts a fresh reschedule IPI, and CPUs 0-3 can temporarily own scheduler/user execution when their LAPIC timers are available, while shared-lock reduction, temporary pinning replacement, broader workload evidence, and higher-thread-count evidence remain deferred |
| Timer/IPI delivery | `kernel/src/arch/x86_64/context.rs`, `kernel/src/arch/x86_64/lapic.rs`, `kernel/src/arch/x86_64/pic.rs`, `kernel/src/arch/x86_64/pit.rs`, `kernel/src/arch/x86_64/tlb.rs` | CPUs 0-3 use PIT-calibrated LAPIC timer vector 48 with LAPIC EOI when online; vector 49 services TLB shootdown and bounded reschedule requests |
| Frame allocator | `kernel/src/mem/frame.rs` | Single global `ALLOCATOR` behind one spinlock |
| Heap allocator | `kernel/src/mem/heap.rs` | `linked_list_allocator` behind one spinlock |

The first checkpoint removed the separate syscall RSP globals and made the BSP
`PerCpu` layout the owner of syscall stack state. The GS checkpoint now uses
`KernelGsBase`/`swapgs` for those offsets on syscall paths. The LAPIC checkpoint
removed the PIT/PIC interrupt dependency from the normal BSP scheduler tick,
kept PIT channel 2 as the LAPIC calibration source, installed the spurious
vector, and wired the IPI vector. The TLB checkpoint added resident CPU masks,
vector-49 shootdown, pending generation counters, completion waits, and
syscall-entry plus flush-before-user-return hooks for delayed maskable interrupt
delivery. The AP scheduler-owner checkpoint added per-CPU current slots and AP
cpu=1 scheduler entry. The remaining Phase C assumptions are in concurrent
run-queue ownership and reschedule routing, not in syscall stack lookup, the
primary timer source, user page-table mutation invalidation, or AP TSS updates.
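
The bounded wake targeting described in the table can be sketched as a walk over a CPU bitmask. This is a hypothetical model: `pick_wake_target`, the mask parameters, and the `send_resched_ipi` callback are illustrative names, and the probe bound stands in for whatever limit the scheduler actually applies.

```rust
/// Walk eligible idle CPUs in ascending id order and return the first one
/// that accepts a fresh reschedule IPI, probing at most `max_probes` CPUs.
/// `send_resched_ipi` returns false when the target already has a pending
/// request or otherwise declines.
fn pick_wake_target(
    idle_mask: u64,
    eligible_mask: u64,
    max_probes: u32,
    mut send_resched_ipi: impl FnMut(u32) -> bool,
) -> Option<u32> {
    let mut candidates = idle_mask & eligible_mask;
    let mut probes = 0;
    while candidates != 0 && probes < max_probes {
        let cpu = candidates.trailing_zeros();
        candidates &= candidates - 1; // clear the lowest set bit
        probes += 1;
        if send_resched_ipi(cpu) {
            return Some(cpu);
        }
    }
    None
}
```

When no CPU accepts, the caller falls back to leaving the thread queued; the bound keeps the wake path's cost independent of the total CPU count.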

---

## Phase A: Per-CPU Foundation

Establish per-CPU data structures on the BSP. No APs are started yet -- this
phase makes the BSP's own code SMP-ready so Phase B is a clean addition.

### Per-CPU Data Region

Each CPU needs a private data area accessible via the GS segment base. On
x86_64, `swapgs` exchanges the active GS base with the `KernelGsBase` MSR, so
kernel entry swaps from the user-mode GS base (usually zero) to the
kernel-mode GS base (pointing to per-CPU data). The kernel sets the
`KernelGsBase` MSR on each CPU during init.

The BSP checkpoint originally reached this layout as `BSP_PER_CPU+offset` from
assembly. Phase C now uses the same offsets through GS after `swapgs` on
syscall entry.

```rust
/// Per-CPU data, one instance per processor.
/// Accessed via GS-relative addressing after swapgs.
#[repr(C)]
struct PerCpu {
    /// Self-pointer for accessing the struct from GS:0.
    self_ptr: *const PerCpu,
    /// Kernel stack pointer for syscall entry (replaces SYSCALL_KERNEL_RSP).
    kernel_rsp: u64,
    /// Saved user RSP during syscall (replaces SYSCALL_USER_RSP).
    user_rsp: u64,
    /// Currently running thread on this CPU, if one is active.
    current_thread: Option<ThreadRef>,
    /// CPU index (0 = BSP).
    cpu_id: u32,
    /// LAPIC ID (from Limine MP info or CPUID).
    lapic_id: u32,
}
```

The previous checkpointed syscall entry stub used the same offsets via the BSP
symbol:

```asm
movq %rsp, BSP_PER_CPU+16(%rip) # PerCpu.user_rsp
movq BSP_PER_CPU+8(%rip), %rsp  # PerCpu.kernel_rsp
```

The current syscall entry stub uses GS-relative addressing:

```asm
swapgs
movq %rsp, %gs:16          # PerCpu.user_rsp
movq %gs:8, %rsp           # PerCpu.kernel_rsp
```

And symmetrically on return:

```asm
movq %gs:16, %rsp          # restore user RSP
swapgs
sysretq
```
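
Since the assembly hard-codes offsets 8 and 16, one way to keep them honest is a compile-time layout check. The sketch below is an assumption about how that could look, not existing capOS code: the `ThreadRef` stand-in and the constant names are hypothetical, and the offsets assume a 64-bit target where pointers are 8 bytes.

```rust
use core::mem::offset_of;

/// Stand-in for the kernel's generation-checked thread handle.
#[derive(Clone, Copy)]
#[repr(C)]
struct ThreadRef {
    index: u32,
    generation: u32,
}

#[repr(C)]
struct PerCpu {
    self_ptr: *const PerCpu,
    kernel_rsp: u64,
    user_rsp: u64,
    current_thread: Option<ThreadRef>,
    cpu_id: u32,
    lapic_id: u32,
}

/// Offsets the syscall assembly is written against.
const PERCPU_KERNEL_RSP_OFFSET: usize = 8;
const PERCPU_USER_RSP_OFFSET: usize = 16;

// Fail the build if a field is reordered or resized without updating the
// assembly stubs.
const _: () = assert!(offset_of!(PerCpu, self_ptr) == 0);
const _: () = assert!(offset_of!(PerCpu, kernel_rsp) == PERCPU_KERNEL_RSP_OFFSET);
const _: () = assert!(offset_of!(PerCpu, user_rsp) == PERCPU_USER_RSP_OFFSET);
```

With `#[repr(C)]`, field order and padding follow C layout rules, so the three leading 8-byte fields land at offsets 0, 8, and 16 regardless of what follows them.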

Non-returning syscall paths need separate handling: `exit`, a blocking
`cap_enter`, and a terminal `ThreadControl.exitThread` can leave the syscall
entry path by building a `CpuContext` and restoring another thread with
`iretq`. Those paths must restore user GS ownership before `iretq`, even though
they never execute the normal `sysretq` epilogue.

### Lock And Ownership Rules

`PerCpu` fields split by owner:

- `kernel_rsp` and `TSS.RSP0` are updated together through
  `percpu::set_kernel_entry_stack`.
- `user_rsp` is written only by syscall entry assembly and read only while
  constructing a blocked-syscall `CpuContext`.
- `current_thread` mirrors `Scheduler.current`; the scheduler lock remains
  the authority for choosing and validating the current thread.
- `cpu_id` and `lapic_id` are immutable after CPU initialization.

Phase A keeps the global scheduler lock and process table. The `PerCpu`
current field is not a second scheduler authority; it is the per-CPU execution
cache that Phase C uses once multiple CPUs stop sharing one `current`
slot.
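
The first ownership rule (one helper updates both stack consumers) can be sketched as follows. `CpuLocal`, `Tss`, and `PerCpuStacks` are simplified hypothetical types for illustration; the real `percpu::set_kernel_entry_stack` operates on the kernel's own `PerCpu` and TSS structures.

```rust
/// Simplified stand-ins for the two places that hold the kernel entry stack.
struct Tss {
    /// Stack loaded on an interrupt from ring 3.
    rsp0: u64,
}

struct PerCpuStacks {
    /// Stack loaded by the syscall entry stub via GS.
    kernel_rsp: u64,
}

struct CpuLocal {
    per_cpu: PerCpuStacks,
    tss: Tss,
}

impl CpuLocal {
    /// Single update point: callers never write one slot without the other,
    /// so syscall entry and interrupt entry always agree on the stack.
    fn set_kernel_entry_stack(&mut self, stack_top: u64) {
        debug_assert!(stack_top % 16 == 0, "kernel stack top should be 16-byte aligned");
        self.per_cpu.kernel_rsp = stack_top;
        self.tss.rsp0 = stack_top;
    }
}
```

Routing both writes through one function is what lets scheduler and interrupt paths stop calling separate GDT and syscall helpers.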

### Per-CPU GDT, TSS, and Stacks

Each CPU needs its own:

- **GDT** -- the TSS descriptor encodes the linear (virtual) base address of
  the CPU's TSS, so each CPU needs a GDT with its own TSS entry. The segment
  layout (kernel CS/DS, user CS/DS) is identical across CPUs.
- **TSS** -- `privilege_stack_table[0]` (kernel stack for interrupts from
  Ring 3) and IST entries (double-fault stack) must be per-CPU.
- **Kernel stack** -- each CPU needs its own stack for syscall/interrupt
  handling. Current size: 16 KB (4 pages). Same size per CPU.
- **Double-fault stack** -- each CPU needs its own IST stack. Current size:
  20 KB (5 pages).

```rust
/// Allocate and initialize per-CPU structures for one CPU.
fn init_per_cpu(cpu_id: u32, lapic_id: u32) -> &'static PerCpu {
    // Allocate kernel stack (4 pages) and double-fault stack (5 pages)
    let kernel_stack = alloc_stack(4);
    let df_stack = alloc_stack(5);

    // Create TSS with per-CPU stacks
    let mut tss = TaskStateSegment::new();
    tss.privilege_stack_table[0] = kernel_stack.top();
    tss.interrupt_stack_table[DOUBLE_FAULT_IST_INDEX] = df_stack.top();

    // Create GDT with this CPU's TSS
    let (gdt, selectors) = create_gdt(&tss);

    // Allocate and populate PerCpu struct
    let per_cpu = Box::leak(Box::new(PerCpu {
        self_ptr: core::ptr::null(),  // filled below
        kernel_rsp: kernel_stack.top().as_u64(),
        user_rsp: 0,
        current_thread: None,
        cpu_id,
        lapic_id,
    }));
    per_cpu.self_ptr = per_cpu as *const PerCpu;
    per_cpu
}
```

### LAPIC Initialization

Stage 5 uses the 8254 PIT (100 Hz) and 8259A PIC (IRQ0 → vector 32) for
preemption on the BSP. AP startup must initialize enough local-APIC state for
secondary CPUs to park in a kernel idle loop and for later IPIs. Migrating BSP
preemption from the PIT to the LAPIC timer was a prerequisite for multi-CPU
scheduling, since the PIT is a single shared device that cannot provide
per-CPU timer interrupts. LAPIC work is needed for:

- **Per-CPU timer** -- replace PIT with LAPIC timer (required for SMP)
- **IPI** -- inter-processor interrupts for TLB shootdown and reschedule requests
- **Spurious interrupt vector** -- must be configured per-CPU

2026-04-25 research decision: the immediate Phase C LAPIC/IPI foundation uses
xAPIC MMIO, LAPIC timer vector 48, IPI vector 49, LAPIC EOI, AP LAPIC
initialization, and PIT/PIC fallback. The grounding note
[x2APIC and APIC virtualization](../research/x2apic-and-virtualization.md)
records the checked Intel and QEMU/KVM sources and keeps x2APIC as a later
backend rather than a reason to rework the current LAPIC gate.
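
The PIT-based calibration reduces to a small piece of arithmetic. The sketch below assumes the LAPIC timer's divide configuration is already reflected in the measured tick count; `lapic_initial_count` and its parameters are illustrative names, while 1,193,182 Hz is the standard PIT input clock.

```rust
/// Input clock of the 8254 PIT in Hz.
const PIT_HZ: u64 = 1_193_182;

/// Derive the LAPIC timer initial count for a periodic tick of `tick_hz`,
/// given how many LAPIC timer ticks elapsed while the PIT counted
/// `pit_ticks_elapsed` of its own ticks.
fn lapic_initial_count(
    lapic_ticks_elapsed: u64,
    pit_ticks_elapsed: u64,
    tick_hz: u64,
) -> Option<u32> {
    if pit_ticks_elapsed == 0 || tick_hz == 0 {
        return None;
    }
    // LAPIC timer frequency = measured ticks scaled to one second of PIT time.
    let lapic_hz = (lapic_ticks_elapsed as u128 * PIT_HZ as u128) / pit_ticks_elapsed as u128;
    let count = lapic_hz / tick_hz as u128;
    if count == 0 {
        return None; // measurement too coarse for the requested tick rate
    }
    // The initial-count register is 32 bits wide.
    u32::try_from(count).ok()
}
```

A measurement window of roughly 10 ms (about 11,932 PIT ticks) is enough to program the 100 Hz tick rate that Stage 5 used with the PIT.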

### Crate Dependencies

| Crate | Purpose | no_std |
|---|---|---|
| manual xAPIC MMIO backend | current LAPIC timer, EOI, IPI, spurious vector foundation | yes |
| future manual x2APIC MSR backend using `x86_64` MSR access | newer/high-core systems and firmware states where xAPIC is unavailable or undesirable | yes |

The current LAPIC path uses xAPIC MMIO through the kernel MMIO mapper. The
later x2APIC backend should still be small and explicit rather than adding an
APIC abstraction crate: read the APIC ID, enable x2APIC through
`IA32_APIC_BASE`, program the spurious-vector register, local-vector timer,
timer divide/initial-count registers, EOI, and ICR sends through MSRs. I/O APIC
remains separate MMIO hardware discovered through ACPI MADT and belongs to the
later interrupt-infrastructure/cloud path.
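
To show how small such a backend can stay, the sketch below lists the register mapping it would need. The constants follow the architectural xAPIC-to-x2APIC mapping (x2APIC MSR = `0x800` plus the MMIO offset shifted right by 4); the function names are hypothetical, the MSR reads and writes themselves are omitted, and the ICR helper leaves the level and trigger bits at zero, which a real backend should confirm against the Intel SDM.

```rust
/// IA32_APIC_BASE MSR and its enable bits.
const IA32_APIC_BASE: u32 = 0x1B;
const APIC_BASE_GLOBAL_ENABLE: u64 = 1 << 11;
const APIC_BASE_X2APIC_ENABLE: u64 = 1 << 10;

/// xAPIC MMIO register offsets used by the current backend's feature set.
const XAPIC_ID: u32 = 0x020;
const XAPIC_EOI: u32 = 0x0B0;
const XAPIC_SPURIOUS: u32 = 0x0F0;
const XAPIC_ICR_LOW: u32 = 0x300;
const XAPIC_LVT_TIMER: u32 = 0x320;
const XAPIC_TIMER_INITIAL_COUNT: u32 = 0x380;
const XAPIC_TIMER_DIVIDE: u32 = 0x3E0;

/// x2APIC MSR number for a given xAPIC MMIO register offset.
const fn x2apic_msr(xapic_offset: u32) -> u32 {
    0x800 + (xapic_offset >> 4)
}

/// Value to write back to IA32_APIC_BASE to switch the LAPIC to x2APIC mode.
const fn x2apic_enable_value(current_apic_base: u64) -> u64 {
    current_apic_base | APIC_BASE_GLOBAL_ENABLE | APIC_BASE_X2APIC_ENABLE
}

/// 64-bit ICR value for a fixed-delivery, physical-destination IPI.
/// In x2APIC mode the destination APIC id occupies bits 63:32 and the ICR is
/// written with a single MSR write.
const fn x2apic_fixed_ipi(dest_apic_id: u32, vector: u8) -> u64 {
    ((dest_apic_id as u64) << 32) | vector as u64
}
```

The 32-bit destination field is the reason x2APIC matters for larger machines: xAPIC physical destinations are limited to 8-bit APIC ids.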

### Migration Path

Phase A was a refactor of existing single-CPU code, not an addition:

1. Add `PerCpu` struct, allocate one instance for BSP. **Done for BSP static
   storage.**
2. Set BSP's `KernelGsBase` MSR, add `swapgs` to syscall entry/exit.
   **Done for syscall entry/exit, including syscall-to-`iretq` exits.**
3. Replace `SYSCALL_KERNEL_RSP`/`SYSCALL_USER_RSP` globals with per-CPU
   accesses. **Done; syscall assembly uses GS-relative `PerCpu` offsets.**
4. Replace scheduler's global `SCHEDULER.current` with `PerCpu.current_thread`.
   **Partially done: the BSP per-CPU record mirrors `Scheduler.current`; the
   scheduler lock remains authoritative for current-thread and queue ownership
   until shared scheduler metadata is split further.**
5. Move GDT/TSS stack updates behind the per-CPU path. **Done for the BSP
   runtime stack-update hook; AP-local GDT/TSS allocation belongs to Phase B.**
6. Migrate BSP from PIT to LAPIC timer (PIT initialized in Stage 5).
   **Done for the BSP timer path, with PIT used for calibration and PIT/PIC
   retained as a fallback.**

After Phase A, the kernel still runs user work on one CPU but the BSP per-CPU
plumbing is in place. Existing tests (`make run-smoke` and `make run-spawn`)
continue to pass.

---

## Phase B: AP Startup

Bring Application Processors (APs) online. Each AP runs the same kernel code
with its own per-CPU state.

2026-04-25 grounding checkpoint: the next implementation slice should use the
current local `limine` crate's MP API, not the older `SmpRequest` naming used
in some protocol examples. In capOS's pinned crate, `limine::request::MpRequest`
returns architecture-specific `limine::mp::MpRespData`; x86_64 CPU records are
`limine::mp::MpInfo` values with `processor_id`, `lapic_id`,
`MpInfo::bootstrap(entry, extra_arg)`, and `MpInfo::extra_argument()`. The
Phase B implementation is split into two checkpoints: first enumerate CPUs,
assign dense capOS CPU ids separately from Limine's ACPI `processor_id`, and
allocate AP state/stack slots; then bind each non-BSP CPU to a slot via
`extra_arg`, start it with `bootstrap`, and park it in a kernel idle loop after
local CPU initialization. Both checkpoints are implemented; APs still must not
run userspace or mutate the global scheduler.

### Limine MP Request

Limine provides an MP response with per-CPU records. Each x86_64 record
contains an ACPI processor id, LAPIC ID, and an atomic boot handoff. In the
local `limine` crate, callers should use `MpInfo::bootstrap()` rather than
writing the raw `goto_addr` field directly.

```rust
use limine::request::MpRequest;

static MP_REQUEST: MpRequest = MpRequest::new(0);

fn start_aps() {
    let mp = MP_REQUEST.response().expect("no MP response");
    let mut next_cpu_id = 1;
    for cpu in mp.cpus() {
        if cpu.lapic_id == mp.bsp_lapic_id {
            continue; // skip BSP
        }
        let cpu_id = next_cpu_id;
        next_cpu_id += 1;
        record_boot_processor_id(cpu_id, cpu.processor_id);
        let ap = init_ap_record(cpu_id, cpu.processor_id, cpu.lapic_id);
        cpu.bootstrap(ap_entry, ap as *const ApCpu as u64);
    }
}
```

### AP Entry

Each AP must:

1. Switch to the capOS kernel PML4 and AP-owned kernel stack
2. Enable per-CPU CR4 state used by the kernel page tables and user-access
   guards
3. Load its per-CPU GDT and TSS
4. Load the shared IDT
5. Set `KernelGsBase` MSR to its `PerCpu` pointer
6. Configure SYSCALL MSRs (STAR, LSTAR, SFMASK, EFER.SCE)
7. Signal "ready" to BSP (atomic flag or counter)
8. Enter a parked kernel idle loop

Local APIC timer setup and IPI handling remain separate Stage 7 gates; parked
APs keep interrupts disabled until that work is ready.

```rust
/// AP entry point. Called by Limine with the MP info pointer.
unsafe extern "C" fn ap_entry(info: &limine::mp::MpInfo) -> ! {
    let ap_ptr = info.extra_argument() as *const ApCpu;
    let ap = unsafe {
        ap_ptr
            .as_ref()
            .expect("Limine AP extra_argument must be an ApCpu pointer")
    };
    let per_cpu = ap.per_cpu();

    // Switch from Limine state to capOS-owned paging and AP stack.
    ap.switch_to_kernel_paging_and_stack();

    // Match per-CPU CR4 state after the kernel PML4 is live.
    paging::enable_global_pages_on_current_cpu();
    smap::init();

    // Load this CPU's GDT + TSS
    ap.descriptors.load();

    // Shared IDT (same across all CPUs)
    idt::init();

    // Set GS base for swapgs
    unsafe { wrmsr(IA32_KERNEL_GS_BASE, per_cpu as *const _ as u64); }

    // Configure syscall MSRs (same values as BSP)
    syscall::init_msrs();

    // Signal ready
    ap.online.store(true, Ordering::Release);
    AP_READY_COUNT.fetch_add(1, Ordering::AcqRel);

    // Park until a later scheduler milestone gives APs runnable work.
    ap_idle_loop();
}
```

The `extra_argument` pointer must name an initialized, non-null `ApCpu` record
whose storage outlives the AP. The BSP publishes that record before calling
`MpInfo::bootstrap()`, and the AP treats the contained `PerCpu` pointer as
CPU-local state after entry.

### Scheduler Boundary

Phase B does not extend the Stage 5 scheduler. The BSP remains the only CPU
that runs userspace or mutates the global scheduler. APs only run enough kernel
initialization to prove that per-CPU architectural state is valid, signal ready,
and park in a bounded `hlt` loop.

Per-CPU WFQ runnable queues under the shared scheduler lock, bounded stealing
that chooses the most-overdue runnable sibling candidate, bounded
idle-to-runnable wake targeting that walks eligible idle scheduler CPUs, and
address-space CPU residency tracking are the current Phase C structure. The
temporary 2026-05-02 single-global-runnable-queue collapse is historical;
Scheduler Evolution Phase D (closed 2026-05-10) reintroduced per-CPU queues
with weighted fair ordering, and Phase E closed `SchedulingContext`
bind/revoke, budget, donation/return, and depletion notification on top of
that. Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry,
housekeeping/deferred-work placement, the bounded SQPOLL ring mode, the
clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake
progress; the first automatic nohz activation increment is closed via
[`docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md`](../tasks/done/2026-05-16/scheduler-phase-f-auto-nohz-activation.md),
and SQPOLL-driven auto-nohz activation is closed via
[`docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md`](../tasks/done/2026-05-16/scheduler-phase-f-auto-nohz-sqpoll.md);
timeout-based auto-revoke and ordinary-thread generic full-nohz admission are
also landed. Generic SQPOLL nohz for arbitrary rings and policy-service AutoNoHz
issuance remain future work.

CPU affinity policy, shared scheduler metadata splitting, scheduler-driven AP
idle policy, broader workload classes, higher-thread-count evidence, and the
named Phase F.5 16/32-core scalability proof remain Phase C/F follow-ups. The
first Phase C scheduler proof may continue to use the current process ring
while the runtime serializes ring consumption.

Full SMP where sibling threads from one process wait independently on different
CPUs should use the Ring v2 direction in
[Ring v2 For Full SMP](ring-v2-smp-proposal.md): `cap_enter` waits on the
current thread's CQ, not on a shared process CQ.
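
The queue selection and bounded stealing described at the start of this section can be modeled in a few lines. This is a sketch under assumptions, not the scheduler's code: `Runnable`, `CpuQueue`, and `pick_next` are hypothetical names, "most overdue" is interpreted as the smallest `virtual_finish_ns` among sibling queue heads, and the single shared lock is represented by the exclusive `&mut` borrow.

```rust
use std::collections::BTreeSet;

/// A runnable thread, ordered by virtual finish time first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Runnable {
    virtual_finish_ns: u64,
    thread_id: u32,
}

/// One CPU's WFQ runnable queue.
#[derive(Default)]
struct CpuQueue {
    runnable: BTreeSet<Runnable>,
}

/// Pick the next thread for `cpu`: the local queue head if any, otherwise
/// the most overdue head among at most `max_steal_scan` sibling queues.
fn pick_next(queues: &mut [CpuQueue], cpu: usize, max_steal_scan: usize) -> Option<Runnable> {
    if let Some(local) = queues[cpu].runnable.pop_first() {
        return Some(local);
    }
    let cpu_count = queues.len();
    let scan = max_steal_scan.min(cpu_count.saturating_sub(1));
    let mut best: Option<(usize, Runnable)> = None;
    for step in 1..=scan {
        let sibling = (cpu + step) % cpu_count;
        if let Some(&head) = queues[sibling].runnable.first() {
            let better = match best {
                Some((_, current)) => head.virtual_finish_ns < current.virtual_finish_ns,
                None => true,
            };
            if better {
                best = Some((sibling, head));
            }
        }
    }
    let (victim, stolen) = best?;
    queues[victim].runnable.remove(&stolen);
    Some(stolen)
}
```

Bounding the scan keeps an idle CPU's steal attempt cheap as the CPU count grows, at the cost of sometimes missing a more overdue thread on an unscanned sibling.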

### Boot Sequence

```
BSP: kernel init (GDT, IDT, memory, heap, caps, scheduler)
BSP: init_per_cpu(0, bsp_lapic_id)
BSP: start_aps()
  AP1: ap_entry() → switch CR3/RSP → init GDT/TSS/syscall state → idle_loop()
  AP2: ap_entry() → switch CR3/RSP → init GDT/TSS/syscall state → idle_loop()
  ...
BSP: wait for all APs ready
BSP: load init process, schedule it
BSP: enter scheduler
```
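
The "wait for all APs ready" step is a bounded wait rather than an unconditional spin. A minimal sketch, assuming a shared ready counter like the `AP_READY_COUNT` used in `ap_entry`; the function name, the spin budget, and the error shape are hypothetical, and a kernel build would use the `core` equivalents of these imports.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Spin until `expected` APs have reported ready or the spin budget runs out.
/// On timeout, returns the number of APs that did come online so the BSP can
/// log it and continue with the CPUs it has.
fn wait_for_aps(ready: &AtomicUsize, expected: usize, max_spins: u64) -> Result<(), usize> {
    for _ in 0..max_spins {
        // Acquire pairs with the AP's release-ordered ready signal, so the
        // BSP also observes the AP's earlier initialization writes.
        if ready.load(Ordering::Acquire) >= expected {
            return Ok(());
        }
        spin_loop();
    }
    let online = ready.load(Ordering::Acquire);
    if online >= expected { Ok(()) } else { Err(online) }
}
```

Bounding the wait means a CPU that never leaves firmware handoff state degrades the boot to fewer CPUs instead of hanging the BSP.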

---

## Phase C: SMP Correctness

With APs parked in kernel idle loops, Phase C makes user scheduling safe on
more than one CPU. The order is:

1. Move syscall entry/exit and per-CPU access to `KernelGsBase`/`swapgs` so APs
   do not use BSP-symbol-relative syscall stack fields. This includes
   non-`sysretq` paths that block or exit through scheduler `iretq` restore.
   **Done for syscall stack fields and syscall-originated restore paths.**
2. Add LAPIC timer and IPI support so each CPU can take local scheduler ticks
   and receive cross-CPU requests.
   **Done for PIT-calibrated BSP LAPIC ticks, parked-AP LAPIC initialization,
   spurious-vector handling, vector 49, a bounded vector-49-only fixed IPI send
   primitive, live TLB shootdown users, and bounded idle-to-runnable reschedule
   requests.**
3. Add TLB shootdown before any user address space can run on more than one CPU
   over its lifetime.
   **Done for user page-table map/unmap/protect through resident CPU masks,
   vector-49 shootdown, pending full-TLB flush generations, completion waits,
   and syscall-entry/flush-before-user-return hooks. Remote AP targets become
   active when AP scheduler ownership records AP residency.**
4. Split scheduler current/run-queue ownership into per-CPU state, with a
   reviewed AP idle-to-runnable handoff.
   **Done for per-CPU current-thread slots, the first AP cpu=1 scheduler owner
   handoff, temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable
   queues, bounded stealing, and bounded idle-to-runnable wake targeting;
   shared scheduler lock reduction, temporary pinning replacement, broader
   workload evidence, and higher-thread-count evidence remain deferred.**
5. Prove the existing manifest/ring/thread/park smokes under `-smp 2`.

With multiple CPUs running scheduler-owned work, shared mutable state needs
careful handling.

### TLB Shootdown

When the kernel modifies page tables that other CPUs may have cached in their
TLBs, it must send an IPI to those CPUs to invalidate the affected entries.

Scenarios requiring shootdown:

- **Process exit** -- unmapping user pages. Only the CPU running the process
  has the mapping cached, but if the process migrated recently, stale TLB
  entries may exist on the old CPU.
- **Shared kernel mappings** -- changes to the kernel half of page tables
  (e.g., heap growth, MMIO mappings) require all-CPU shootdown.
- **Capability-granted shared memory** -- if future stages allow shared
  memory regions between processes, modifications require targeted shootdown.

Current code uses local mapper flushes in `AddressSpace::map`,
`AddressSpace::unmap`, and `AddressSpace::protect`, then calls the serialized
shootdown helper with the address space's resident CPU mask. Those methods are
reached from `VirtualMemoryCap`'s `parse_map`, `parse_unmap`, and
`parse_protect` anonymous mapping paths and
`MemoryObjectCap::{map,unmap,protect}` borrowed mapping paths. Scheduler CR3
handoff marks the selected address space resident on the current CPU, including
AP cpu=1 during the first AP scheduler-owner proof.

Implementation state consists of vector 49, a resident CPU target mask, and
per-CPU pending full-TLB flush generations. The first implementation records
pending flush generations for online resident CPUs other than the caller, after
the local page-table edit and local flush complete, then sends vector-49 IPIs to
prompt immediate drain and returns a completion token. VM capability handlers
enqueue completion work after dropping the address-space guard, and `cap_enter`
or timer polling drains the queue after ring dispatch releases cap-table and
scratch locks. Handlers reserve fixed-size queue slots before page-table
mutation, so overload is reported before rollback, unmap, or protect can mutate
state.

Drains flush the current CPU before waiting, so a CPU that is itself in the
target mask cannot wait on its own pending generation. A target CPU that is
already in a syscall and contending on those same locks can eventually reach
the IPI or return-path drain. If a target CPU has maskable interrupts delayed
while it runs a kernel path, it still drains its pending generation at syscall
entry or before returning to userspace from syscall, timer, or scheduler
restore paths.

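```rust
fn shootdown_page(resident_cpu_mask: u64) -> ShootdownCompletion {
    // Target online resident CPUs other than the caller, which has already
    // flushed locally.
    let targets = resident_cpu_mask & online_cpu_mask() & !current_cpu_bit();
    let generation = next_shootdown_generation();
    // Walk the set bits of the target mask, lowest CPU first.
    let mut remaining = targets;
    while remaining != 0 {
        let cpu_id = remaining.trailing_zeros() as usize;
        remaining &= remaining - 1; // clear the lowest set bit
        // Publish the pending generation before prompting the drain.
        PENDING_FLUSH_GENERATION[cpu_id].store(generation, Ordering::Release);
        lapic::send_fixed_ipi(lapic_id_for_cpu(cpu_id));
    }
    ShootdownCompletion { targets, generation }
}

fn flush_pending_for_current_cpu() {
    // Loop so a generation published while flushing is not missed.
    while pending_generation(current_cpu_id()) != flushed_generation(current_cpu_id()) {
        let generation = pending_generation(current_cpu_id());
        x86_64::instructions::tlb::flush_all();
        FLUSHED_GENERATION[current_cpu_id()].store(generation, Ordering::Release);
    }
}
```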

The first implementation targets the address space's resident CPU mask rather
than every online CPU so parked APs with interrupts disabled are not disturbed.
It relies on kernel user-buffer access continuing through address-space-locked
HHDM copy/read helpers rather than raw user virtual addresses while a delayed
flush generation exists. Broader range and page-level coalescing can be added
after AP scheduling exists.
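
The completion side of the token can be sketched in a host-runnable form. This
is an illustration under stated assumptions, not the kernel's code: `MAX_CPUS`,
`record_flush`, and `is_complete` are hypothetical names, `std` atomics stand in
for the kernel's `core` atomics, and the check assumes generations only
increase.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const MAX_CPUS: usize = 4; // hypothetical bound for this sketch

// Hypothetical per-CPU record of the newest generation each CPU has flushed.
static FLUSHED_GENERATION: [AtomicU64; MAX_CPUS] = [
    AtomicU64::new(0),
    AtomicU64::new(0),
    AtomicU64::new(0),
    AtomicU64::new(0),
];

/// Completion token: which CPUs must flush, and up to which generation.
#[derive(Clone, Copy, Debug)]
pub struct ShootdownCompletion {
    pub targets: u64,
    pub generation: u64,
}

impl ShootdownCompletion {
    /// True once every target CPU has flushed at least `generation`.
    /// Assumes every set bit in `targets` is below `MAX_CPUS`.
    pub fn is_complete(&self) -> bool {
        let mut remaining = self.targets;
        while remaining != 0 {
            let cpu = remaining.trailing_zeros() as usize;
            remaining &= remaining - 1; // clear the lowest set bit
            if FLUSHED_GENERATION[cpu].load(Ordering::Acquire) < self.generation {
                return false;
            }
        }
        true
    }
}

/// Stand-in for the target-side drain: record that `cpu` flushed `generation`.
pub fn record_flush(cpu: usize, generation: u64) {
    FLUSHED_GENERATION[cpu].fetch_max(generation, Ordering::AcqRel);
}
```

A waiter would poll `is_complete()` only after dropping the address-space guard
and the ring-dispatch locks, matching the ordering rule described above. A token
with an empty target mask is complete immediately.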

### LAPIC/IPI Boundary

The normal timer path is now local-APIC-backed: vector 48 handles scheduler
ticks with LAPIC EOI after PIT-channel-2 calibration, vector 49 handles TLB
shootdown and bounded idle-to-runnable reschedule requests, vector 255 handles
LAPIC spurious interrupts without EOI, and vector 32 remains only for the
PIT/PIC fallback. AP scheduler owners program their LAPIC timers from the BSP
calibration before entering the scheduler-owner loop; if AP timer setup is
unavailable, the BSP keeps scheduler ownership. The bounded idle-to-runnable
wake request path is already in place; the remaining LAPIC/IPI work is broader
scheduler-driven AP idle policy, future preemptive reschedule policy, and a
later x2APIC MSR backend once the architectural xAPIC MMIO path is correct.

The TLB shootdown IPI handler must not allocate and must not take locks that can
be held while sending a shootdown. Completion waits must happen after dropping
the mutated address space's lock and ring dispatch's cap-table/scratch locks.
The deferred completion queue must remain bounded, non-allocating at enqueue,
and reserved before page-table mutation.
Syscall-entry and user-return paths must drain pending flush generations so
delayed maskable IPI delivery cannot leave a target CPU unable to observe
completion or resume a thread with stale TLB state.
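
The reserve-before-mutate rule for the deferred completion queue can be
illustrated with a small fixed-capacity queue. This is a sketch, not the
kernel's structure: `CompletionQueue`, `Reservation`, `QueueFull`, and
`QUEUE_SLOTS` are hypothetical names, and the queue is assumed to be accessed
by one owner at a time (for example under an existing lock).

```rust
const QUEUE_SLOTS: usize = 4; // hypothetical capacity

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Completion {
    pub targets: u64,
    pub generation: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub struct QueueFull;

/// Proof that a slot was set aside before page-table mutation began.
pub struct Reservation(());

/// Fixed-capacity queue: nothing is allocated at enqueue time.
pub struct CompletionQueue {
    slots: [Option<Completion>; QUEUE_SLOTS],
    reserved: usize, // slots promised to in-flight handlers
    queued: usize,   // slots holding committed completions
}

impl CompletionQueue {
    pub const fn new() -> Self {
        Self { slots: [None; QUEUE_SLOTS], reserved: 0, queued: 0 }
    }

    /// Call before mutating page tables so overload is reported early.
    pub fn reserve(&mut self) -> Result<Reservation, QueueFull> {
        if self.reserved + self.queued >= QUEUE_SLOTS {
            return Err(QueueFull);
        }
        self.reserved += 1;
        Ok(Reservation(()))
    }

    /// Call after the mutation; cannot fail because the slot was reserved.
    pub fn commit(&mut self, _reservation: Reservation, completion: Completion) {
        self.reserved -= 1;
        let slot = self
            .slots
            .iter_mut()
            .find(|s| s.is_none())
            .expect("a reserved slot is always free");
        *slot = Some(completion);
        self.queued += 1;
    }

    /// Release a reservation when the mutation produced no remote work.
    pub fn cancel(&mut self, _reservation: Reservation) {
        self.reserved -= 1;
    }

    /// Take one committed completion (order is unspecified in this sketch).
    pub fn pop(&mut self) -> Option<Completion> {
        let slot = self.slots.iter_mut().find(|s| s.is_some())?;
        self.queued -= 1;
        slot.take()
    }
}
```

Because `commit` consumes a `Reservation`, a handler cannot enqueue work it did
not reserve for, and a failed `reserve` happens while the page tables are still
untouched.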

KVM paravirtual features such as `kvm-pv-eoi`, `kvm-pv-ipi`, and
`kvm-pv-tlb-flush` are future performance work. They must not be required for
the first LAPIC timer, IPI, or TLB-shootdown correctness proofs.

### Lock Audit

Existing spinlocks need review for SMP safety:

| Lock | Current Use | SMP Concern |
|---|---|---|
| `SERIAL` | COM1 output | Safe but high contention if many CPUs print. Acceptable for debug output. |
| `ALLOCATOR` | Frame bitmap | Hot path. Holding lock during full bitmap scan is O(n). Consider per-CPU free lists. |
| `KERNEL_CAPS` | Kernel cap table | Low contention (init only). Safe. |
| `SCHEDULER.current` | Single global running-thread slot | Split into `PerCpu.current_thread` in Phase A. |

Before APs can run userspace, the scheduler also needs an explicit CPU
residency record for each live thread or address space. That record drives TLB
shootdown targeting and prevents migration from racing page-table changes.
Process exit and thread exit must clear residency before freeing stacks,
address spaces, or ring state that another CPU might still observe.

**Interrupt + spinlock deadlock:** if CPU A holds a spinlock and takes an
interrupt whose handler tries to acquire the same lock, that CPU deadlocks.
This is already noted in `REVIEW.md`. Fix: disable interrupts while holding
locks that interrupt handlers may need (frame allocator, serial). The `spin`
crate's `Mutex` does not mask interrupts by itself, so wrap those locks in an
interrupt-disabling helper (for example
`x86_64::instructions::interrupts::without_interrupts`) or use manual
`cli`/`sti` wrappers.
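
The interrupt-masking pattern can be sketched so that it runs on a host. The
`IrqControl` trait, `MockIrq`, and `with_irqs_disabled` are hypothetical names
introduced for illustration; a kernel build would implement the trait with
`cli`/`sti` and use its own spinlock instead of `std::sync::Mutex`.

```rust
use std::cell::Cell;
use std::sync::Mutex;

/// Hypothetical abstraction over the CPU interrupt flag.
pub trait IrqControl {
    /// Disable interrupts and report whether they were enabled before.
    fn save_and_disable(&self) -> bool;
    /// Re-enable interrupts only if they were enabled before.
    fn restore(&self, was_enabled: bool);
}

/// Run `f` with the lock held and interrupts masked for the whole hold time.
pub fn with_irqs_disabled<C: IrqControl, T, R>(
    irq: &C,
    lock: &Mutex<T>,
    f: impl FnOnce(&mut T) -> R,
) -> R {
    let was_enabled = irq.save_and_disable();
    let result = {
        let mut guard = lock.lock().unwrap();
        f(&mut *guard)
    }; // guard dropped here: the lock is released before interrupts return
    irq.restore(was_enabled);
    result
}

/// Host-side stand-in that tracks the flag in a `Cell`.
pub struct MockIrq {
    pub enabled: Cell<bool>,
}

impl IrqControl for MockIrq {
    fn save_and_disable(&self) -> bool {
        self.enabled.replace(false)
    }
    fn restore(&self, was_enabled: bool) {
        if was_enabled {
            self.enabled.set(true);
        }
    }
}
```

Restoring the saved state, rather than unconditionally enabling interrupts,
keeps the helper safe when it is called from a path that already runs with
interrupts off.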

### Allocator Scaling

The frame allocator is behind a single spinlock with O(n) bitmap scan.
Under SMP, this becomes a contention bottleneck.

Options (in order of complexity):

1. **Per-CPU free list cache** -- each CPU maintains a small cache of free
   frames (e.g., 64 frames). Refill from the global allocator when empty,
   return batch when full. Reduces lock acquisitions by ~64x.
2. **Region partitioning** -- divide physical memory into per-CPU regions.
   Each CPU owns a bitmap partition. Cross-CPU allocation falls back to
   a global lock. More complex, better NUMA behavior (future).

Option 1 is recommended for initial SMP; a rough estimate is 50-100 lines of
code.
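
A minimal sketch of the per-CPU cache from option 1, runnable on a host.
`GlobalFrames` is a hypothetical stand-in for the locked bitmap allocator (a
`Vec` behind `std::sync::Mutex`), and `FrameCache`, `refills`, and the
return-half-when-full policy are assumptions of this sketch rather than
decisions made by the proposal.

```rust
use std::sync::Mutex;

const CACHE_FRAMES: usize = 64; // batch size from the example above

/// Stand-in for the global allocator: a locked pool of free frame numbers.
pub struct GlobalFrames {
    pool: Mutex<Vec<u64>>,
}

impl GlobalFrames {
    pub fn new(frames: impl IntoIterator<Item = u64>) -> Self {
        Self { pool: Mutex::new(frames.into_iter().collect()) }
    }
    pub fn free_count(&self) -> usize {
        self.pool.lock().unwrap().len()
    }
}

/// Per-CPU cache; only its owning CPU touches it, so it needs no lock.
pub struct FrameCache {
    frames: [u64; CACHE_FRAMES],
    len: usize,
    pub refills: usize, // global lock acquisitions caused by `alloc`
}

impl FrameCache {
    pub const fn new() -> Self {
        Self { frames: [0; CACHE_FRAMES], len: 0, refills: 0 }
    }

    pub fn alloc(&mut self, global: &GlobalFrames) -> Option<u64> {
        if self.len == 0 {
            // Refill: one lock acquisition moves up to a full batch.
            let mut pool = global.pool.lock().unwrap();
            self.refills += 1;
            while self.len < CACHE_FRAMES {
                match pool.pop() {
                    Some(frame) => {
                        self.frames[self.len] = frame;
                        self.len += 1;
                    }
                    None => break,
                }
            }
        }
        if self.len == 0 {
            return None; // global pool is exhausted too
        }
        self.len -= 1;
        Some(self.frames[self.len])
    }

    pub fn free(&mut self, global: &GlobalFrames, frame: u64) {
        if self.len == CACHE_FRAMES {
            // Cache full: hand half the batch back under one lock acquisition.
            let mut pool = global.pool.lock().unwrap();
            while self.len > CACHE_FRAMES / 2 {
                self.len -= 1;
                pool.push(self.frames[self.len]);
            }
        }
        self.frames[self.len] = frame;
        self.len += 1;
    }
}
```

Returning only half the cache when it fills leaves room for further frees and
keeps some frames for the next allocations, which avoids bouncing on the global
lock when a CPU alternates between allocating and freeing.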

The heap allocator (`linked_list_allocator`) is also behind a single lock.
For a research OS this is acceptable initially -- heap allocations in the
kernel should be infrequent compared to frame allocations.

---

## Cap'n Proto Schema Additions

SMP introduces a kernel-internal `CpuManager` capability for inspecting and
controlling CPU state. This is not exposed to userspace initially but follows
the "everything is a capability" principle.

```capnp
interface CpuManager {
    # Number of online CPUs.
    cpuCount @0 () -> (count :UInt32);

    # Per-CPU info.
    cpuInfo @1 (cpuId :UInt32) -> (lapicId :UInt32, online :Bool);
}
```

Once exposed, this capability would be held by init (or a system monitor
process) for diagnostics. It is additive and can be deferred until the
mechanism is useful.

---

## Estimated Scope

| Phase | New/Changed Code | Depends On |
|---|---|---|
| Phase A: BSP per-CPU foundation | Done (BSP PerCpu, syscall-stack storage, scheduler mirror, stack-update hook) | Stage 5 |
| Phase B: AP startup | Done (MpRequest, AP records/stacks, AP CR3/RSP handoff, parked idle) | Phase A |
| Phase C: Multi-CPU scheduling | In progress (GS/swapgs migration, LAPIC timer/IPI with EOI, shootdown-aware VM mutation wrappers, pending TLB generation completion, per-CPU current slots, temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and bounded idle-to-runnable wake targeting are implemented; shared scheduler lock reduction, temporary pinning replacement, scheduler-driven AP idle policy, broader workload evidence, and higher-thread-count evidence remain open) | Phase B |
| Ring v2 for full SMP | TBD (per-thread rings, completion routing, SQPOLL ownership) | Phase C plus threading/park |
| **Total** | **TBD after Phase C hardware/scheduler audit** | |

---

## Milestones

- **M1: Per-CPU data on BSP** -- BSP `PerCpu` syscall-stack/current-thread
  state, BSP per-CPU kernel-entry stack hook, and single-CPU QEMU proofs.
  **Done.**
- **M2: APs running** -- secondary CPUs reach `idle_loop()`. BSP prints
  "N CPUs online". `make run` still runs init on BSP.
  **Done.**
- **M3: TLB shootdown** -- page table modifications are safe across CPUs.
  Process exit on one CPU doesn't leave stale mappings on others.
  **Done for address-space resident masks and AP cpu=1 residency marking.**
- **M4: Multi-CPU scheduling** -- processes can run on any CPU. The existing
  boot-manifest service set still works, but the scheduler distributes work
  across CPUs once runnable processes are available (runtime spawning still
  depends on `ProcessSpawner`). Temporary scheduler ownership on CPUs 0-3,
  per-CPU WFQ runnable queues, bounded stealing, and bounded idle-to-runnable
  wake targeting are implemented; shared scheduler lock reduction, temporary
  pinning replacement, scheduler-driven AP idle policy, broader workload
  evidence, and higher-thread-count evidence remain open.
- **M5: Ring v2 completion ownership** -- every live thread can own a ring
  endpoint; endpoint, timer, park, process-wait, and thread-join completions
  route by `ThreadRef`. This is the target for full SMP where sibling threads
  in one process wait independently on different CPUs.

---

## Open Questions

1. **x2APIC backend.** Phase C currently has an xAPIC MMIO LAPIC foundation.
   A later x2APIC MSR backend is still needed for newer/high-core
   systems and firmware states where xAPIC is unavailable or locked out; it
   should not block TLB shootdown on the current implementation path.

2. **Idle strategy.** `hlt` is the simplest idle. `monitor`/`mwait` is more
   power-efficient and can wake a CPU on a write to a monitored address range.
   Overkill for QEMU, but worth noting for future hardware targets.

3. **CPU hotplug.** Limine starts all CPUs at boot. Runtime CPU
   online/offline is a future concern, not needed initially.

4. **NUMA awareness.** Multi-socket systems have non-uniform memory access.
   Per-CPU frame allocator regions could be NUMA-aware. Deferred -- QEMU
   emulates flat memory by default.

5. **Scheduler policy.** The current multi-CPU scheduler uses per-CPU WFQ
   runnable queues ordered by `virtual_finish_ns` under the shared scheduler
   lock, with bounded stealing from sibling queues when a CPU has no local
   runnable entry. Scheduler Evolution Phase D (per-CPU WFQ and bounded
   stealing, closed 2026-05-10) and Phase E (`SchedulingContext` bind/revoke,
   budget, donation/return, depletion notification) are closed against this
   substrate; Phase F has landed the one-SQ-consumer prerequisite, nohz
   telemetry, housekeeping/deferred-work placement, the bounded SQPOLL ring
   mode, the clockevent/deadline substrate, and bounded non-periodic SQPOLL
   producer-wake progress; the first automatic nohz activation increment and
   SQPOLL-driven auto-nohz activation are both closed (see
   [`docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md`](../tasks/done/2026-05-16/scheduler-phase-f-auto-nohz-activation.md)
   and
   [`docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md`](../tasks/done/2026-05-16/scheduler-phase-f-auto-nohz-sqpoll.md)).
   The older round-robin/global-overflow starting point is historical, not
   the current baseline. Future refinements are shared-lock reduction,
   temporary pinning replacement, stronger CPU-affinity/admission policy,
   broader workload-class evidence, higher-thread-count evidence, and the
   Phase F.5 full-SMP 16/32-core scalability proof.
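
The per-CPU WFQ shape described in question 5 can be sketched as follows. This
is a simplified illustration, not the capOS scheduler: `RunQueues`, `Runnable`,
`STEAL_PROBE_LIMIT`, and the next-sibling probe order are hypothetical, and
`&mut self` stands in for the shared scheduler lock.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

const STEAL_PROBE_LIMIT: usize = 2; // hypothetical bound on siblings probed

/// Derived ordering compares `virtual_finish_ns` first, then `thread_id`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Runnable {
    pub virtual_finish_ns: u64,
    pub thread_id: u32,
}

/// One min-ordered queue per CPU; `Reverse` turns the max-heap into a min-heap.
pub struct RunQueues {
    per_cpu: Vec<BinaryHeap<Reverse<Runnable>>>,
}

impl RunQueues {
    pub fn new(cpus: usize) -> Self {
        Self { per_cpu: (0..cpus).map(|_| BinaryHeap::new()).collect() }
    }

    pub fn enqueue(&mut self, cpu: usize, runnable: Runnable) {
        self.per_cpu[cpu].push(Reverse(runnable));
    }

    /// Prefer the local entry with the smallest virtual finish time; if the
    /// local queue is empty, probe a bounded number of sibling queues.
    pub fn pick_next(&mut self, cpu: usize) -> Option<Runnable> {
        if let Some(Reverse(r)) = self.per_cpu[cpu].pop() {
            return Some(r);
        }
        let n = self.per_cpu.len();
        for step in 1..=STEAL_PROBE_LIMIT.min(n.saturating_sub(1)) {
            let victim = (cpu + step) % n;
            if let Some(Reverse(r)) = self.per_cpu[victim].pop() {
                return Some(r);
            }
        }
        None
    }
}
```

Bounding the probe count keeps the time spent under the lock predictable, at
the cost of sometimes leaving a CPU idle while a distant queue still has work.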

---

## References

### Specifications

- [Intel SDM Vol. 3, Chapter 8](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html)
  -- Multiple-Processor Management (AP startup, APIC, IPI)
- [Intel SDM Vol. 3, Chapter 10](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html)
  -- APIC (Local APIC, I/O APIC, x2APIC)
- [xAPIC Deprecation Plan](https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/xapic-deprecation-plan.html)
  -- Intel guidance on x2APIC defaults, legacy xAPIC deprecation, and guest
  virtualization
- [CPUID Enumeration and Architectural MSRs](https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/cpuid-enumeration-and-architectural-msrs.html)
  -- x2APIC MSR range and xAPIC disable/lock behavior
- [OSDev Wiki: SMP](https://wiki.osdev.org/Symmetric_Multiprocessing)
- [OSDev Wiki: APIC](https://wiki.osdev.org/APIC)

### Limine

- [Limine SMP Feature](https://github.com/limine-bootloader/limine/blob/trunk/PROTOCOL.md)
  -- MP request/response API, AP startup mechanism

### Virtualization

- [QEMU / KVM CPU model configuration](https://www.qemu.org/docs/master/system/qemu-cpu-models.html)
  -- CPU feature exposure, host passthrough, and named-model configuration
- [QEMU Paravirtualized KVM features](https://www.qemu.org/docs/master/system/i386/kvm-pv.html)
  -- optional KVM PV EOI, IPI, TLB-flush, and extended destination-id features
- [Linux KVM API](https://www.kernel.org/doc/html/latest/virt/kvm/api.html)
  -- VMM-side LAPIC/x2APIC state handling

### Prior Art

- [Redox SMP](https://gitlab.redox-os.org/redox-os/kernel) -- per-CPU
  contexts, LAPIC timer, IPI-based TLB shootdown
- [xv6-riscv SMP](https://github.com/mit-pdos/xv6-riscv) -- minimal
  multi-core OS, clean per-CPU implementation
- [Hermit SMP](https://github.com/hermit-os/kernel) -- Rust unikernel
  with SMP support via per-core data and APIC
- [BlogOS](https://os.phil-opp.com/) -- educational x86_64 Rust OS
  (single-CPU, but good APIC coverage)
