Proposal: Symmetric Multi-Processing (SMP)
How capOS goes from single-CPU execution to utilizing all available processors.
Grounding and Cross-Links
The SMP substrate is one half of capOS’s multicore story; scheduler policy above it is the other half, and they advance through coupled gates. Read this proposal together with:
- Scheduler Evolution – Phase D (per-CPU WFQ, bounded stealing) and Phase E (SchedulingContext bind/revoke, budget, donation/return, depletion notification) are closed; Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work placement, bounded SQPOLL ring mode, the clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake progress, the first automatic nohz activation increment closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md, and SQPOLL-driven auto-nohz activation closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md; timeout-based auto-revoke and ordinary-thread generic full-nohz admission are also landed; generic SQPOLL nohz for arbitrary rings and policy-service AutoNoHz issuance remain future work; Phase F.5 (full-SMP 16/32-core scalability planning) is the named gate for the milestone described below in Full-SMP Scalability Milestone and remains planning, not closed.
- In-Process Threading Contract – thread-owned execution state, generation-checked ThreadRef queues and wake records, per-thread ring mappings, and the recorded same-process 1-to-2 / diagnostic 1-to-4 evidence rows that this proposal’s scalability work must keep honoring.
- Design Risks Register, Q9 – CPU accounting and scheduling contexts – partial-status answer that covers per-CPU WFQ, Phase E SchedulingContext, and the cross-service donation / nohz activation / isolation lease / cross-principal fairness work still open.
- Ring v2 For Full SMP – per-thread ring endpoints and cap_enter-on-thread-CQ are the dispatch contract this proposal’s scheduler-ownership milestones rely on.
- SMP Phase C backlog – decomposed task list for the in-progress Phase C work tracked below.
The migrated task
kernel-upper-half-pml4-propagation-hardening
carries the Phase C residual for kernel upper-half page-table mutation after AP
startup. The retained finding is closed for the current kernel
MMIO/firmware helper path: paging::init() pre-seeds the helper’s upper-half
PML4 slot, AddressSpace::new_user clones upper-half entries from the
synchronized kernel root under the kernel page-table lock, and
map_kernel_physical_range rejects any attempt to create a previously absent
kernel-half PML4 slot after a user address space has been created. User-side
AddressSpace::{map,unmap,protect} remains shootdown-aware against resident
CPU masks; kernel upper-half edits inside pre-existing slots use the
kernel-wide shootdown path. Future helper windows or allocator-growth paths
that would require a new upper-half PML4 slot must pre-seed that slot before
user address-space creation or add synchronized active propagation into live
address spaces.
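The pre-seeding rule above can be modeled on the host. The following is a minimal sketch, not the kernel's code: `KernelRoot`, `MapError`, and `map_kernel_range` are hypothetical stand-ins for the real `paging` and `map_kernel_physical_range` logic, and the only architectural facts it relies on are the 512-entry PML4 and the index taken from virtual-address bits 47..39.

```rust
/// Number of entries in an x86_64 PML4 table.
const PML4_ENTRIES: usize = 512;
/// First PML4 index that belongs to the upper (kernel) half.
const KERNEL_HALF_START: usize = 256;

/// PML4 index selected by bits 47..39 of a virtual address.
fn pml4_index(vaddr: u64) -> usize {
    ((vaddr >> 39) & 0x1ff) as usize
}

#[derive(Debug, PartialEq, Eq)]
enum MapError {
    /// The address does not fall in the kernel half.
    NotKernelHalf,
    /// The mapping would create a kernel-half slot that live user
    /// address spaces never cloned.
    NewSlotAfterUserSpace { slot: usize },
}

/// Hypothetical model of the synchronized kernel root table.
struct KernelRoot {
    present: [bool; PML4_ENTRIES],
    user_space_created: bool,
}

impl KernelRoot {
    fn new() -> Self {
        Self { present: [false; PML4_ENTRIES], user_space_created: false }
    }

    /// Model of cloning upper-half entries into a new user root.
    fn new_user_space(&mut self) -> [bool; PML4_ENTRIES] {
        self.user_space_created = true;
        let mut user = [false; PML4_ENTRIES];
        user[KERNEL_HALF_START..].copy_from_slice(&self.present[KERNEL_HALF_START..]);
        user
    }

    /// Map inside the kernel half. Creating an absent slot is only
    /// allowed before the first user address space exists.
    fn map_kernel_range(&mut self, vaddr: u64) -> Result<usize, MapError> {
        let slot = pml4_index(vaddr);
        if slot < KERNEL_HALF_START {
            return Err(MapError::NotKernelHalf);
        }
        if !self.present[slot] {
            if self.user_space_created {
                return Err(MapError::NewSlotAfterUserSpace { slot });
            }
            self.present[slot] = true; // pre-seed during early init
        }
        Ok(slot)
    }
}
```

In this model, edits inside an already-present slot keep succeeding after user address spaces exist, because every cloned root shares the same lower-level tables; only the creation of a new top-level slot is refused.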
This document has three phases: a per-CPU foundation (prerequisite plumbing), AP startup (bringing secondary CPUs online), and SMP correctness (making shared state safe under concurrency).
Current status: Phase A’s BSP per-CPU foundation and Phase B AP startup are complete. Phase C has completed syscall GS migration, LAPIC/IPI, TLB shootdown, the first AP scheduler-owner handoff, temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues under the shared scheduler lock, bounded stealing, and bounded idle-to-runnable wake targeting for queued and direct-IPC wakeups. The current scheduler is no longer the temporary single-global-runnable-queue shape from the 2026-05-02 collapse. Remaining SMP risks are the shared scheduler lock, temporary pinning replacement, scheduler-driven AP idle policy, broader workload classes, and higher-thread-count evidence. The next SMP product-level milestone should be full-SMP scalability evidence on a real 16/32-core environment, with QEMU kept for boot and regression coverage rather than as the primary performance source.
Implementation checkpoint: the BSP now has a concrete PerCpu object with
stable syscall-stack offsets, and syscall entry uses KernelGsBase/swapgs
to reach the per-CPU kernel RSP and saved user RSP slots. The scheduler mirrors
its current ThreadRef into the BSP record.
Second checkpoint: runtime stack switches now flow through
percpu::set_kernel_entry_stack, which updates the BSP PerCpu.kernel_rsp
slot and the BSP TSS.RSP0 together. Scheduler and interrupt paths no longer
coordinate those two updates by calling separate GDT and syscall helpers.
Third checkpoint: kernel/src/arch/x86_64/smp.rs now issues the Limine
MpRequest, enumerates non-BSP CPUs, allocates AP-local PerCpu records and
kernel/IST stack storage, and records dense capOS CPU ids separately from Limine
processor and LAPIC ids.
Fourth checkpoint: APs now start through MpInfo::bootstrap() and reach a
parked kernel idle loop. The BSP passes an AP record pointer through Limine
extra_argument, waits for a bounded online count, and remains the only CPU
that schedules userspace. Each AP loads AP-owned GDT/TSS state, the shared IDT,
KernelGsBase, and syscall MSRs, reports online, disables interrupts, and
parks in hlt. Review tightened this checkpoint so APs first switch from
Limine handoff state to the capOS kernel PML4 and AP-owned kernel stack before
any online signal.
Fifth checkpoint: syscall entry/exit now runs with kernel GS active between
entry and return. Normal returns swap back before sysretq, and blocking or
exiting syscall paths that leave through scheduler iretq restore use a
dedicated trampoline to swap GS back before restoring the next user context.
Sixth checkpoint: the BSP now enables xAPIC MMIO, maps the LAPIC page through the kernel MMIO allocator, calibrates the LAPIC timer initial count against PIT channel 2, runs scheduler ticks through LAPIC timer vector 48 with LAPIC EOI, installs the LAPIC spurious vector, and masks the legacy PIC once LAPIC ticks are active. Parked APs initialize local APIC state before reporting online. IDT vector 49 and a bounded vector-49-only fixed IPI send primitive back TLB shootdown and bounded idle-to-runnable reschedule requests.
Seventh checkpoint: user page-table map, unmap, and protect now flush the
local CPU and then route through a serialized vector-49 TLB shootdown helper
using each AddressSpace’s resident CPU mask. The helper records pending
full-TLB flush generations and sends vector-49 IPIs to online resident CPUs
other than the caller, then returns a completion token that callers wait on after
dropping ring dispatch locks. Scheduler CR3 handoff points mark the selected
address space resident on the current CPU.
Eighth checkpoint: scheduler current-thread state is split into per-CPU slots,
AP PerCpu records are registered for current-thread and kernel-entry stack
updates, AP TSS.RSP0 is updated during context switches, and AP cpu=1 can enter
the scheduler from the AP idle loop when its LAPIC timer is available. The
first AP proof intentionally keeps one scheduler owner: when AP cpu=1 is online
with a programmed timer, the BSP remains in kernel idle so the process-wide
capability ring is not executed concurrently. The scheduler idle path is now a
per-CPU CPL0 (kernel-mode) idle thread; the user-mode idle process was removed
in commit e3c0df01 (2026-05-14 UTC). “Kernel idle” throughout this proposal
refers to that per-CPU CPL0 idle thread, not a user-mode idle process.
Depends on: Stage 5 (Scheduling) – needs a working timer, context switch, and run queue on the BSP before adding more CPUs.
Phase B completion: AP startup is implemented and reviewed. The private
process-buffer validate_user_buffer
TOCTOU blocker is closed for single locked copy/read paths, and Phase A now
has the BSP running through concrete per-CPU syscall-stack/current-thread
state. TLB shootdown, the first AP scheduler-owner handoff, temporary scheduler
ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and
bounded idle-to-runnable wake targeting are implemented; shared scheduler lock
contention, temporary pinning replacement, scheduler-driven AP idle policy,
broader workload classes, higher-thread-count evidence, and shared
SharedParkSpace park key derivation remain later Stage 7 work. Shared
keys still need MemoryObject mapping provenance or object pins before they can
keep backing stable beyond one address-space-locked access.
Full-SMP Scalability Milestone
The current SMP evidence reaches four physical-core workers and one eight-logical-CPU SMT run under QEMU/KVM. That was enough to expose scheduler structure problems, but it is not the shape that should define whether capOS really uses modern multicore machines. The next SMP milestone should answer a more concrete question: can ordinary capOS workloads keep useful throughput and bounded scheduler overhead as the machine scales to 16 and 32 physical cores?
Preferred evidence environment:
- direct capOS boot on a dedicated bare-metal or cloud bare-metal/perf-runner machine with at least 16 physical cores, and a 32-core row when hardware is available;
- recorded CPU topology, SMT state, APIC mode, timer source, frequency policy, memory size, firmware/device model, source commit, toolchain, and kernel configuration;
- Linux native baselines on the same machine for comparable CPU workloads;
- QEMU/KVM rows only for boot/regression continuity or for explicitly labeled virtualized comparisons.
Workload coverage should move beyond one fixed checksum row:
- static map/reduce checksum over equal byte ranges;
- uneven dynamic task pool with deterministic task ids and result hash;
- barrier-heavy phase loop that exposes wakeup and cross-CPU coordination cost;
- same-process thread workload and independent-process workload;
- IPC/service-bound worker workload that includes capability calls outside the timed compute loop.
Each workload should report 1, 2, 4, 8, 16, and 32-worker rows when the hardware supports those counts, with SMT rows separated from physical-core rows. Each row should include both work-window time and total time, run count, warmup policy, median, variance, and verifier output. The report should show speedup and efficiency curves instead of reducing the result to one boolean threshold.
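The speedup and efficiency curves can be derived from the per-row medians. The sketch below assumes a hypothetical `Row` shape (worker count plus raw nanosecond samples) and uses the conventional definitions: speedup is the 1-worker median divided by the N-worker median, and efficiency is speedup divided by N. The benchmark proposal, not this sketch, defines the real artifact shape.

```rust
/// One benchmark row: worker count and raw timing samples (hypothetical shape).
struct Row {
    workers: u32,
    samples_ns: Vec<u64>,
}

/// Derived scaling figures for one row.
struct Scaling {
    workers: u32,
    median_ns: f64,
    speedup: f64,
    efficiency: f64,
}

/// Median of the samples; `None` when there are no samples.
fn median_ns(samples: &[u64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid] as f64
    } else {
        (sorted[mid - 1] as f64 + sorted[mid] as f64) / 2.0
    })
}

/// Speedup and efficiency relative to the 1-worker row.
/// Returns `None` if the baseline row is missing or any row is unusable.
fn scaling_curve(rows: &[Row]) -> Option<Vec<Scaling>> {
    let base = rows.iter().find(|r| r.workers == 1)?;
    let base_median = median_ns(&base.samples_ns)?;
    rows.iter()
        .map(|r| {
            let median = median_ns(&r.samples_ns)?;
            if median == 0.0 || r.workers == 0 {
                return None;
            }
            let speedup = base_median / median;
            Some(Scaling {
                workers: r.workers,
                median_ns: median,
                speedup,
                efficiency: speedup / r.workers as f64,
            })
        })
        .collect()
}
```

For example, a 1-worker median of 1000 ns and a 4-worker median of 400 ns give a speedup of 2.5 and an efficiency of 0.625. Physical-core and SMT rows would be passed as separate row sets so their curves are not mixed.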
Implementation work expected before this milestone:
- replace the temporary scheduler CPU mask and static four-owner assumptions with discovered CPU topology and dynamic per-CPU scheduler structures;
- decide xAPIC versus x2APIC backend selection for larger APIC-id spaces;
- split or otherwise shrink the shared scheduler-lock critical sections that still serialize queue selection, wakeups, blocking, and cleanup;
- make placement topology-aware enough to distinguish physical cores, SMT siblings, and later NUMA/cache groups;
- keep TLB shootdown, timer, reschedule-IPI, cleanup, and accounting costs observable per CPU and per workload phase;
- keep per-thread ring ownership and SQ-consumer ownership generation-checked as CPU count rises.
This milestone belongs with scheduler evolution and benchmark planning rather than a new standalone proposal: the SMP proposal defines the CPU substrate, Scheduler Evolution Phase F.5 defines dispatch and policy work for full-SMP 16/32-core scalability, the benchmark proposal defines artifact shape, and the HPC parallel-pattern proposal defines the workload matrix. Q9 in the design risks register is the matching open-question entry: base CPU accounting and scheduling-context authority through Phase E are implemented, while cross-service donation, full nohz activation, CPU isolation leases, and cross-principal fairness are the named follow-ons that this milestone’s evidence will be evaluated against.
Current State
APs can boot into kernel idle loops, and CPUs 0-3 can temporarily own scheduler/user work when their LAPIC timers are available. Specific assumptions that Phase C must still remove:
| Component | File | Assumption |
|---|---|---|
| Syscall stack switching | kernel/src/arch/x86_64/syscall.rs, kernel/src/arch/x86_64/percpu.rs | Syscall entry/exit uses KernelGsBase/swapgs and GS-relative PerCpu stack fields on the running CPU |
| AP GDT, TSS, kernel stacks | kernel/src/arch/x86_64/gdt.rs, kernel/src/arch/x86_64/smp.rs | AP-local descriptor tables and stacks exist, and AP TSS.RSP0 updates during AP scheduler context switches |
| IDT | kernel/src/arch/x86_64/idt.rs | Single static IDT (shareable – IDT can be the same across CPUs) |
| SYSCALL MSRs | kernel/src/arch/x86_64/syscall.rs, kernel/src/arch/x86_64/smp.rs | STAR/LSTAR/SFMASK/EFER are initialized on BSP and parked APs; BSP and AP startup both publish KernelGsBase |
| Current thread and run queues | kernel/src/sched.rs, kernel/src/arch/x86_64/percpu.rs | SCHEDULER owns per-CPU current slots, per-CPU WFQ runnable queues ordered by virtual_finish_ns, bounded stealing from sibling queues, and wake placement through WakePolicy::QueueCpu; queued and direct-IPC wakeups iterate eligible idle scheduler CPUs and wake the first that accepts a fresh reschedule IPI, and CPUs 0-3 can temporarily own scheduler/user execution when their LAPIC timers are available, while shared-lock reduction, temporary pinning replacement, broader workload evidence, and higher-thread-count evidence remain deferred |
| Timer/IPI delivery | kernel/src/arch/x86_64/context.rs, kernel/src/arch/x86_64/lapic.rs, kernel/src/arch/x86_64/pic.rs, kernel/src/arch/x86_64/pit.rs, kernel/src/arch/x86_64/tlb.rs | CPUs 0-3 use PIT-calibrated LAPIC timer vector 48 with LAPIC EOI when online; vector 49 services TLB shootdown and bounded reschedule requests |
| Frame allocator | kernel/src/mem/frame.rs | Single global ALLOCATOR behind one spinlock |
| Heap allocator | kernel/src/mem/heap.rs | linked_list_allocator behind one spinlock |
The first checkpoint removed the separate syscall RSP globals and made the BSP
PerCpu layout the owner of syscall stack state. The GS checkpoint now uses
KernelGsBase/swapgs for those offsets on syscall paths. The LAPIC checkpoint
removed the PIT/PIC interrupt dependency from the normal BSP scheduler tick,
kept PIT channel 2 as the LAPIC calibration source, installed the spurious
vector, and wired the IPI vector. The TLB checkpoint added resident CPU masks,
vector-49 shootdown, pending generation counters, completion waits, and
syscall-entry plus flush-before-user-return hooks for delayed maskable interrupt
delivery. The AP scheduler-owner checkpoint added per-CPU current slots and AP
cpu=1 scheduler entry. The remaining Phase C assumptions are in concurrent
run-queue ownership and reschedule routing, not in syscall stack lookup, the
primary timer source, user page-table mutation invalidation, or AP TSS updates.
Phase A: Per-CPU Foundation
Establish per-CPU data structures on the BSP. No APs are started yet – this phase makes the BSP’s own code SMP-ready so Phase B is a clean addition.
Per-CPU Data Region
Each CPU needs a private data area accessible via the GS segment base. On
x86_64, swapgs switches between user-mode GS (usually zero) and
kernel-mode GS (pointing to per-CPU data). The kernel sets KernelGSBase
MSR on each CPU during init.
The BSP checkpoint originally reached this layout as BSP_PER_CPU+offset from
assembly. Phase C now uses the same offsets through GS after swapgs on
syscall entry.
    /// Per-CPU data, one instance per processor.
    /// Accessed via GS-relative addressing after swapgs.
    #[repr(C)]
    struct PerCpu {
        /// Self-pointer for accessing the struct from GS:0.
        self_ptr: *const PerCpu,
        /// Kernel stack pointer for syscall entry (replaces SYSCALL_KERNEL_RSP).
        kernel_rsp: u64,
        /// Saved user RSP during syscall (replaces SYSCALL_USER_RSP).
        user_rsp: u64,
        /// Currently running thread on this CPU, if one is active.
        current_thread: Option<ThreadRef>,
        /// CPU index (0 = BSP).
        cpu_id: u32,
        /// LAPIC ID (from Limine MP info or CPUID).
        lapic_id: u32,
    }
The previous checkpointed syscall entry stub used the same offsets via the BSP symbol:
    movq %rsp, BSP_PER_CPU+16(%rip)   # PerCpu.user_rsp
    movq BSP_PER_CPU+8(%rip), %rsp    # PerCpu.kernel_rsp
The current syscall entry stub uses GS-relative addressing:
    swapgs
    movq %rsp, %gs:16    # PerCpu.user_rsp
    movq %gs:8, %rsp     # PerCpu.kernel_rsp
And symmetrically on return:
    movq %gs:16, %rsp    # restore user RSP
    swapgs
    sysretq
Non-returning syscall paths need separate handling: exit, a blocking
cap_enter, and a terminal ThreadControl.exitThread can leave the syscall
entry path by building a CpuContext and restoring another thread with
iretq. Those paths must restore user GS ownership before iretq, even though
they never execute the normal sysretq epilogue.
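The GS ownership invariant for these paths can be expressed as a small state model. This is a host-side sketch only: `CpuGs`, `GsBase`, and the method names are hypothetical, and the model tracks nothing but which GS base is active when control returns to user mode.

```rust
/// Which GS base is currently active on a CPU (model only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum GsBase {
    User,
    Kernel,
}

/// Hypothetical per-CPU model of GS ownership across syscall paths.
struct CpuGs {
    active: GsBase,
}

impl CpuGs {
    /// A CPU that is currently executing user code.
    fn in_user() -> Self {
        Self { active: GsBase::User }
    }

    /// `swapgs` exchanges the active GS base with KernelGsBase.
    fn swapgs(&mut self) {
        self.active = match self.active {
            GsBase::User => GsBase::Kernel,
            GsBase::Kernel => GsBase::User,
        };
    }

    /// Syscall entry stub: switch to kernel GS first.
    fn syscall_entry(&mut self) {
        self.swapgs();
    }

    /// Normal epilogue: swap back, then `sysretq`.
    fn sysret_exit(&mut self) {
        self.swapgs();
    }

    /// A raw `iretq` to user mode is only valid with user GS active.
    fn iretq_to_user(&self) -> Result<(), &'static str> {
        if self.active == GsBase::User {
            Ok(())
        } else {
            Err("iretq to user mode with kernel GS still active")
        }
    }

    /// Trampoline for blocking or exiting syscalls: swap GS back,
    /// then restore the next user context through `iretq`.
    fn restore_trampoline(&mut self) -> Result<(), &'static str> {
        self.swapgs();
        self.iretq_to_user()
    }
}
```

The model makes the failure mode explicit: a syscall-originated restore that skips the trampoline would return to userspace with the kernel GS base still active.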
Lock And Ownership Rules
PerCpu fields split by owner:
- kernel_rsp and TSS.RSP0 are updated together through percpu::set_kernel_entry_stack.
- user_rsp is written only by syscall entry assembly and read only while constructing a blocked-syscall CpuContext.
- current_thread mirrors Scheduler.current; the scheduler lock remains the authority for choosing and validating the current thread.
- cpu_id and lapic_id are immutable after CPU initialization.
Phase A keeps the global scheduler lock and process table. The PerCpu
current field is not a second scheduler authority; it is the per-CPU execution
cache that Phase B will use when multiple CPUs stop sharing one current
slot.
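The rule that kernel_rsp and TSS.RSP0 change only together can be sketched as a type with a single mutation path. The types below are hypothetical host-side stand-ins for the real `PerCpu` and TSS, and the null and 16-byte alignment checks are assumptions added for illustration, not documented behavior of `percpu::set_kernel_entry_stack`.

```rust
#[derive(Debug, PartialEq, Eq)]
enum StackError {
    Null,
    Misaligned,
}

/// Stand-in for the two slots that must always agree.
struct CpuEntryState {
    /// Models PerCpu.kernel_rsp (read by the syscall entry stub).
    kernel_rsp: u64,
    /// Models TSS.RSP0 (used by the CPU on interrupts from Ring 3).
    tss_rsp0: u64,
}

impl CpuEntryState {
    fn new(stack_top: u64) -> Self {
        Self { kernel_rsp: stack_top, tss_rsp0: stack_top }
    }

    /// The only path that changes either slot, so they cannot diverge.
    fn set_kernel_entry_stack(&mut self, stack_top: u64) -> Result<(), StackError> {
        if stack_top == 0 {
            return Err(StackError::Null);
        }
        if stack_top % 16 != 0 {
            // Assumption: stack tops are kept 16-byte aligned.
            return Err(StackError::Misaligned);
        }
        self.kernel_rsp = stack_top;
        self.tss_rsp0 = stack_top;
        Ok(())
    }

    fn kernel_rsp(&self) -> u64 {
        self.kernel_rsp
    }

    fn tss_rsp0(&self) -> u64 {
        self.tss_rsp0
    }
}
```

A rejected update leaves both slots at their previous value, so syscall entry and interrupt entry never observe different kernel stacks for the same thread.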
Per-CPU GDT, TSS, and Stacks
Each CPU needs its own:
- GDT – the TSS descriptor encodes the linear base address of the CPU’s TSS, so each CPU needs a GDT with its own TSS entry. The segment layout (kernel CS/DS, user CS/DS) is identical across CPUs.
- TSS – privilege_stack_table[0] (kernel stack for interrupts from Ring 3) and IST entries (double-fault stack) must be per-CPU.
- Kernel stack – each CPU needs its own stack for syscall/interrupt handling. Current size: 16 KB (4 pages). Same size per CPU.
- Double-fault stack – each CPU needs its own IST stack. Current size: 20 KB (5 pages).
    /// Allocate and initialize per-CPU structures for one CPU.
    fn init_per_cpu(cpu_id: u32, lapic_id: u32) -> &'static PerCpu {
        // Allocate kernel stack (4 pages) and double-fault stack (5 pages)
        let kernel_stack = alloc_stack(4);
        let df_stack = alloc_stack(5);
        // Create TSS with per-CPU stacks. The TSS is leaked so it outlives
        // this function; the GDT's TSS descriptor keeps pointing at it.
        let tss = Box::leak(Box::new(TaskStateSegment::new()));
        tss.privilege_stack_table[0] = kernel_stack.top();
        tss.interrupt_stack_table[DOUBLE_FAULT_IST_INDEX] = df_stack.top();
        // Create GDT with this CPU's TSS. The GDT and selectors also need
        // CPU-lifetime storage before they are loaded on the target CPU.
        let (gdt, selectors) = create_gdt(&*tss);
        // Allocate and populate PerCpu struct
        let per_cpu = Box::leak(Box::new(PerCpu {
            self_ptr: core::ptr::null(), // filled below
            kernel_rsp: kernel_stack.top().as_u64(),
            user_rsp: 0,
            current_thread: None,
            cpu_id,
            lapic_id,
        }));
        per_cpu.self_ptr = per_cpu as *const PerCpu;
        per_cpu
    }
LAPIC Initialization
Stage 5 uses the 8254 PIT (100 Hz) and 8259A PIC (IRQ0 → vector 32) for preemption on the BSP. AP startup must initialize enough local-APIC state for secondary CPUs to park in a kernel idle loop and for later IPIs. Migrating BSP preemption from PIT to LAPIC timer is still required before multi-CPU scheduling, since the PIT is a single shared device that cannot provide per-CPU timer interrupts. LAPIC work is needed for:
- Per-CPU timer – replace PIT with LAPIC timer (required for SMP)
- IPI – inter-processor interrupts for TLB shootdown and AP startup
- Spurious interrupt vector – must be configured per-CPU
2026-04-25 research decision: the immediate Phase C LAPIC/IPI foundation uses xAPIC MMIO, LAPIC timer vector 48, IPI vector 49, LAPIC EOI, AP LAPIC initialization, and PIT/PIC fallback. The grounding note x2APIC and APIC virtualization records the checked Intel and QEMU/KVM sources and keeps x2APIC as a later backend rather than a reason to rework the current LAPIC gate.
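The PIT-channel-2 calibration of the LAPIC timer initial count reduces to a short calculation. The sketch below assumes the measurement is done by letting the LAPIC timer count down while PIT channel 2 counts a known number of PIT ticks; the function names and the tick rate parameter are illustrative assumptions. The PIT input clock of 1,193,182 Hz is the standard 8254 frequency.

```rust
/// Input clock of the 8254 PIT in Hz.
const PIT_INPUT_HZ: u64 = 1_193_182;

/// LAPIC timer ticks per second, derived from how far the LAPIC counter
/// moved while PIT channel 2 counted `pit_count` ticks.
fn lapic_ticks_per_second(pit_count: u16, lapic_elapsed: u32) -> Option<u64> {
    if pit_count == 0 {
        return None;
    }
    // window_seconds = pit_count / PIT_INPUT_HZ, so
    // ticks_per_second = lapic_elapsed / window_seconds.
    Some(lapic_elapsed as u64 * PIT_INPUT_HZ / pit_count as u64)
}

/// Initial-count value for a periodic LAPIC timer at `tick_hz`,
/// using the same divide configuration as the measurement.
fn lapic_initial_count(pit_count: u16, lapic_elapsed: u32, tick_hz: u32) -> Option<u32> {
    if tick_hz == 0 {
        return None;
    }
    let per_tick = lapic_ticks_per_second(pit_count, lapic_elapsed)? / tick_hz as u64;
    if per_tick == 0 {
        return None; // timer too slow for the requested rate
    }
    u32::try_from(per_tick).ok()
}
```

With a PIT window of 11,932 ticks (about 10 ms) and 1,000,000 elapsed LAPIC ticks, the sketch yields 99,998,491 ticks per second and an initial count of 999,984 for a 100 Hz tick. AP scheduler owners can reuse the BSP result only under the assumption that all LAPIC timers run from the same clock.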
Crate Dependencies
| Crate | Purpose | no_std |
|---|---|---|
| manual xAPIC MMIO backend | current LAPIC timer, EOI, IPI, spurious vector foundation | yes |
| future manual x2APIC MSR backend using x86_64 MSR access | newer/high-core systems and firmware states where xAPIC is unavailable or undesirable | yes |
The current LAPIC path uses xAPIC MMIO through the kernel MMIO mapper. The
later x2APIC backend should still be small and explicit rather than adding an
APIC abstraction crate: read the APIC ID, enable x2APIC through
IA32_APIC_BASE, program the spurious-vector register, local-vector timer,
timer divide/initial-count registers, EOI, and ICR sends through MSRs. I/O APIC
remains separate MMIO hardware discovered through ACPI MADT and belongs to the
later interrupt-infrastructure/cloud path.
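The register list above maps onto x2APIC MSRs by a fixed rule: the MSR address is 0x800 plus the xAPIC MMIO offset divided by 16. The sketch below shows that mapping and the two IA32_APIC_BASE enable bits. `ApicReg`, `x2apic_msr`, and `with_x2apic_enabled` are hypothetical names, and the actual `rdmsr`/`wrmsr` calls are left out so the block stays host-runnable.

```rust
/// IA32_APIC_BASE MSR address.
const IA32_APIC_BASE: u32 = 0x1B;
/// IA32_APIC_BASE bit 10: x2APIC mode enable.
const APIC_BASE_X2APIC_ENABLE: u64 = 1 << 10;
/// IA32_APIC_BASE bit 11: APIC global enable.
const APIC_BASE_GLOBAL_ENABLE: u64 = 1 << 11;
/// First MSR of the x2APIC register space.
const X2APIC_MSR_BASE: u32 = 0x800;

/// xAPIC MMIO offsets of the registers the current backend uses.
#[derive(Clone, Copy)]
enum ApicReg {
    Id = 0x020,
    Eoi = 0x0B0,
    SpuriousVector = 0x0F0,
    /// In x2APIC mode the ICR is one 64-bit MSR at this index.
    Icr = 0x300,
    LvtTimer = 0x320,
    TimerInitialCount = 0x380,
    TimerDivide = 0x3E0,
}

/// x2APIC MSR address for a register: 0x800 + (MMIO offset >> 4).
fn x2apic_msr(reg: ApicReg) -> u32 {
    X2APIC_MSR_BASE + ((reg as u32) >> 4)
}

/// Value to write back to IA32_APIC_BASE to enter x2APIC mode.
fn with_x2apic_enabled(apic_base: u64) -> u64 {
    apic_base | APIC_BASE_GLOBAL_ENABLE | APIC_BASE_X2APIC_ENABLE
}
```

Keeping the xAPIC offsets as the single source of truth would let an MMIO backend and an MSR backend share one register enum, with only the access primitive differing.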
Migration Path
Phase A was a refactor of existing single-CPU code, not an addition:
- Add PerCpu struct, allocate one instance for BSP. Done for BSP static storage.
- Set BSP’s KernelGSBase MSR, add swapgs to syscall entry/exit. Done for syscall entry/exit, including syscall-to-iretq exits.
- Replace SYSCALL_KERNEL_RSP/SYSCALL_USER_RSP globals with per-CPU accesses. Done; syscall assembly uses GS-relative PerCpu offsets.
- Replace scheduler’s global SCHEDULER.current with PerCpu.current_thread. Partially done: the BSP per-CPU record mirrors Scheduler.current; the scheduler lock remains authoritative for current-thread and queue ownership until shared scheduler metadata is split further.
- Move GDT/TSS stack updates behind the per-CPU path. Done for the BSP runtime stack-update hook; AP-local GDT/TSS allocation belongs to Phase B.
- Migrate BSP from PIT to LAPIC timer (PIT initialized in Stage 5). Done for the BSP timer path, with PIT used for calibration and PIT/PIC retained as a fallback.
After Phase A, the kernel still runs user work on one CPU but the BSP per-CPU
plumbing is in place. Existing tests (make run-smoke and make run-spawn)
continue to pass.
Phase B: AP Startup
Bring Application Processors (APs) online. Each AP runs the same kernel code with its own per-CPU state.
2026-04-25 grounding checkpoint: the next implementation slice should use the
current local limine crate’s MP API, not the older SmpRequest naming used
in some protocol examples. In capOS’s pinned crate, limine::request::MpRequest
returns architecture-specific limine::mp::MpRespData; x86_64 CPU records are
limine::mp::MpInfo values with processor_id, lapic_id,
MpInfo::bootstrap(entry, extra_arg), and MpInfo::extra_argument(). The
Phase B implementation is split into two checkpoints: first enumerate CPUs,
assign dense capOS CPU ids separately from Limine’s ACPI processor_id, and
allocate AP state/stack slots; then bind each non-BSP CPU to a slot via
extra_arg, start it with bootstrap, and park it in a kernel idle loop after
local CPU initialization. Both checkpoints are implemented; APs still must not
run userspace or mutate the global scheduler.
Limine MP Request
Limine provides an MP response with per-CPU records. Each x86_64 record
contains an ACPI processor id, LAPIC ID, and an atomic boot handoff. In the
local limine crate, callers should use MpInfo::bootstrap() rather than
writing the raw goto_addr field directly.
    use limine::request::MpRequest;

    static MP_REQUEST: MpRequest = MpRequest::new(0);

    fn start_aps() {
        let mp = MP_REQUEST.response().expect("no MP response");
        let mut next_cpu_id = 1;
        for cpu in mp.cpus() {
            if cpu.lapic_id == mp.bsp_lapic_id {
                continue; // skip BSP
            }
            let cpu_id = next_cpu_id;
            next_cpu_id += 1;
            record_boot_processor_id(cpu_id, cpu.processor_id);
            let ap = init_ap_record(cpu_id, cpu.processor_id, cpu.lapic_id);
            cpu.bootstrap(ap_entry, ap as *const ApCpu as u64);
        }
    }
AP Entry
Each AP must:
- Switch to the capOS kernel PML4 and AP-owned kernel stack
- Enable per-CPU CR4 state used by the kernel page tables and user-access guards
- Load its per-CPU GDT and TSS
- Load the shared IDT
- Set KernelGSBase MSR to its PerCpu pointer
- Configure SYSCALL MSRs (STAR, LSTAR, SFMASK, EFER.SCE)
- Signal “ready” to BSP (atomic flag or counter)
- Enter a parked kernel idle loop
Local APIC timer setup and IPI handling remain separate Stage 7 gates; parked APs keep interrupts disabled until that work is ready.
    /// AP entry point. Called by Limine with the MP info pointer.
    unsafe extern "C" fn ap_entry(info: &limine::mp::MpInfo) -> ! {
        let ap_ptr = info.extra_argument() as *const ApCpu;
        let ap = unsafe {
            ap_ptr
                .as_ref()
                .expect("Limine AP extra_argument must be an ApCpu pointer")
        };
        let per_cpu = ap.per_cpu();
        // Switch from Limine state to capOS-owned paging and AP stack.
        ap.switch_to_kernel_paging_and_stack();
        // Match per-CPU CR4 state after the kernel PML4 is live.
        paging::enable_global_pages_on_current_cpu();
        smap::init();
        // Load this CPU's GDT + TSS
        ap.descriptors.load();
        // Shared IDT (same across all CPUs)
        idt::init();
        // Set GS base for swapgs
        unsafe { wrmsr(IA32_KERNEL_GS_BASE, per_cpu as *const _ as u64); }
        // Configure syscall MSRs (same values as BSP)
        syscall::init_msrs();
        // Signal ready
        ap.online.store(true, Ordering::Release);
        AP_READY_COUNT.fetch_add(1, Ordering::AcqRel);
        // Park until a later scheduler milestone gives APs runnable work.
        ap_idle_loop();
    }
The extra_argument pointer must name an initialized, non-null ApCpu record
whose storage outlives the AP. The BSP publishes that record before calling
MpInfo::bootstrap(), and the AP treats the contained PerCpu pointer as
CPU-local state after entry.
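On the BSP side, the ready signal pairs with a bounded wait on the online count. The following is a minimal host-runnable sketch of that handshake using `std` atomics (a kernel would use `core::sync::atomic`). `wait_for_aps`, `ap_report_online`, `ApWait`, and the spin budget are hypothetical; the real bound may be time-based rather than iteration-based.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Outcome of the BSP's bounded wait (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum ApWait {
    AllOnline,
    /// The spin budget ran out; `online` APs had reported by then.
    TimedOut { online: usize },
}

/// AP side: publish readiness after local CPU initialization.
fn ap_report_online(ready: &AtomicUsize) {
    ready.fetch_add(1, Ordering::AcqRel);
}

/// BSP side: spin until `expected` APs are online or the budget is spent.
fn wait_for_aps(ready: &AtomicUsize, expected: usize, max_spins: usize) -> ApWait {
    for _ in 0..max_spins {
        // Acquire pairs with the AP's release half of `fetch_add`, so AP
        // initialization is visible once the count is observed.
        if ready.load(Ordering::Acquire) >= expected {
            return ApWait::AllOnline;
        }
        std::hint::spin_loop();
    }
    let online = ready.load(Ordering::Acquire);
    if online >= expected {
        ApWait::AllOnline
    } else {
        ApWait::TimedOut { online }
    }
}
```

Returning the observed count on timeout lets the BSP continue with the CPUs that did come online instead of hanging boot on one unresponsive AP.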
Scheduler Boundary
Phase B does not extend the Stage 5 scheduler. The BSP remains the only CPU
that runs userspace or mutates the global scheduler. APs only run enough kernel
initialization to prove that per-CPU architectural state is valid, signal ready,
and park in a bounded hlt loop.
Per-CPU WFQ runnable queues under the shared scheduler lock, bounded stealing
that chooses the most-overdue runnable sibling candidate, bounded
idle-to-runnable wake targeting that walks eligible idle scheduler CPUs, and
address-space CPU residency tracking are the current Phase C structure. The
temporary 2026-05-02 single-global-runnable-queue collapse is historical;
Scheduler Evolution Phase D (closed 2026-05-10) reintroduced per-CPU queues
with weighted fair ordering, and Phase E closed SchedulingContext
bind/revoke, budget, donation/return, and depletion notification on top of
that. Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry,
housekeeping/deferred-work placement, the bounded SQPOLL ring mode, the
clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake
progress, the first automatic nohz activation increment closed via
docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md,
and SQPOLL-driven auto-nohz activation closed via
docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md;
timeout-based auto-revoke and ordinary-thread generic full-nohz admission are
also landed. Generic SQPOLL nohz for arbitrary rings and policy-service AutoNoHz
issuance remain future work.
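The queue shape and bounded stealing described above can be sketched as follows. This is an illustrative model, not the scheduler's code: `RunQueue`, `pick_next`, and the probe bound are hypothetical, "most overdue" is interpreted here as the smallest virtual_finish_ns among probed sibling queue heads, and preferring local work before stealing is an assumption. The shared scheduler lock is modeled by the exclusive `&mut` borrow of all queues.

```rust
use std::collections::BTreeSet;

type ThreadId = u32;

/// Per-CPU runnable queue ordered by (virtual_finish_ns, thread id).
#[derive(Default)]
struct RunQueue {
    runnable: BTreeSet<(u64, ThreadId)>,
}

impl RunQueue {
    fn enqueue(&mut self, virtual_finish_ns: u64, thread: ThreadId) {
        self.runnable.insert((virtual_finish_ns, thread));
    }

    /// Entry with the earliest virtual finish time, without removing it.
    fn peek(&self) -> Option<(u64, ThreadId)> {
        self.runnable.first().copied()
    }

    fn pop(&mut self) -> Option<(u64, ThreadId)> {
        self.runnable.pop_first()
    }
}

/// Pick the next thread for `cpu`: local queue first, otherwise steal the
/// earliest-finishing head among at most `max_probe` sibling queues.
fn pick_next(queues: &mut [RunQueue], cpu: usize, max_probe: usize) -> Option<(u64, ThreadId)> {
    if let Some(local) = queues.get_mut(cpu)?.pop() {
        return Some(local);
    }
    let n = queues.len();
    let mut best: Option<(usize, (u64, ThreadId))> = None;
    // Bounded scan: never probe more than `max_probe` siblings.
    for step in 1..=max_probe.min(n - 1) {
        let victim = (cpu + step) % n;
        if let Some(head) = queues[victim].peek() {
            if best.map_or(true, |(_, current)| head < current) {
                best = Some((victim, head));
            }
        }
    }
    let (victim, _) = best?;
    queues[victim].pop()
}
```

The probe bound keeps the time spent under the shared lock independent of the total CPU count, which matters once the static four-owner assumption is replaced by discovered topology.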
CPU affinity policy, shared scheduler metadata splitting, scheduler-driven AP
idle policy, broader workload classes, higher-thread-count evidence, and the
named Phase F.5 16/32-core scalability proof remain Phase C/F follow-ups. The
first Phase C scheduler proof may continue to use the current process ring
while the runtime serializes ring consumption.
Full SMP where sibling threads from one process wait independently on different
CPUs should use the Ring v2 direction in
Ring v2 For Full SMP: cap_enter waits on the
current thread’s CQ, not on a shared process CQ.
Boot Sequence
BSP: kernel init (GDT, IDT, memory, heap, caps, scheduler)
BSP: init_per_cpu(0, bsp_lapic_id)
BSP: start_aps()
AP1: ap_entry() → switch CR3/RSP → init GDT/TSS/syscall state → idle_loop()
AP2: ap_entry() → switch CR3/RSP → init GDT/TSS/syscall state → idle_loop()
...
BSP: wait for all APs ready
BSP: load init process, schedule it
BSP: enter scheduler
Phase C: SMP Correctness
With APs parked in kernel idle loops, Phase C makes user scheduling safe on more than one CPU. The order is:
- Move syscall entry/exit and per-CPU access to KernelGsBase/swapgs so APs do not use BSP-symbol-relative syscall stack fields. This includes non-sysretq paths that block or exit through scheduler iretq restore. Done for syscall stack fields and syscall-originated restore paths.
- Add LAPIC timer and IPI support so each CPU can take local scheduler ticks and receive cross-CPU requests. Done for PIT-calibrated BSP LAPIC ticks, parked-AP LAPIC initialization, spurious-vector handling, vector 49, a bounded vector-49-only fixed IPI send primitive, live TLB shootdown users, and bounded idle-to-runnable reschedule requests.
- Add TLB shootdown before any user address space can run on more than one CPU over its lifetime. Done for user page-table map/unmap/protect through resident CPU masks, vector-49 shootdown, pending full-TLB flush generations, completion waits, and syscall-entry/flush-before-user-return hooks. Remote AP targets become active when AP scheduler ownership records AP residency.
- Split scheduler current/run-queue ownership into per-CPU state, with a reviewed AP idle-to-runnable handoff. Done for per-CPU current-thread slots, the first AP cpu=1 scheduler owner handoff, temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and bounded idle-to-runnable wake targeting; shared scheduler lock reduction, temporary pinning replacement, broader workload evidence, and higher-thread-count evidence remain deferred.
- Prove the existing manifest/ring/thread/park smokes under -smp 2.
With multiple CPUs running scheduler-owned work, shared mutable state needs careful handling.
TLB Shootdown
When the kernel modifies page tables that other CPUs may have cached in their TLBs, it must send an IPI to those CPUs to invalidate the affected entries.
Scenarios requiring shootdown:
- Process exit – unmapping user pages. Only the CPU running the process has the mapping cached, but if the process migrated recently, stale TLB entries may exist on the old CPU.
- Shared kernel mappings – changes to the kernel half of page tables (e.g., heap growth, MMIO mappings) require all-CPU shootdown.
- Capability-granted shared memory – if future stages allow shared memory regions between processes, modifications require targeted shootdown.
Current code uses local mapper flushes in AddressSpace::map,
AddressSpace::unmap, and AddressSpace::protect, then calls the serialized
shootdown helper with the address space’s resident CPU mask. Those methods are
reached from VirtualMemoryCap’s parse_map, parse_unmap, and
parse_protect anonymous mapping paths and
MemoryObjectCap::{map,unmap,protect} borrowed mapping paths. Scheduler CR3
handoff marks the selected address space resident on the current CPU, including
AP cpu=1 during the first AP scheduler-owner proof.
Implementation state consists of vector 49, a resident CPU target mask, and
per-CPU pending full-TLB flush generations. The first implementation records
pending flush generations for online resident CPUs other than the caller, after
the local page-table edit and local flush complete, then sends vector-49 IPIs to
prompt immediate drain and returns a completion token. VM capability handlers
enqueue completion work after dropping the address-space guard, and cap_enter
or timer polling drains the queue after ring dispatch releases cap-table and
scratch locks. Handlers reserve fixed-size queue slots before page-table
mutation, so overload is reported before rollback, unmap, or protect can mutate
state. Drains flush the current CPU before waiting, so a CPU that is itself in
the target mask cannot wait on its own pending generation. A target CPU that is
already in a syscall and contending on those
same locks can eventually reach the IPI or return-path drain. If a target CPU
has maskable interrupts delayed while it runs a kernel path, it still drains its
pending generation at syscall entry or before returning to userspace from
syscall, timer, or scheduler restore paths.
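use core::sync::atomic::Ordering;

// Sketch only: online_cpu_mask(), current_cpu_bit(), the generation helpers,
// and the per-CPU generation arrays are kernel-internal and elided here.
fn shootdown_page(resident_cpu_mask: u64) -> ShootdownCompletion {
    // Target online resident CPUs other than the caller.
    let targets = resident_cpu_mask & online_cpu_mask() & !current_cpu_bit();
    let generation = next_shootdown_generation();
    // A u64 mask is not iterable, so walk its set bits, lowest CPU first.
    let mut remaining = targets;
    while remaining != 0 {
        let cpu_id = remaining.trailing_zeros() as usize;
        remaining &= remaining - 1; // clear the lowest set bit
        PENDING_FLUSH_GENERATION[cpu_id].store(generation, Ordering::Release);
        lapic::send_fixed_ipi(lapic_id_for_cpu(cpu_id));
    }
    ShootdownCompletion { targets, generation }
}

fn flush_pending_for_current_cpu() {
    // Loop until caught up: a newer generation may be posted during the flush.
    while pending_generation(current_cpu_id()) != flushed_generation(current_cpu_id()) {
        // Read the generation before flushing so the recorded value is covered by the flush.
        let generation = pending_generation(current_cpu_id());
        x86_64::instructions::tlb::flush_all();
        FLUSHED_GENERATION[current_cpu_id()].store(generation, Ordering::Release);
    }
}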
The first implementation targets the address space’s resident CPU mask rather than every online CPU so parked APs with interrupts disabled are not disturbed. It relies on kernel user-buffer access continuing through address-space-locked HHDM copy/read helpers rather than raw user virtual addresses while a delayed flush generation exists. Broader range and page-level coalescing can be added after AP scheduling exists.
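The completion token returned by the shootdown helper has to be compared against what each target CPU has flushed. The following is a minimal hosted sketch of that check, not the kernel's code. It assumes generations only increase, so a target is done once its flushed generation is at least the token's generation. `MAX_CPUS`, `FlushedTable`, `record_flush`, and `is_complete` are hypothetical names, and std atomics stand in for the kernel's per-CPU statics.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical CPU limit for this sketch; the proposal does not state one.
const MAX_CPUS: usize = 64;

/// Completion token: which CPUs must flush, and up to which generation.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ShootdownCompletion {
    pub targets: u64,
    pub generation: u64,
}

/// Hosted stand-in for the per-CPU "last flushed generation" table.
pub struct FlushedTable {
    flushed: [AtomicU64; MAX_CPUS],
}

impl FlushedTable {
    pub fn new() -> Self {
        Self { flushed: std::array::from_fn(|_| AtomicU64::new(0)) }
    }

    /// Called by a target CPU after it has flushed its TLB.
    /// `fetch_max` keeps the recorded generation monotonic.
    pub fn record_flush(&self, cpu_id: usize, generation: u64) {
        self.flushed[cpu_id].fetch_max(generation, Ordering::Release);
    }

    /// True once every CPU in the token's target mask has flushed at least
    /// the token's generation. An empty target mask is trivially complete.
    pub fn is_complete(&self, token: ShootdownCompletion) -> bool {
        let mut remaining = token.targets;
        while remaining != 0 {
            let cpu_id = remaining.trailing_zeros() as usize;
            remaining &= remaining - 1; // clear the lowest set bit
            if self.flushed[cpu_id].load(Ordering::Acquire) < token.generation {
                return false;
            }
        }
        true
    }
}
```

Because the check is a pure read over atomics, it can be polled from the deferred completion drain without taking the address-space lock.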
LAPIC/IPI Boundary
The normal timer path is now local-APIC-backed: vector 48 handles scheduler ticks with LAPIC EOI after PIT-channel-2 calibration, vector 49 handles TLB shootdown and bounded idle-to-runnable reschedule requests, vector 255 handles LAPIC spurious interrupts without EOI, and vector 32 remains only for the PIT/PIC fallback. AP scheduler owners program their LAPIC timers from the BSP calibration before entering the scheduler-owner loop; if AP timer setup is unavailable, the BSP keeps scheduler ownership. The bounded idle-to-runnable wake request path is already in place. The remaining LAPIC/IPI work is broader scheduler-driven AP idle policy, future preemptive reschedule policy, and a later x2APIC MSR backend once the architectural xAPIC MMIO path is correct.
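The vector assignments above can be summarized as a small lookup. This is an illustrative sketch only: the enum, its variant names, and the `Ack` type are hypothetical, and the PIC EOI for vector 32 is an assumption based on conventional 8259 handling rather than something the proposal states.

```rust
/// Interrupt vectors named in the text above (variant names are hypothetical).
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Vector {
    PitPicFallback = 32,
    LapicTimer = 48,
    ShootdownOrResched = 49,
    LapicSpurious = 255,
}

/// How the handler acknowledges the interrupt.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Ack {
    LapicEoi,
    /// Assumption: the legacy fallback path acknowledges through the PIC.
    PicEoi,
    None,
}

impl Vector {
    /// Map a raw vector number to a known vector, if it is one of ours.
    pub fn from_u8(raw: u8) -> Option<Self> {
        match raw {
            32 => Some(Vector::PitPicFallback),
            48 => Some(Vector::LapicTimer),
            49 => Some(Vector::ShootdownOrResched),
            255 => Some(Vector::LapicSpurious),
            _ => None,
        }
    }

    /// Acknowledgement policy per vector; spurious interrupts get no EOI.
    pub fn ack(self) -> Ack {
        match self {
            Vector::PitPicFallback => Ack::PicEoi,
            Vector::LapicTimer | Vector::ShootdownOrResched => Ack::LapicEoi,
            Vector::LapicSpurious => Ack::None,
        }
    }
}
```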
The TLB shootdown IPI handler must not allocate and must not take locks that can be held while sending a shootdown. Completion waits must happen after dropping the mutated address space’s lock and ring dispatch’s cap-table/scratch locks. The deferred completion queue must remain bounded, non-allocating at enqueue, and reserved before page-table mutation. Syscall-entry and user-return paths must drain pending flush generations so delayed maskable IPI delivery cannot leave a target CPU unable to observe completion or resume a thread with stale TLB state.
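The "bounded, non-allocating at enqueue, and reserved before page-table mutation" rule can be pictured as a reserve-then-commit queue. The sketch below is a hosted illustration under assumptions: `QUEUE_SLOTS`, `CompletionQueue`, `Reservation`, and `Overloaded` are hypothetical names, the capacity is arbitrary, and the real queue's locking is not modeled.

```rust
/// Hypothetical capacity; the proposal only says the queue is fixed-size.
const QUEUE_SLOTS: usize = 4;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Completion {
    pub targets: u64,
    pub generation: u64,
}

/// Returned when no slot can be reserved; nothing has been mutated yet.
#[derive(Debug, PartialEq, Eq)]
pub struct Overloaded;

/// Proof that one slot is held; consumed by `commit` or `cancel`.
pub struct Reservation(());

/// Fixed-size ring buffer: enqueue never allocates.
pub struct CompletionQueue {
    slots: [Option<Completion>; QUEUE_SLOTS],
    head: usize,     // index of the oldest committed entry
    len: usize,      // committed entries
    reserved: usize, // slots promised but not yet committed
}

impl CompletionQueue {
    pub const fn new() -> Self {
        Self { slots: [None; QUEUE_SLOTS], head: 0, len: 0, reserved: 0 }
    }

    /// Step 1, before any page-table mutation: claim a slot or report overload.
    pub fn reserve(&mut self) -> Result<Reservation, Overloaded> {
        if self.len + self.reserved == QUEUE_SLOTS {
            return Err(Overloaded);
        }
        self.reserved += 1;
        Ok(Reservation(()))
    }

    /// Step 2, after the mutation and shootdown: this cannot fail.
    pub fn commit(&mut self, _slot: Reservation, completion: Completion) {
        let tail = (self.head + self.len) % QUEUE_SLOTS;
        self.slots[tail] = Some(completion);
        self.len += 1;
        self.reserved -= 1;
    }

    /// Release a reservation when the mutation was not performed.
    pub fn cancel(&mut self, _slot: Reservation) {
        self.reserved -= 1;
    }

    /// Drain side: take the oldest committed completion, if any.
    pub fn pop(&mut self) -> Option<Completion> {
        let completion = self.slots[self.head].take()?;
        self.head = (self.head + 1) % QUEUE_SLOTS;
        self.len -= 1;
        Some(completion)
    }
}
```

Splitting enqueue into `reserve` and `commit` is what lets a handler fail early: the only fallible step happens while the page tables are still untouched.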
KVM paravirtual features such as kvm-pv-eoi, kvm-pv-ipi, and
kvm-pv-tlb-flush are future performance work. They must not be required for
the first LAPIC timer, IPI, or TLB-shootdown correctness proofs.
Lock Audit
Existing spinlocks need review for SMP safety:
| Lock | Current Use | SMP Concern |
|---|---|---|
| SERIAL | COM1 output | Safe but high contention if many CPUs print. Acceptable for debug output. |
| ALLOCATOR | Frame bitmap | Hot path. Holding lock during full bitmap scan is O(n). Consider per-CPU free lists. |
| KERNEL_CAPS | Kernel cap table | Low contention (init only). Safe. |
| SCHEDULER.current | Single global running-thread slot | Split into PerCpu.current_thread in Phase A. |
Before APs can run userspace, the scheduler also needs an explicit CPU residency record for each live thread or address space. That record drives TLB shootdown targeting and prevents migration from racing page-table changes. Process exit and thread exit must clear residency before freeing stacks, address spaces, or ring state that another CPU might still observe.
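A residency record of this kind can be as small as one atomic bitmask per address space. The sketch below is a hosted illustration, not the kernel's type: `Residency` and its method names are hypothetical, and it assumes at most 64 CPUs so the mask fits in a `u64`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-address-space residency record (one bit per CPU).
pub struct Residency {
    mask: AtomicU64,
}

impl Residency {
    pub const fn new() -> Self {
        Self { mask: AtomicU64::new(0) }
    }

    /// Called on CR3 handoff: this CPU may now cache the mappings.
    pub fn mark_resident(&self, cpu_id: u32) {
        self.mask.fetch_or(1u64 << cpu_id, Ordering::AcqRel);
    }

    /// Called once the CPU has switched away and flushed.
    pub fn clear_resident(&self, cpu_id: u32) {
        self.mask.fetch_and(!(1u64 << cpu_id), Ordering::AcqRel);
    }

    /// CPUs that need a shootdown: resident, online, and not the caller.
    pub fn shootdown_targets(&self, online_mask: u64, current_cpu: u32) -> u64 {
        self.mask.load(Ordering::Acquire) & online_mask & !(1u64 << current_cpu)
    }

    /// Exit paths should only free stacks, page tables, or ring state
    /// once no CPU is recorded as resident.
    pub fn safe_to_free(&self) -> bool {
        self.mask.load(Ordering::Acquire) == 0
    }
}
```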
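Interrupt + spinlock deadlock: if CPU A holds a spinlock and takes an
interrupt whose handler tries to acquire the same lock, the handler spins
forever because the interrupted holder can never release it. This is
already noted in REVIEW.md. Fix: disable interrupts while holding locks
that interrupt handlers may need (frame allocator, serial). The spin crate's
Mutex does not disable interrupts by itself, so use an interrupt-disabling
wrapper around it (for example built on
x86_64::instructions::interrupts::without_interrupts) or manual cli/sti wrappers.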
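The interrupts-off-while-locked pattern can be sketched as a guard that saves and disables interrupts before spinning, then restores the previous state after unlocking. This is a hosted model under loud assumptions: `INTERRUPTS_ENABLED` is a plain atomic standing in for the per-CPU interrupt flag (in a kernel it would be cli/sti plus the saved RFLAGS.IF), and `IrqSpinLock` and `IrqGuard` are hypothetical names, not types from any crate.

```rust
use std::cell::UnsafeCell;
use std::ops::{Deref, DerefMut};
use std::sync::atomic::{AtomicBool, Ordering};

/// Stand-in for the CPU interrupt flag so the sketch runs in user space.
pub static INTERRUPTS_ENABLED: AtomicBool = AtomicBool::new(true);

fn irq_save_and_disable() -> bool {
    INTERRUPTS_ENABLED.swap(false, Ordering::SeqCst)
}

fn irq_restore(was_enabled: bool) {
    if was_enabled {
        INTERRUPTS_ENABLED.store(true, Ordering::SeqCst);
    }
}

pub struct IrqSpinLock<T> {
    locked: AtomicBool,
    value: UnsafeCell<T>,
}

// Safety: `value` is only reachable through a guard, and `locked`
// guarantees at most one guard exists at a time.
unsafe impl<T: Send> Sync for IrqSpinLock<T> {}

pub struct IrqGuard<'a, T> {
    lock: &'a IrqSpinLock<T>,
    was_enabled: bool,
}

impl<T> IrqSpinLock<T> {
    pub const fn new(value: T) -> Self {
        Self { locked: AtomicBool::new(false), value: UnsafeCell::new(value) }
    }

    pub fn lock(&self) -> IrqGuard<'_, T> {
        // Disable interrupts first so no handler can run on this CPU
        // between acquiring the lock and releasing it.
        let was_enabled = irq_save_and_disable();
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
        IrqGuard { lock: self, was_enabled }
    }
}

impl<T> Deref for IrqGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        // Safety: the guard holds the lock.
        unsafe { &*self.lock.value.get() }
    }
}

impl<T> DerefMut for IrqGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        // Safety: the guard holds the lock exclusively.
        unsafe { &mut *self.lock.value.get() }
    }
}

impl<T> Drop for IrqGuard<'_, T> {
    fn drop(&mut self) {
        // Unlock first, then restore the saved interrupt state.
        self.lock.locked.store(false, Ordering::Release);
        irq_restore(self.was_enabled);
    }
}
```

Restoring the saved state, rather than unconditionally re-enabling, keeps nested acquisitions correct: an inner guard dropped while an outer guard is live leaves interrupts off.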
Allocator Scaling
The frame allocator is behind a single spinlock with O(n) bitmap scan. Under SMP, this becomes a contention bottleneck.
Options (in order of complexity):
- Per-CPU free list cache – each CPU maintains a small cache of free frames (e.g., 64 frames). It refills from the global allocator when empty and returns a batch when full, which cuts global lock acquisitions by up to ~64x.
- Region partitioning – divide physical memory into per-CPU regions. Each CPU owns a bitmap partition. Cross-CPU allocation falls back to a global lock. More complex, better NUMA behavior (future).
The per-CPU free list cache (the first option) is recommended for initial SMP. Estimated size: ~50-100 lines.
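The per-CPU free list cache can be sketched in hosted Rust to show where the lock savings come from. This is an illustration under assumptions: a locked `Vec` stands in for the kernel's frame bitmap, `GlobalFrames` and `PerCpuFrameCache` are hypothetical names, and returning half the cache when full is one possible batch policy, not one the proposal specifies. A kernel version would use a fixed array instead of `Vec`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Cache size taken from the "e.g., 64 frames" figure in the text.
const CACHE_FRAMES: usize = 64;

/// Hosted stand-in for the global allocator, counting lock acquisitions.
pub struct GlobalFrames {
    free: Mutex<Vec<usize>>,
    lock_acquisitions: AtomicUsize,
}

impl GlobalFrames {
    pub fn new(frame_count: usize) -> Self {
        Self {
            free: Mutex::new((0..frame_count).collect()),
            lock_acquisitions: AtomicUsize::new(0),
        }
    }

    pub fn lock_acquisitions(&self) -> usize {
        self.lock_acquisitions.load(Ordering::Relaxed)
    }

    /// Move up to `max` frames into `out` under one lock acquisition.
    fn take_batch(&self, out: &mut Vec<usize>, max: usize) {
        self.lock_acquisitions.fetch_add(1, Ordering::Relaxed);
        let mut free = self.free.lock().unwrap();
        let start = free.len() - max.min(free.len());
        out.extend(free.drain(start..));
    }

    /// Move the last `count` frames of `frames` back under one lock acquisition.
    fn return_batch(&self, frames: &mut Vec<usize>, count: usize) {
        self.lock_acquisitions.fetch_add(1, Ordering::Relaxed);
        let mut free = self.free.lock().unwrap();
        let start = frames.len() - count;
        free.extend(frames.drain(start..));
    }
}

/// One instance per CPU; only its owner touches it, so it needs no lock.
pub struct PerCpuFrameCache {
    frames: Vec<usize>,
}

impl PerCpuFrameCache {
    pub fn new() -> Self {
        Self { frames: Vec::with_capacity(CACHE_FRAMES) }
    }

    /// Allocate from the local cache, refilling in one batch when empty.
    pub fn alloc(&mut self, global: &GlobalFrames) -> Option<usize> {
        if self.frames.is_empty() {
            global.take_batch(&mut self.frames, CACHE_FRAMES);
        }
        self.frames.pop()
    }

    /// Free into the local cache, returning half of it when full.
    pub fn free(&mut self, global: &GlobalFrames, frame: usize) {
        if self.frames.len() == CACHE_FRAMES {
            global.return_batch(&mut self.frames, CACHE_FRAMES / 2);
        }
        self.frames.push(frame);
    }
}
```

In this model, 128 consecutive allocations take the global lock twice instead of 128 times, which is where the "up to ~64x" figure comes from.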
The heap allocator (linked_list_allocator) is also behind a single lock.
For a research OS this is acceptable initially – heap allocations in the
kernel should be infrequent compared to frame allocations.
Cap’n Proto Schema Additions
SMP introduces a kernel-internal CpuManager capability for inspecting and
controlling CPU state. This is not exposed to userspace initially but follows
the “everything is a capability” principle.
interface CpuManager {
    # Number of online CPUs.
    cpuCount @0 () -> (count :UInt32);
    # Per-CPU info.
    cpuInfo @1 (cpuId :UInt32) -> (lapicId :UInt32, online :Bool);
}
This capability would be held by init (or a system monitor process) for diagnostics. It’s additive and can be deferred until the mechanism is useful.
Estimated Scope
| Phase | New/Changed Code | Depends On |
|---|---|---|
| Phase A: BSP per-CPU foundation | Done (BSP PerCpu, syscall-stack storage, scheduler mirror, stack-update hook) | Stage 5 |
| Phase B: AP startup | Done (MpRequest, AP records/stacks, AP CR3/RSP handoff, parked idle) | Phase A |
| Phase C: Multi-CPU scheduling | In progress (GS/swapgs migration, LAPIC timer/IPI with EOI, shootdown-aware VM mutation wrappers, pending TLB generation completion, per-CPU current slots, temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and bounded idle-to-runnable wake targeting are implemented; shared scheduler lock reduction, temporary pinning replacement, scheduler-driven AP idle policy, broader workload evidence, and higher-thread-count evidence remain open) | Phase B |
| Ring v2 for full SMP | TBD (per-thread rings, completion routing, SQPOLL ownership) | Phase C plus threading/park |
| Total | TBD after Phase C hardware/scheduler audit | |
Milestones
- M1: Per-CPU data on BSP – BSP PerCpu syscall-stack/current-thread state, BSP per-CPU kernel-entry stack hook, and single-CPU QEMU proofs. Done.
- M2: APs running – secondary CPUs reach idle_loop(). BSP prints “N CPUs online”. make run still runs init on BSP. Done.
- M3: TLB shootdown – page table modifications are safe across CPUs. Process exit on one CPU doesn’t leave stale mappings on others. Done for address-space resident masks and AP cpu=1 residency marking.
- M4: Multi-CPU scheduling – processes can run on any CPU. The existing boot-manifest service set still works, but the scheduler distributes work across CPUs once runnable processes are available (runtime spawning still depends on ProcessSpawner). Temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and bounded idle-to-runnable wake targeting are implemented; shared scheduler lock reduction, temporary pinning replacement, scheduler-driven AP idle policy, broader workload evidence, and higher-thread-count evidence remain open.
- M5: Ring v2 completion ownership – every live thread can own a ring endpoint; endpoint, timer, park, process-wait, and thread-join completions route by ThreadRef. This is the target for full SMP where sibling threads in one process wait independently on different CPUs.
Open Questions
- x2APIC backend. Phase C currently has an xAPIC MMIO LAPIC foundation. A later x2APIC MSR backend is still needed for newer/high-core systems and firmware states where xAPIC is unavailable or locked out; it should not block TLB shootdown on the current implementation path.
- Idle strategy. hlt is the simplest idle. mwait is more power-efficient and can be used to wake on memory writes. Overkill for QEMU, but worth noting for future hardware targets.
- CPU hotplug. Limine starts all CPUs at boot. Runtime CPU online/offline is a future concern, not needed initially.
- NUMA awareness. Multi-socket systems have non-uniform memory access. Per-CPU frame allocator regions could be NUMA-aware. Deferred – QEMU emulates flat memory by default.
- Scheduler policy. The current multi-CPU scheduler uses per-CPU WFQ runnable queues ordered by virtual_finish_ns under the shared scheduler lock, with bounded stealing from sibling queues when a CPU has no local runnable entry. Scheduler Evolution Phase D (per-CPU WFQ and bounded stealing, closed 2026-05-10) and Phase E (SchedulingContext bind/revoke, budget, donation/return, depletion notification) are closed against this substrate; Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work placement, the bounded SQPOLL ring mode, the clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake progress; the first automatic nohz activation increment and SQPOLL-driven auto-nohz activation are both closed (see docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md and docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md). The older round-robin/global-overflow starting point is historical, not the current baseline. Future refinements are shared-lock reduction, temporary pinning replacement, stronger CPU-affinity/admission policy, broader workload-class evidence, higher-thread-count evidence, and the Phase F.5 full-SMP 16/32-core scalability proof.
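The scheduler-policy item describes per-CPU queues ordered by virtual_finish_ns with bounded stealing. The sketch below illustrates that shape in hosted Rust and is not the kernel's scheduler: `Runnable`, `Scheduler`, and `STEAL_PROBE_LIMIT` are hypothetical, the probe order (next siblings by CPU index) and stealing the victim's earliest-finish entry are assumptions, and `&mut self` stands in for the shared scheduler lock.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical bound on how many sibling queues an idle CPU may probe.
const STEAL_PROBE_LIMIT: usize = 2;

/// Field order matters: the derived ordering compares
/// `virtual_finish_ns` first, then `thread_id` as a tie-breaker.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Runnable {
    pub virtual_finish_ns: u64,
    pub thread_id: u32,
}

/// One min-heap per CPU (BinaryHeap is a max-heap, hence `Reverse`).
pub struct Scheduler {
    queues: Vec<BinaryHeap<Reverse<Runnable>>>,
}

impl Scheduler {
    pub fn new(cpu_count: usize) -> Self {
        Self { queues: (0..cpu_count).map(|_| BinaryHeap::new()).collect() }
    }

    pub fn enqueue(&mut self, cpu: usize, entry: Runnable) {
        self.queues[cpu].push(Reverse(entry));
    }

    /// Pick the local entry with the smallest virtual finish time. If the
    /// local queue is empty, probe at most STEAL_PROBE_LIMIT siblings.
    pub fn pick_next(&mut self, cpu: usize) -> Option<Runnable> {
        if let Some(Reverse(entry)) = self.queues[cpu].pop() {
            return Some(entry);
        }
        let cpu_count = self.queues.len();
        for offset in 1..cpu_count.min(STEAL_PROBE_LIMIT + 1) {
            let victim = (cpu + offset) % cpu_count;
            if let Some(Reverse(entry)) = self.queues[victim].pop() {
                return Some(entry);
            }
        }
        None
    }
}
```

The bound means an idle CPU can miss work that sits outside its probe window. That is the trade-off bounded stealing makes to keep the time spent under the shared lock predictable.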
References
Specifications
- Intel SDM Vol. 3, Chapter 8 – Multiple-Processor Management (AP startup, APIC, IPI)
- Intel SDM Vol. 3, Chapter 10 – APIC (Local APIC, I/O APIC, x2APIC)
- xAPIC Deprecation Plan – Intel guidance on x2APIC defaults, legacy xAPIC deprecation, and guest virtualization
- CPUID Enumeration and Architectural MSRs – x2APIC MSR range and xAPIC disable/lock behavior
- OSDev Wiki: SMP
- OSDev Wiki: APIC
Limine
- Limine SMP Feature – MP request/response API, AP startup mechanism
Virtualization
- QEMU / KVM CPU model configuration – CPU feature exposure, host passthrough, and named-model configuration
- QEMU Paravirtualized KVM features – optional KVM PV EOI, IPI, TLB-flush, and extended destination-id features
- Linux KVM API – VMM-side LAPIC/x2APIC state handling
Prior Art
- Redox SMP – per-CPU contexts, LAPIC timer, IPI-based TLB shootdown
- xv6-riscv SMP – minimal multi-core OS, clean per-CPU implementation
- Hermit SMP – Rust unikernel with SMP support via per-core data and APIC
- BlogOS – educational x86_64 Rust OS (single-CPU, but good APIC coverage)