Proposal: Symmetric Multi-Processing (SMP)

How capOS goes from single-CPU execution to utilizing all available processors.

The SMP substrate is one half of capOS’s multicore story; scheduler policy above it is the other half, and they advance through coupled gates. Read this proposal together with:

  • Scheduler Evolution – Phase D (per-CPU WFQ, bounded stealing) and Phase E (SchedulingContext bind/revoke, budget, donation/return, depletion notification) are closed. Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work placement, bounded SQPOLL ring mode, the clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake progress. The first automatic nohz activation increment is closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md, and SQPOLL-driven auto-nohz activation is closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md. Timeout-based auto-revoke and ordinary-thread generic full-nohz admission have also landed. Generic SQPOLL nohz for arbitrary rings and policy-service AutoNoHz issuance remain future work. Phase F.5 (full-SMP 16/32-core scalability planning) is the named gate for the milestone described below in Full-SMP Scalability Milestone; it is still in planning and is not closed.
  • In-Process Threading Contract – thread-owned execution state, generation-checked ThreadRef queues and wake records, per-thread ring mappings, and the recorded same-process 1-to-2 / diagnostic 1-to-4 evidence rows that this proposal’s scalability work must keep honoring.
  • Design Risks Register, Q9 – CPU accounting and scheduling contexts – partial-status answer that covers per-CPU WFQ, Phase E SchedulingContext, and the cross-service donation / nohz activation / isolation lease / cross-principal fairness work still open.
  • Ring v2 For Full SMP – per-thread ring endpoints and cap_enter-on-thread-CQ are the dispatch contract this proposal’s scheduler-ownership milestones rely on.
  • SMP Phase C backlog – decomposed task list for the in-progress Phase C work tracked below.

The migrated task kernel-upper-half-pml4-propagation-hardening carries the Phase C residual for kernel upper-half page-table mutation after AP startup. The retained finding is closed for the current kernel MMIO/firmware helper path: paging::init() pre-seeds the helper’s upper-half PML4 slot, AddressSpace::new_user clones upper-half entries from the synchronized kernel root under the kernel page-table lock, and map_kernel_physical_range rejects any attempt to create a previously absent kernel-half PML4 slot after a user address space has been created. User-side AddressSpace::{map,unmap,protect} remains shootdown-aware against resident CPU masks; kernel upper-half edits inside pre-existing slots use the kernel-wide shootdown path. Future helper windows or allocator-growth paths that would require a new upper-half PML4 slot must pre-seed that slot before user address-space creation or add synchronized active propagation into live address spaces.
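
The slot rule in the paragraph above can be sketched as a small guard. This is a minimal illustration, assuming 4-level paging with a 512-entry PML4 whose kernel half occupies indices 256 to 511. `KernelRoot`, `seeded`, `user_spaces_created`, and `check_kernel_map` are hypothetical names for this sketch and are not capOS's paging API.

```rust
/// Index of the PML4 entry that covers `vaddr` under 4-level paging.
fn pml4_index(vaddr: u64) -> usize {
    ((vaddr >> 39) & 0x1ff) as usize
}

/// Hypothetical model of the kernel root's upper-half slot bookkeeping.
struct KernelRoot {
    /// One flag per PML4 slot; only indices 256..512 (kernel half) matter.
    seeded: [bool; 512],
    /// Set once the first user address space has cloned the upper half.
    user_spaces_created: bool,
}

#[derive(Debug, PartialEq)]
enum MapError {
    NotKernelHalf,
    NewSlotAfterUserSpace,
}

impl KernelRoot {
    fn new() -> Self {
        KernelRoot { seeded: [false; 512], user_spaces_created: false }
    }

    /// Accept a kernel mapping only if it cannot create a PML4 slot that
    /// existing user address spaces never cloned.
    fn check_kernel_map(&mut self, vaddr: u64) -> Result<usize, MapError> {
        let idx = pml4_index(vaddr);
        if idx < 256 {
            return Err(MapError::NotKernelHalf);
        }
        if !self.seeded[idx] {
            if self.user_spaces_created {
                return Err(MapError::NewSlotAfterUserSpace);
            }
            // Pre-seeding is still allowed before any user space exists.
            self.seeded[idx] = true;
        }
        Ok(idx)
    }
}
```

Edits inside an already seeded slot stay legal after user address spaces exist, which matches the split described above: only the creation of a new upper-half slot needs pre-seeding or active propagation.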

This document has three phases: a per-CPU foundation (prerequisite plumbing), AP startup (bringing secondary CPUs online), and SMP correctness (making shared state safe under concurrency).

Current status: Phase A’s BSP per-CPU foundation and Phase B AP startup are complete. Phase C has completed syscall GS migration, LAPIC/IPI, TLB shootdown, the first AP scheduler-owner handoff, temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues under the shared scheduler lock, bounded stealing, and bounded idle-to-runnable wake targeting for queued and direct-IPC wakeups. The current scheduler is no longer the temporary single-global-runnable-queue shape from the 2026-05-02 collapse. Remaining SMP risks are the shared scheduler lock, temporary pinning replacement, scheduler-driven AP idle policy, broader workload classes, and higher-thread-count evidence. The next SMP product-level milestone should be full-SMP scalability evidence on a real 16/32-core environment, with QEMU kept for boot and regression coverage rather than as the primary performance source.

Implementation checkpoint: the BSP now has a concrete PerCpu object with stable syscall-stack offsets, and syscall entry uses KernelGsBase/swapgs to reach the per-CPU kernel RSP and saved user RSP slots. The scheduler mirrors its current ThreadRef into the BSP record.

Second checkpoint: runtime stack switches now flow through percpu::set_kernel_entry_stack, which updates the BSP PerCpu.kernel_rsp slot and the BSP TSS.RSP0 together. Scheduler and interrupt paths no longer coordinate those two updates by calling separate GDT and syscall helpers.

Third checkpoint: kernel/src/arch/x86_64/smp.rs now issues the Limine MpRequest, enumerates non-BSP CPUs, allocates AP-local PerCpu records and kernel/IST stack storage, and records dense capOS CPU ids separately from Limine processor and LAPIC ids.

Fourth checkpoint: APs now start through MpInfo::bootstrap() and reach a parked kernel idle loop. The BSP passes an AP record pointer through Limine extra_argument, waits for a bounded online count, and remains the only CPU that schedules userspace. Each AP loads AP-owned GDT/TSS state, the shared IDT, KernelGsBase, and syscall MSRs, reports online, disables interrupts, and parks in hlt. Review tightened this checkpoint so APs first switch from Limine handoff state to the capOS kernel PML4 and AP-owned kernel stack before any online signal.

Fifth checkpoint: syscall entry/exit now runs with kernel GS active between entry and return. Normal returns swap back before sysretq, and blocking or exiting syscall paths that leave through scheduler iretq restore use a dedicated trampoline to swap GS back before restoring the next user context.

Sixth checkpoint: the BSP now enables xAPIC MMIO, maps the LAPIC page through the kernel MMIO allocator, calibrates the LAPIC timer initial count against PIT channel 2, runs scheduler ticks through LAPIC timer vector 48 with LAPIC EOI, installs the LAPIC spurious vector, and masks the legacy PIC once LAPIC ticks are active. Parked APs initialize local APIC state before reporting online. IDT vector 49 and a bounded vector-49-only fixed IPI send primitive back TLB shootdown and bounded idle-to-runnable reschedule requests.

Seventh checkpoint: user page-table map, unmap, and protect now flush the local CPU and then route through a serialized vector-49 TLB shootdown helper using each AddressSpace’s resident CPU mask. The helper records pending full-TLB flush generations and sends vector-49 IPIs to online resident CPUs other than the caller, then returns a completion token that callers wait on after dropping ring dispatch locks. Scheduler CR3 handoff points mark the selected address space resident on the current CPU.

Eighth checkpoint: scheduler current-thread state is split into per-CPU slots, AP PerCpu records are registered for current-thread and kernel-entry stack updates, AP TSS.RSP0 is updated during context switches, and AP cpu=1 can enter the scheduler from the AP idle loop when its LAPIC timer is available. The first AP proof intentionally keeps one scheduler owner: when AP cpu=1 is online with a programmed timer, the BSP remains in kernel idle so the process-wide capability ring is not executed concurrently. The scheduler idle path is now a per-CPU CPL0 (kernel-mode) idle thread; the user-mode idle process was removed in commit e3c0df01 (2026-05-14 UTC). “Kernel idle” throughout this proposal refers to that per-CPU CPL0 idle thread, not a user-mode idle process.

Depends on: Stage 5 (Scheduling) – needs a working timer, context switch, and run queue on the BSP before adding more CPUs.

Phase B completion: AP startup is implemented and reviewed. The private process-buffer validate_user_buffer TOCTOU blocker is closed for single locked copy/read paths, and Phase A now has the BSP running through concrete per-CPU syscall-stack/current-thread state. TLB shootdown, the first AP scheduler-owner handoff, temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and bounded idle-to-runnable wake targeting are implemented; shared scheduler lock contention, temporary pinning replacement, scheduler-driven AP idle policy, broader workload classes, higher-thread-count evidence, and shared SharedParkSpace park key derivation remain later Stage 7 work. Shared keys still need MemoryObject mapping provenance or object pins before they can keep backing stable beyond one address-space-locked access.


Full-SMP Scalability Milestone

The current SMP evidence reaches four physical-core workers and one eight-logical-CPU SMT run under QEMU/KVM. That was enough to expose scheduler structure problems, but it is not the shape that should define whether capOS really uses modern multicore machines. The next SMP milestone should answer a more concrete question: can ordinary capOS workloads keep useful throughput and bounded scheduler overhead as the machine scales to 16 and 32 physical cores?

Preferred evidence environment:

  • direct capOS boot on a dedicated bare-metal or cloud bare-metal/perf-runner machine with at least 16 physical cores, and a 32-core row when hardware is available;
  • recorded CPU topology, SMT state, APIC mode, timer source, frequency policy, memory size, firmware/device model, source commit, toolchain, and kernel configuration;
  • Linux native baselines on the same machine for comparable CPU workloads;
  • QEMU/KVM rows only for boot/regression continuity or for explicitly labeled virtualized comparisons.

Workload coverage should move beyond one fixed checksum row:

  • static map/reduce checksum over equal byte ranges;
  • uneven dynamic task pool with deterministic task ids and result hash;
  • barrier-heavy phase loop that exposes wakeup and cross-CPU coordination cost;
  • same-process thread workload and independent-process workload;
  • IPC/service-bound worker workload that includes capability calls outside the timed compute loop.

Each workload should report 1, 2, 4, 8, 16, and 32-worker rows when the hardware supports those counts, with SMT rows separated from physical-core rows. Each row should include both work-window time and total time, run count, warmup policy, median, variance, and verifier output. The report should show speedup and efficiency curves instead of reducing the result to one boolean threshold.
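
The speedup and efficiency curves mentioned above follow from the per-row medians. A minimal sketch, assuming each row carries a worker count and a median work-window time; the `Row` type and `scaling` function are hypothetical and do not describe the benchmark proposal's artifact format.

```rust
/// One benchmark row: worker count and median work-window time.
#[derive(Debug, Clone, Copy)]
struct Row {
    workers: u32,
    median_ns: u64,
}

/// Returns (speedup, efficiency) of `row` relative to `baseline`.
/// Speedup is baseline time over row time; efficiency divides the speedup
/// by the factor by which the worker count grew.
fn scaling(baseline: Row, row: Row) -> (f64, f64) {
    let speedup = baseline.median_ns as f64 / row.median_ns as f64;
    let worker_ratio = row.workers as f64 / baseline.workers as f64;
    (speedup, speedup / worker_ratio)
}
```

For example, a 1-worker median of 8000 ns and a 4-worker median of 2500 ns give a speedup of 3.2 and an efficiency of 0.8. Reporting both values for every row keeps the shape of the curve visible instead of collapsing it to one threshold.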

Implementation work expected before this milestone:

  • replace the temporary scheduler CPU mask and static four-owner assumptions with discovered CPU topology and dynamic per-CPU scheduler structures;
  • decide xAPIC versus x2APIC backend selection for larger APIC-id spaces;
  • split or otherwise shrink the shared scheduler-lock critical sections that still serialize queue selection, wakeups, blocking, and cleanup;
  • make placement topology-aware enough to distinguish physical cores, SMT siblings, and later NUMA/cache groups;
  • keep TLB shootdown, timer, reschedule-IPI, cleanup, and accounting costs observable per CPU and per workload phase;
  • keep per-thread ring ownership and SQ-consumer ownership generation-checked as CPU count rises.

This milestone belongs with scheduler evolution and benchmark planning rather than a new standalone proposal: the SMP proposal defines the CPU substrate, Scheduler Evolution Phase F.5 defines dispatch and policy work for full-SMP 16/32-core scalability, the benchmark proposal defines artifact shape, and the HPC parallel-pattern proposal defines the workload matrix. Q9 in the design risks register is the matching open-question entry: base CPU accounting and scheduling-context authority through Phase E are implemented, while cross-service donation, full nohz activation, CPU isolation leases, and cross-principal fairness are the named follow-ons that this milestone’s evidence will be evaluated against.

Current State

APs can boot into kernel idle loops, and CPUs 0-3 can temporarily own scheduler/user work when their LAPIC timers are available. Specific assumptions that Phase C must still remove:

| Component | File | Assumption |
| --- | --- | --- |
| Syscall stack switching | kernel/src/arch/x86_64/syscall.rs, kernel/src/arch/x86_64/percpu.rs | Syscall entry/exit uses KernelGsBase/swapgs and GS-relative PerCpu stack fields on the running CPU |
| AP GDT, TSS, kernel stacks | kernel/src/arch/x86_64/gdt.rs, kernel/src/arch/x86_64/smp.rs | AP-local descriptor tables and stacks exist, and AP TSS.RSP0 updates during AP scheduler context switches |
| IDT | kernel/src/arch/x86_64/idt.rs | Single static IDT (shareable – IDT can be the same across CPUs) |
| SYSCALL MSRs | kernel/src/arch/x86_64/syscall.rs, kernel/src/arch/x86_64/smp.rs | STAR/LSTAR/SFMASK/EFER are initialized on BSP and parked APs; BSP and AP startup both publish KernelGsBase |
| Current thread and run queues | kernel/src/sched.rs, kernel/src/arch/x86_64/percpu.rs | SCHEDULER owns per-CPU current slots, per-CPU WFQ runnable queues ordered by virtual_finish_ns, bounded stealing from sibling queues, and wake placement through WakePolicy::QueueCpu; queued and direct-IPC wakeups iterate eligible idle scheduler CPUs and wake the first that accepts a fresh reschedule IPI, and CPUs 0-3 can temporarily own scheduler/user execution when their LAPIC timers are available, while shared-lock reduction, temporary pinning replacement, broader workload evidence, and higher-thread-count evidence remain deferred |
| Timer/IPI delivery | kernel/src/arch/x86_64/context.rs, kernel/src/arch/x86_64/lapic.rs, kernel/src/arch/x86_64/pic.rs, kernel/src/arch/x86_64/pit.rs, kernel/src/arch/x86_64/tlb.rs | CPUs 0-3 use PIT-calibrated LAPIC timer vector 48 with LAPIC EOI when online; vector 49 services TLB shootdown and bounded reschedule requests |
| Frame allocator | kernel/src/mem/frame.rs | Single global ALLOCATOR behind one spinlock |
| Heap allocator | kernel/src/mem/heap.rs | linked_list_allocator behind one spinlock |

The first checkpoint removed the separate syscall RSP globals and made the BSP PerCpu layout the owner of syscall stack state. The GS checkpoint now uses KernelGsBase/swapgs for those offsets on syscall paths. The LAPIC checkpoint removed the PIT/PIC interrupt dependency from the normal BSP scheduler tick, kept PIT channel 2 as the LAPIC calibration source, installed the spurious vector, and wired the IPI vector. The TLB checkpoint added resident CPU masks, vector-49 shootdown, pending generation counters, completion waits, and syscall-entry plus flush-before-user-return hooks for delayed maskable interrupt delivery. The AP scheduler-owner checkpoint added per-CPU current slots and AP cpu=1 scheduler entry. The remaining Phase C assumptions are in concurrent run-queue ownership and reschedule routing, not in syscall stack lookup, the primary timer source, user page-table mutation invalidation, or AP TSS updates.


Phase A: Per-CPU Foundation

Establish per-CPU data structures on the BSP. No APs are started yet – this phase makes the BSP’s own code SMP-ready so Phase B is a clean addition.

Per-CPU Data Region

Each CPU needs a private data area accessible via the GS segment base. On x86_64, swapgs switches between user-mode GS (usually zero) and kernel-mode GS (pointing to per-CPU data). The kernel sets KernelGSBase MSR on each CPU during init.

The BSP checkpoint originally reached this layout as BSP_PER_CPU+offset from assembly. Phase C now uses the same offsets through GS after swapgs on syscall entry.

/// Per-CPU data, one instance per processor.
/// Accessed via GS-relative addressing after swapgs.
#[repr(C)]
struct PerCpu {
    /// Self-pointer for accessing the struct from GS:0.
    self_ptr: *const PerCpu,
    /// Kernel stack pointer for syscall entry (replaces SYSCALL_KERNEL_RSP).
    kernel_rsp: u64,
    /// Saved user RSP during syscall (replaces SYSCALL_USER_RSP).
    user_rsp: u64,
    /// Currently running thread on this CPU, if one is active.
    current_thread: Option<ThreadRef>,
    /// CPU index (0 = BSP).
    cpu_id: u32,
    /// LAPIC ID (from Limine MP info or CPUID).
    lapic_id: u32,
}

The previous checkpointed syscall entry stub used the same offsets via the BSP symbol:

movq %rsp, BSP_PER_CPU+16(%rip) # PerCpu.user_rsp
movq BSP_PER_CPU+8(%rip), %rsp  # PerCpu.kernel_rsp

The current syscall entry stub uses GS-relative addressing:

swapgs
movq %rsp, %gs:16          # PerCpu.user_rsp
movq %gs:8, %rsp           # PerCpu.kernel_rsp

And symmetrically on return:

movq %gs:16, %rsp          # restore user RSP
swapgs
sysretq
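
The literal offsets 8 and 16 in the assembly are only valid while the #[repr(C)] field order stays fixed. One way to keep the two in step is a compile-time layout check. The sketch below assumes a 64-bit target and Rust 1.77 or later for offset_of!; the constant names and the stand-in ThreadRef are hypothetical, and nothing here claims that capOS carries these exact assertions.

```rust
use std::mem::offset_of;

/// Stand-in for the scheduler's thread handle; hypothetical for this sketch.
#[allow(dead_code)]
#[derive(Clone, Copy)]
struct ThreadRef(u64);

/// Same field order as the PerCpu layout shown earlier.
#[allow(dead_code)]
#[repr(C)]
struct PerCpu {
    self_ptr: *const PerCpu,
    kernel_rsp: u64,
    user_rsp: u64,
    current_thread: Option<ThreadRef>,
    cpu_id: u32,
    lapic_id: u32,
}

/// Offsets that the syscall entry assembly hard-codes as %gs:8 and %gs:16.
const PERCPU_KERNEL_RSP_OFFSET: usize = 8;
const PERCPU_USER_RSP_OFFSET: usize = 16;

// The build fails if a field is reordered or inserted ahead of these slots.
const _: () = assert!(offset_of!(PerCpu, self_ptr) == 0);
const _: () = assert!(offset_of!(PerCpu, kernel_rsp) == PERCPU_KERNEL_RSP_OFFSET);
const _: () = assert!(offset_of!(PerCpu, user_rsp) == PERCPU_USER_RSP_OFFSET);
```

Because the checks run during constant evaluation, a layout change that would silently corrupt the syscall stack switch becomes a compile error instead.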

Non-returning syscall paths need separate handling: exit, a blocking cap_enter, and a terminal ThreadControl.exitThread can leave the syscall entry path by building a CpuContext and restoring another thread with iretq. Those paths must restore user GS ownership before iretq, even though they never execute the normal sysretq epilogue.

Lock And Ownership Rules

PerCpu fields split by owner:

  • kernel_rsp and TSS.RSP0 are updated together through percpu::set_kernel_entry_stack.
  • user_rsp is written only by syscall entry assembly and read only while constructing a blocked-syscall CpuContext.
  • current_thread mirrors Scheduler.current; the scheduler lock remains the authority for choosing and validating the current thread.
  • cpu_id and lapic_id are immutable after CPU initialization.

Phase A keeps the global scheduler lock and process table. PerCpu.current_thread is not a second scheduler authority; it is the per-CPU execution cache that Phase C relies on once multiple CPUs stop sharing one current slot.
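
The first ownership rule, that kernel_rsp and TSS.RSP0 change together, can be pictured as a single update point. This is a simplified sketch with hypothetical stand-in types; the real percpu::set_kernel_entry_stack signature is not shown in this proposal and may differ.

```rust
/// Hypothetical, reduced stand-in for the per-CPU record.
#[derive(Default)]
struct PerCpuStack {
    kernel_rsp: u64,
}

/// Hypothetical, reduced stand-in for the task state segment.
#[derive(Default)]
struct Tss {
    rsp0: u64,
}

/// Per-CPU architectural state that must agree on the kernel entry stack.
#[derive(Default)]
struct CpuEntryState {
    per_cpu: PerCpuStack,
    tss: Tss,
}

impl CpuEntryState {
    /// Syscall entry reads `kernel_rsp`, while interrupts arriving from
    /// ring 3 use `TSS.RSP0`, so both slots are written in one place.
    fn set_kernel_entry_stack(&mut self, stack_top: u64) {
        self.per_cpu.kernel_rsp = stack_top;
        self.tss.rsp0 = stack_top;
    }

    /// True when both entry paths would land on the same stack.
    fn entry_stacks_agree(&self) -> bool {
        self.per_cpu.kernel_rsp == self.tss.rsp0
    }
}
```

Routing every context switch through one helper removes the window in which a syscall and an interrupt could enter the kernel on different stacks for the same thread.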

Per-CPU GDT, TSS, and Stacks

Each CPU needs its own:

  • GDT – the TSS descriptor encodes the linear (virtual) base address of the CPU’s TSS, so each CPU needs a GDT with its own TSS entry. The segment layout (kernel CS/DS, user CS/DS) is identical across CPUs.
  • TSS – privilege_stack_table[0] (kernel stack for interrupts from Ring 3) and IST entries (double-fault stack) must be per-CPU.
  • Kernel stack – each CPU needs its own stack for syscall/interrupt handling. Current size: 16 KB (4 pages). Same size per CPU.
  • Double-fault stack – each CPU needs its own IST stack. Current size: 20 KB (5 pages).

/// Allocate and initialize per-CPU structures for one CPU.
fn init_per_cpu(cpu_id: u32, lapic_id: u32) -> &'static PerCpu {
    // Allocate kernel stack (4 pages) and double-fault stack (5 pages)
    let kernel_stack = alloc_stack(4);
    let df_stack = alloc_stack(5);

    // Create a TSS with per-CPU stacks. It is leaked so the TSS descriptor
    // in this CPU's GDT never points at freed memory.
    let tss = Box::leak(Box::new(TaskStateSegment::new()));
    tss.privilege_stack_table[0] = kernel_stack.top();
    tss.interrupt_stack_table[DOUBLE_FAULT_IST_INDEX] = df_stack.top();

    // Create GDT with this CPU's TSS. Storing and loading the GDT and
    // selectors is elided in this sketch.
    let (_gdt, _selectors) = create_gdt(&*tss);

    // Allocate and populate PerCpu struct
    let per_cpu = Box::leak(Box::new(PerCpu {
        self_ptr: core::ptr::null(),  // filled below
        kernel_rsp: kernel_stack.top().as_u64(),
        user_rsp: 0,
        current_thread: None,
        cpu_id,
        lapic_id,
    }));
    per_cpu.self_ptr = per_cpu as *const PerCpu;
    per_cpu
}

LAPIC Initialization

Stage 5 uses the 8254 PIT (100 Hz) and 8259A PIC (IRQ0 → vector 32) for preemption on the BSP. AP startup must initialize enough local-APIC state for secondary CPUs to park in a kernel idle loop and for later IPIs. Migrating BSP preemption from the PIT to the LAPIC timer is a prerequisite for multi-CPU scheduling, since the PIT is a single shared device that cannot provide per-CPU timer interrupts; step 6 of the Migration Path below records that migration as done for the BSP. LAPIC work is needed for:

  • Per-CPU timer – replace PIT with LAPIC timer (required for SMP)
  • IPI – inter-processor interrupts for TLB shootdown and AP startup
  • Spurious interrupt vector – must be configured per-CPU

2026-04-25 research decision: the immediate Phase C LAPIC/IPI foundation uses xAPIC MMIO, LAPIC timer vector 48, IPI vector 49, LAPIC EOI, AP LAPIC initialization, and PIT/PIC fallback. The grounding note x2APIC and APIC virtualization records the checked Intel and QEMU/KVM sources and keeps x2APIC as a later backend rather than a reason to rework the current LAPIC gate.

Crate Dependencies

| Crate | Purpose | no_std |
| --- | --- | --- |
| manual xAPIC MMIO backend | current LAPIC timer, EOI, IPI, spurious vector foundation | yes |
| future manual x2APIC MSR backend using x86_64 MSR access | newer/high-core systems and firmware states where xAPIC is unavailable or undesirable | yes |

The current LAPIC path uses xAPIC MMIO through the kernel MMIO mapper. The later x2APIC backend should still be small and explicit rather than adding an APIC abstraction crate: read the APIC ID, enable x2APIC through IA32_APIC_BASE, program the spurious-vector register, local-vector timer, timer divide/initial-count registers, EOI, and ICR sends through MSRs. I/O APIC remains separate MMIO hardware discovered through ACPI MADT and belongs to the later interrupt-infrastructure/cloud path.
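
The register list above maps mechanically from xAPIC MMIO offsets to x2APIC MSR numbers: the architecture places x2APIC registers at MSR 0x800 plus the MMIO offset divided by 16. A minimal sketch of that mapping follows. The constant and function names are chosen for this example and are not capOS identifiers; the numeric values follow the Intel SDM.

```rust
/// IA32_APIC_BASE MSR number and its enable bits.
#[allow(dead_code)]
const IA32_APIC_BASE: u32 = 0x1B;
const APIC_BASE_X2APIC_ENABLE: u64 = 1 << 10;
const APIC_BASE_GLOBAL_ENABLE: u64 = 1 << 11;

/// xAPIC MMIO register offsets for the registers named in the text.
const XAPIC_ID: u32 = 0x020;
const XAPIC_EOI: u32 = 0x0B0;
const XAPIC_SPURIOUS: u32 = 0x0F0;
const XAPIC_ICR_LOW: u32 = 0x300;
const XAPIC_LVT_TIMER: u32 = 0x320;
const XAPIC_TIMER_INITIAL_COUNT: u32 = 0x380;
const XAPIC_TIMER_DIVIDE: u32 = 0x3E0;

/// x2APIC MSR number for the register at `xapic_offset`.
const fn x2apic_msr(xapic_offset: u32) -> u32 {
    0x800 + (xapic_offset >> 4)
}

/// Value to write back to IA32_APIC_BASE to switch on x2APIC mode.
fn with_x2apic_enabled(apic_base: u64) -> u64 {
    apic_base | APIC_BASE_GLOBAL_ENABLE | APIC_BASE_X2APIC_ENABLE
}
```

One difference a backend has to absorb is that x2APIC exposes the ICR as a single 64-bit MSR (0x830) rather than the two 32-bit MMIO words xAPIC uses, so an IPI send becomes one MSR write.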

Migration Path

Phase A was a refactor of existing single-CPU code, not an addition:

  1. Add PerCpu struct, allocate one instance for BSP. Done for BSP static storage.
  2. Set BSP’s KernelGSBase MSR, add swapgs to syscall entry/exit. Done for syscall entry/exit, including syscall-to-iretq exits.
  3. Replace SYSCALL_KERNEL_RSP/SYSCALL_USER_RSP globals with per-CPU accesses. Done; syscall assembly uses GS-relative PerCpu offsets.
  4. Replace scheduler’s global SCHEDULER.current with PerCpu.current_thread. Partially done: the BSP per-CPU record mirrors Scheduler.current; the scheduler lock remains authoritative for current-thread and queue ownership until shared scheduler metadata is split further.
  5. Move GDT/TSS stack updates behind the per-CPU path. Done for the BSP runtime stack-update hook; AP-local GDT/TSS allocation belongs to Phase B.
  6. Migrate BSP from PIT to LAPIC timer (PIT initialized in Stage 5). Done for the BSP timer path, with PIT used for calibration and PIT/PIC retained as a fallback.

After Phase A, the kernel still runs user work on one CPU but the BSP per-CPU plumbing is in place. Existing tests (make run-smoke and make run-spawn) continue to pass.


Phase B: AP Startup

Bring Application Processors (APs) online. Each AP runs the same kernel code with its own per-CPU state.

2026-04-25 grounding checkpoint: the next implementation slice should use the current local limine crate’s MP API, not the older SmpRequest naming used in some protocol examples. In capOS’s pinned crate, limine::request::MpRequest returns architecture-specific limine::mp::MpRespData; x86_64 CPU records are limine::mp::MpInfo values with processor_id, lapic_id, MpInfo::bootstrap(entry, extra_arg), and MpInfo::extra_argument(). The Phase B implementation is split into two checkpoints: first enumerate CPUs, assign dense capOS CPU ids separately from Limine’s ACPI processor_id, and allocate AP state/stack slots; then bind each non-BSP CPU to a slot via extra_arg, start it with bootstrap, and park it in a kernel idle loop after local CPU initialization. Both checkpoints are implemented; APs still must not run userspace or mutate the global scheduler.

Limine MP Request

Limine provides an MP response with per-CPU records. Each x86_64 record contains an ACPI processor id, LAPIC ID, and an atomic boot handoff. In the local limine crate, callers should use MpInfo::bootstrap() rather than writing the raw goto_addr field directly.

use limine::request::MpRequest;

static MP_REQUEST: MpRequest = MpRequest::new(0);

fn start_aps() {
    let mp = MP_REQUEST.response().expect("no MP response");
    // Dense capOS CPU ids: 0 is the BSP, APs count up from 1.
    let mut next_cpu_id = 1;
    for cpu in mp.cpus() {
        if cpu.lapic_id == mp.bsp_lapic_id {
            continue; // skip BSP
        }
        let cpu_id = next_cpu_id;
        next_cpu_id += 1;
        record_boot_processor_id(cpu_id, cpu.processor_id);
        let ap = init_ap_record(cpu_id, cpu.processor_id, cpu.lapic_id);
        cpu.bootstrap(ap_entry, ap as *const ApCpu as u64);
    }
}

AP Entry

Each AP must:

  1. Switch to the capOS kernel PML4 and AP-owned kernel stack
  2. Enable per-CPU CR4 state used by the kernel page tables and user-access guards
  3. Load its per-CPU GDT and TSS
  4. Load the shared IDT
  5. Set KernelGSBase MSR to its PerCpu pointer
  6. Configure SYSCALL MSRs (STAR, LSTAR, SFMASK, EFER.SCE)
  7. Signal “ready” to BSP (atomic flag or counter)
  8. Enter a parked kernel idle loop

Local APIC timer setup and IPI handling remain separate Stage 7 gates; parked APs keep interrupts disabled until that work is ready.

/// AP entry point. Called by Limine with the MP info pointer.
unsafe extern "C" fn ap_entry(info: &limine::mp::MpInfo) -> ! {
    let ap_ptr = info.extra_argument() as *const ApCpu;
    let ap = unsafe {
        ap_ptr
            .as_ref()
            .expect("Limine AP extra_argument must be an ApCpu pointer")
    };
    let per_cpu = ap.per_cpu();

    // Switch from Limine state to capOS-owned paging and AP stack.
    ap.switch_to_kernel_paging_and_stack();

    // Match per-CPU CR4 state after the kernel PML4 is live.
    paging::enable_global_pages_on_current_cpu();
    smap::init();

    // Load this CPU's GDT + TSS
    ap.descriptors.load();

    // Shared IDT (same across all CPUs)
    idt::init();

    // Set GS base for swapgs
    unsafe { wrmsr(IA32_KERNEL_GS_BASE, per_cpu as *const _ as u64); }

    // Configure syscall MSRs (same values as BSP)
    syscall::init_msrs();

    // Signal ready
    ap.online.store(true, Ordering::Release);
    AP_READY_COUNT.fetch_add(1, Ordering::AcqRel);

    // Park until a later scheduler milestone gives APs runnable work.
    ap_idle_loop();
}

The extra_argument pointer must name an initialized, non-null ApCpu record whose storage outlives the AP. The BSP publishes that record before calling MpInfo::bootstrap(), and the AP treats the contained PerCpu pointer as CPU-local state after entry.
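
On the BSP side, the matching step is the bounded wait for the online count. A minimal sketch of such a wait, assuming a shared atomic counter like AP_READY_COUNT; the function name, the spin budget parameter, and the decision to return the observed count are assumptions for illustration.

```rust
use core::sync::atomic::{AtomicU32, Ordering};

/// Poll `online` until it reaches `expected` or `max_spins` polls have
/// elapsed. Returns the last observed count so the caller can continue
/// with fewer CPUs instead of hanging on an AP that never reports in.
fn wait_for_aps(online: &AtomicU32, expected: u32, max_spins: u64) -> u32 {
    // Acquire pairs with the Release/AcqRel updates made by each AP, so
    // the AP's initialization writes are visible once it is counted.
    let mut seen = online.load(Ordering::Acquire);
    let mut spins = 0u64;
    while seen < expected && spins < max_spins {
        core::hint::spin_loop();
        spins += 1;
        seen = online.load(Ordering::Acquire);
    }
    seen
}
```

A spin budget is used here only to keep the example free of timer dependencies; a wait bounded by elapsed time would fit the same shape.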

Scheduler Boundary

Phase B does not extend the Stage 5 scheduler. The BSP remains the only CPU that runs userspace or mutates the global scheduler. APs only run enough kernel initialization to prove that per-CPU architectural state is valid, signal ready, and park in a bounded hlt loop.

Per-CPU WFQ runnable queues under the shared scheduler lock, bounded stealing that chooses the most-overdue runnable sibling candidate, bounded idle-to-runnable wake targeting that walks eligible idle scheduler CPUs, and address-space CPU residency tracking are the current Phase C structure. The temporary 2026-05-02 single-global-runnable-queue collapse is historical; Scheduler Evolution Phase D (closed 2026-05-10) reintroduced per-CPU queues with weighted fair ordering, and Phase E closed SchedulingContext bind/revoke, budget, donation/return, and depletion notification on top of that. Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work placement, the bounded SQPOLL ring mode, the clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake progress, the first automatic nohz activation increment closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md, and SQPOLL-driven auto-nohz activation closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md; timeout-based auto-revoke and ordinary-thread generic full-nohz admission are also landed. Generic SQPOLL nohz for arbitrary rings and policy-service AutoNoHz issuance remain future work. CPU affinity policy, shared scheduler metadata splitting, scheduler-driven AP idle policy, broader workload classes, higher-thread-count evidence, and the named Phase F.5 16/32-core scalability proof remain Phase C/F follow-ups. The first Phase C scheduler proof may continue to use the current process ring while the runtime serializes ring consumption. Full SMP where sibling threads from one process wait independently on different CPUs should use the Ring v2 direction in Ring v2 For Full SMP: cap_enter waits on the current thread’s CQ, not on a shared process CQ.
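
The queue shape described above, per-CPU runnable queues ordered by virtual_finish_ns with bounded stealing, can be sketched in a few lines. The sketch assumes that "most overdue" means the smallest virtual finish time among the inspected sibling heads and that the bound is a cap on how many siblings are inspected. `RunQueue`, `ThreadId`, and `steal` are hypothetical names; the shared scheduler lock and generation-checked ThreadRef values are omitted.

```rust
use std::collections::BTreeSet;

/// Hypothetical thread id standing in for a generation-checked ThreadRef.
type ThreadId = u32;

/// One CPU's runnable queue, ordered by (virtual_finish_ns, thread id).
#[derive(Default)]
struct RunQueue {
    runnable: BTreeSet<(u64, ThreadId)>,
}

impl RunQueue {
    fn enqueue(&mut self, virtual_finish_ns: u64, thread: ThreadId) {
        self.runnable.insert((virtual_finish_ns, thread));
    }

    /// Local pick: the entry with the smallest virtual finish time.
    fn pop_next(&mut self) -> Option<(u64, ThreadId)> {
        self.runnable.pop_first()
    }

    fn peek_next(&self) -> Option<(u64, ThreadId)> {
        self.runnable.first().copied()
    }
}

/// Bounded steal: inspect at most `max_victims` sibling queues, walking
/// upward from the thief's index, and take the head with the smallest
/// virtual finish time.
fn steal(
    queues: &mut [RunQueue],
    thief: usize,
    max_victims: usize,
) -> Option<(u64, ThreadId)> {
    let n = queues.len();
    let mut best: Option<(usize, (u64, ThreadId))> = None;
    for step in 1..n.min(max_victims.saturating_add(1)) {
        let victim = (thief + step) % n;
        if let Some(head) = queues[victim].peek_next() {
            if best.map_or(true, |(_, current)| head < current) {
                best = Some((victim, head));
            }
        }
    }
    let (victim, _) = best?;
    queues[victim].pop_next()
}
```

Capping the number of inspected siblings keeps the cost of an idle CPU's search independent of the total CPU count, which matters once the static four-owner assumption is removed.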

Boot Sequence

BSP: kernel init (GDT, IDT, memory, heap, caps, scheduler)
BSP: init_per_cpu(0, bsp_lapic_id)
BSP: start_aps()
  AP1: ap_entry() → switch CR3/RSP → init GDT/TSS/syscall state → idle_loop()
  AP2: ap_entry() → switch CR3/RSP → init GDT/TSS/syscall state → idle_loop()
  ...
BSP: wait for all APs ready
BSP: load init process, schedule it
BSP: enter scheduler

Phase C: SMP Correctness

With APs parked in kernel idle loops, Phase C makes user scheduling safe on more than one CPU. The order is:

  1. Move syscall entry/exit and per-CPU access to KernelGsBase/swapgs so APs do not use BSP-symbol-relative syscall stack fields. This includes non-sysretq paths that block or exit through scheduler iretq restore. Done for syscall stack fields and syscall-originated restore paths.
  2. Add LAPIC timer and IPI support so each CPU can take local scheduler ticks and receive cross-CPU requests. Done for PIT-calibrated BSP LAPIC ticks, parked-AP LAPIC initialization, spurious-vector handling, vector 49, a bounded vector-49-only fixed IPI send primitive, live TLB shootdown users, and bounded idle-to-runnable reschedule requests.
  3. Add TLB shootdown before any user address space can run on more than one CPU over its lifetime. Done for user page-table map/unmap/protect through resident CPU masks, vector-49 shootdown, pending full-TLB flush generations, completion waits, and syscall-entry/flush-before-user-return hooks. Remote AP targets become active when AP scheduler ownership records AP residency.
  4. Split scheduler current/run-queue ownership into per-CPU state, with a reviewed AP idle-to-runnable handoff. Done for per-CPU current-thread slots, the first AP cpu=1 scheduler owner handoff, temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and bounded idle-to-runnable wake targeting; shared scheduler lock reduction, temporary pinning replacement, broader workload evidence, and higher-thread-count evidence remain deferred.
  5. Prove the existing manifest/ring/thread/park smokes under -smp 2.

With multiple CPUs running scheduler-owned work, shared mutable state needs careful handling.

TLB Shootdown

When the kernel modifies page tables that other CPUs may have cached in their TLBs, it must send an IPI to those CPUs to invalidate the affected entries.

Scenarios requiring shootdown:

  • Process exit – unmapping user pages. Only the CPU running the process has the mapping cached, but if the process migrated recently, stale TLB entries may exist on the old CPU.
  • Shared kernel mappings – changes to the kernel half of page tables (e.g., heap growth, MMIO mappings) require all-CPU shootdown.
  • Capability-granted shared memory – if future stages allow shared memory regions between processes, modifications require targeted shootdown.

Current code uses local mapper flushes in AddressSpace::map, AddressSpace::unmap, and AddressSpace::protect, then calls the serialized shootdown helper with the address space’s resident CPU mask. Those methods are reached from VirtualMemoryCap’s parse_map, parse_unmap, and parse_protect anonymous mapping paths and MemoryObjectCap::{map,unmap,protect} borrowed mapping paths. Scheduler CR3 handoff marks the selected address space resident on the current CPU, including AP cpu=1 during the first AP scheduler-owner proof.

Implementation state consists of vector 49, a resident CPU target mask, and per-CPU pending full-TLB flush generations. The first implementation records pending flush generations for online resident CPUs other than the caller, after the local page-table edit and local flush complete, then sends vector-49 IPIs to prompt immediate drain and returns a completion token. VM capability handlers enqueue completion work after dropping the address-space guard, and cap_enter or timer polling drains the queue after ring dispatch releases cap-table and scratch locks. Handlers reserve fixed-size queue slots before page-table mutation, so overload is reported before rollback, unmap, or protect can mutate state. Drains flush the current CPU before waiting, so a CPU that is itself in the target mask cannot wait on its own pending generation. A target CPU that is already in a syscall and contending on those same locks can eventually reach the IPI or return-path drain. If a target CPU has maskable interrupts delayed while it runs a kernel path, it still drains its pending generation at syscall entry or before returning to userspace from syscall, timer, or scheduler restore paths.

fn shootdown_page(resident_cpu_mask: u64) -> ShootdownCompletion {
    // Target online resident CPUs other than the caller.
    let targets = resident_cpu_mask & online_cpu_mask() & !current_cpu_bit();
    let generation = next_shootdown_generation();
    // Walk the set bits of the target mask, lowest CPU id first.
    let mut remaining = targets;
    while remaining != 0 {
        let cpu_id = remaining.trailing_zeros() as usize;
        remaining &= remaining - 1; // clear the lowest set bit
        PENDING_FLUSH_GENERATION[cpu_id].store(generation, Ordering::Release);
        lapic::send_fixed_ipi(lapic_id_for_cpu(cpu_id));
    }
    ShootdownCompletion { targets, generation }
}

fn flush_pending_for_current_cpu() {
    let cpu_id = current_cpu_id();
    // Loop in case a newer generation is recorded while this CPU is flushing.
    while pending_generation(cpu_id) != flushed_generation(cpu_id) {
        let generation = pending_generation(cpu_id);
        x86_64::instructions::tlb::flush_all();
        FLUSHED_GENERATION[cpu_id].store(generation, Ordering::Release);
    }
}

The first implementation targets the address space’s resident CPU mask rather than every online CPU so parked APs with interrupts disabled are not disturbed. It relies on kernel user-buffer access continuing through address-space-locked HHDM copy/read helpers rather than raw user virtual addresses while a delayed flush generation exists. Broader range and page-level coalescing can be added after AP scheduling exists.
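The generation handshake described above can be modeled on a host with ordinary atomics. The following is a minimal sketch, not the kernel implementation: `MAX_CPUS`, `ShootdownModel`, `ShootdownToken`, `request`, `drain`, `is_complete`, and `flushes` are hypothetical names, the vector-49 IPI is omitted, and the full-TLB flush is replaced by a counter.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical CPU limit for this model.
const MAX_CPUS: usize = 4;

/// Host-side model of per-CPU pending/flushed flush generations.
pub struct ShootdownModel {
    next_generation: AtomicU64,
    pending: [AtomicU64; MAX_CPUS],
    flushed: [AtomicU64; MAX_CPUS],
    flush_count: [AtomicU64; MAX_CPUS],
}

/// Token handed back to the requester; completion is checked against it later.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ShootdownToken {
    pub targets: u64,
    pub generation: u64,
}

impl ShootdownModel {
    pub fn new() -> Self {
        Self {
            next_generation: AtomicU64::new(1),
            pending: std::array::from_fn(|_| AtomicU64::new(0)),
            flushed: std::array::from_fn(|_| AtomicU64::new(0)),
            flush_count: std::array::from_fn(|_| AtomicU64::new(0)),
        }
    }

    /// Record a pending generation on every online resident CPU except the caller.
    pub fn request(&self, resident_mask: u64, online_mask: u64, caller_cpu: usize) -> ShootdownToken {
        let modeled = (1u64 << MAX_CPUS) - 1;
        let targets = resident_mask & online_mask & modeled & !(1u64 << caller_cpu);
        let generation = self.next_generation.fetch_add(1, Ordering::Relaxed);
        let mut remaining = targets;
        while remaining != 0 {
            let cpu = remaining.trailing_zeros() as usize;
            remaining &= remaining - 1; // clear the lowest set bit
            // fetch_max keeps the pending generation monotonic if requests race.
            self.pending[cpu].fetch_max(generation, Ordering::Release);
            // A kernel would send the shootdown IPI to this CPU here.
        }
        ShootdownToken { targets, generation }
    }

    /// Run on the target CPU: flush until flushed catches up with pending.
    pub fn drain(&self, cpu: usize) {
        loop {
            let pending = self.pending[cpu].load(Ordering::Acquire);
            if pending == self.flushed[cpu].load(Ordering::Acquire) {
                break;
            }
            // Stands in for a full TLB flush.
            self.flush_count[cpu].fetch_add(1, Ordering::Relaxed);
            self.flushed[cpu].store(pending, Ordering::Release);
        }
    }

    /// True once every target has flushed at least the token's generation.
    pub fn is_complete(&self, token: ShootdownToken) -> bool {
        let mut remaining = token.targets;
        while remaining != 0 {
            let cpu = remaining.trailing_zeros() as usize;
            remaining &= remaining - 1;
            if self.flushed[cpu].load(Ordering::Acquire) < token.generation {
                return false;
            }
        }
        true
    }

    pub fn flushes(&self, cpu: usize) -> u64 {
        self.flush_count[cpu].load(Ordering::Relaxed)
    }
}
```

In this model a request against resident mask `0b0111` from CPU 0 targets CPUs 1 and 2 only, and the token completes once both have drained. A second drain on an already-current CPU performs no extra flush, which is the property that makes syscall-entry and return-path drains cheap when nothing is pending.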

LAPIC/IPI Boundary

The normal timer path is now local-APIC-backed: vector 48 handles scheduler ticks with LAPIC EOI after PIT-channel-2 calibration, vector 49 handles TLB shootdown and bounded idle-to-runnable reschedule requests, vector 255 handles LAPIC spurious interrupts without EOI, and vector 32 remains only for the PIT/PIC fallback. AP scheduler owners program their LAPIC timers from the BSP calibration before entering the scheduler-owner loop; if AP timer setup is unavailable, the BSP keeps scheduler ownership. The bounded idle-to-runnable wake request path is already implemented. The remaining LAPIC/IPI work is broader scheduler-driven AP idle policy, future preemptive reschedule policy, and a later x2APIC MSR backend once the architectural xAPIC MMIO path is correct.

The TLB shootdown IPI handler must not allocate and must not take locks that can be held while sending a shootdown. Completion waits must happen after dropping the mutated address space’s lock and ring dispatch’s cap-table/scratch locks. The deferred completion queue must remain bounded, non-allocating at enqueue, and reserved before page-table mutation. Syscall-entry and user-return paths must drain pending flush generations so delayed maskable IPI delivery cannot leave a target CPU unable to observe completion or resume a thread with stale TLB state.
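The "bounded, non-allocating, reserved before mutation" rule can be illustrated with a fixed-capacity queue that separates reservation from commit. This is a minimal sketch under assumptions: `QUEUE_SLOTS`, `CompletionQueue`, `PendingCompletion`, `Reservation`, and `QueueFull` are hypothetical names, and the real queue depth and locking are not stated in this proposal.

```rust
/// Hypothetical capacity; the real queue depth is not given in the text.
const QUEUE_SLOTS: usize = 4;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PendingCompletion {
    pub targets: u64,
    pub generation: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub struct QueueFull;

/// Proof that a slot was reserved; consumed by `commit` or `cancel`.
/// Dropping it without either call leaks the slot in this sketch.
pub struct Reservation(());

/// Fixed-capacity FIFO with inline storage, so enqueue never allocates.
pub struct CompletionQueue {
    slots: [Option<PendingCompletion>; QUEUE_SLOTS],
    head: usize,     // index of the oldest committed entry
    len: usize,      // committed entries
    reserved: usize, // slots promised but not yet committed
}

impl CompletionQueue {
    pub const fn new() -> Self {
        Self { slots: [None; QUEUE_SLOTS], head: 0, len: 0, reserved: 0 }
    }

    /// Call before touching page tables, so overload is reported while
    /// nothing has been mutated yet.
    pub fn reserve(&mut self) -> Result<Reservation, QueueFull> {
        if self.len + self.reserved >= QUEUE_SLOTS {
            return Err(QueueFull);
        }
        self.reserved += 1;
        Ok(Reservation(()))
    }

    /// Call after the mutation and local flush; cannot fail because the
    /// slot was already set aside.
    pub fn commit(&mut self, _slot: Reservation, completion: PendingCompletion) {
        let tail = (self.head + self.len) % QUEUE_SLOTS;
        self.slots[tail] = Some(completion);
        self.len += 1;
        self.reserved -= 1;
    }

    /// Release a reservation when the mutation turned out to need no shootdown.
    pub fn cancel(&mut self, _slot: Reservation) {
        self.reserved -= 1;
    }

    /// Take the oldest committed completion for the drain path to wait on.
    pub fn pop(&mut self) -> Option<PendingCompletion> {
        if self.len == 0 {
            return None;
        }
        let completion = self.slots[self.head].take();
        self.head = (self.head + 1) % QUEUE_SLOTS;
        self.len -= 1;
        completion
    }
}
```

Splitting reserve from commit means the only fallible step happens before any page-table state changes, so a handler never has to undo a mutation because the queue was full.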

KVM paravirtual features such as kvm-pv-eoi, kvm-pv-ipi, and kvm-pv-tlb-flush are future performance work. They must not be required for the first LAPIC timer, IPI, or TLB-shootdown correctness proofs.

Lock Audit

Existing spinlocks need review for SMP safety:

| Lock | Current Use | SMP Concern |
| --- | --- | --- |
| SERIAL | COM1 output | Safe but high contention if many CPUs print. Acceptable for debug output. |
| ALLOCATOR | Frame bitmap | Hot path. Holding lock during full bitmap scan is O(n). Consider per-CPU free lists. |
| KERNEL_CAPS | Kernel cap table | Low contention (init only). Safe. |
| SCHEDULER.current | Single global running-thread slot | Split into PerCpu.current_thread in Phase A. |

Before APs can run userspace, the scheduler also needs an explicit CPU residency record for each live thread or address space. That record drives TLB shootdown targeting and prevents migration from racing page-table changes. Process exit and thread exit must clear residency before freeing stacks, address spaces, or ring state that another CPU might still observe.
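One possible shape for such a residency record is a per-address-space CPU bitmask. The sketch below is illustrative only: `Residency`, `mark_resident`, `clear_resident`, `shootdown_targets`, and `safe_to_free` are hypothetical names, and the point at which a CPU may clear its bit is an assumption, not something this proposal specifies.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical residency record: bit N set means CPU N may still hold
/// TLB entries or live references for this address space.
pub struct Residency {
    mask: AtomicU64,
}

impl Residency {
    pub const fn new() -> Self {
        Self { mask: AtomicU64::new(0) }
    }

    /// Called when a CPU switches to this address space (for example at CR3 handoff).
    pub fn mark_resident(&self, cpu: u32) {
        self.mask.fetch_or(1u64 << cpu, Ordering::AcqRel);
    }

    /// Assumed to be called once the CPU has switched away and flushed.
    pub fn clear_resident(&self, cpu: u32) {
        self.mask.fetch_and(!(1u64 << cpu), Ordering::AcqRel);
    }

    /// CPUs that need a shootdown: resident, online, and not the caller.
    pub fn shootdown_targets(&self, online_mask: u64, caller_cpu: u32) -> u64 {
        self.mask.load(Ordering::Acquire) & online_mask & !(1u64 << caller_cpu)
    }

    /// Exit paths would check this before freeing stacks or page tables.
    pub fn safe_to_free(&self) -> bool {
        self.mask.load(Ordering::Acquire) == 0
    }
}
```

With CPUs 0 and 1 marked resident, a mutation issued from CPU 0 targets only CPU 1, and teardown is allowed only after both bits are cleared.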

Interrupt + spinlock deadlock: if CPU A holds a spinlock and takes an interrupt whose handler tries to acquire the same lock, that CPU deadlocks. This is already noted in REVIEW.md. Fix: disable interrupts while holding locks that interrupt handlers may need (frame allocator, serial). The spin crate's Mutex does not mask interrupts on its own, so this needs an interrupt-masking lock wrapper or manual cli/sti wrappers.
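The interrupt-masking pattern can be sketched as a lock wrapper whose guard restores the previous interrupt state on drop. This is a host-side model under stated assumptions: `IrqLock`, `IrqGuard`, and `interrupts_enabled` are hypothetical names, a static flag stands in for the CPU interrupt flag (a kernel would use cli/sti), and `std::sync::Mutex` stands in for the kernel spinlock.

```rust
use std::ops::{Deref, DerefMut};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Mutex, MutexGuard};

/// Stand-in for the CPU interrupt flag; a kernel would use cli/sti instead.
static INTERRUPTS_ENABLED: AtomicBool = AtomicBool::new(true);

pub fn interrupts_enabled() -> bool {
    INTERRUPTS_ENABLED.load(Ordering::SeqCst)
}

/// Re-enables interrupts on drop, but only if they were enabled before.
struct IrqRestore {
    were_enabled: bool,
}

impl Drop for IrqRestore {
    fn drop(&mut self) {
        if self.were_enabled {
            INTERRUPTS_ENABLED.store(true, Ordering::SeqCst);
        }
    }
}

pub struct IrqLock<T> {
    inner: Mutex<T>,
}

pub struct IrqGuard<'a, T> {
    // Field order matters: Rust drops fields in declaration order, so the
    // lock is released before interrupts are restored.
    guard: MutexGuard<'a, T>,
    _restore: IrqRestore,
}

impl<T> IrqLock<T> {
    pub fn new(value: T) -> Self {
        Self { inner: Mutex::new(value) }
    }

    pub fn lock(&self) -> IrqGuard<'_, T> {
        // Mask interrupts first so no handler can run on this CPU while
        // the lock is held.
        let were_enabled = INTERRUPTS_ENABLED.swap(false, Ordering::SeqCst);
        let restore = IrqRestore { were_enabled };
        let guard = self.inner.lock().unwrap_or_else(|e| e.into_inner());
        IrqGuard { guard, _restore: restore }
    }
}

impl<T> Deref for IrqGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.guard
    }
}

impl<T> DerefMut for IrqGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        &mut self.guard
    }
}
```

Saving and restoring the previous state, rather than unconditionally re-enabling, keeps nested locks correct: an inner guard taken while interrupts are already masked leaves them masked when it drops.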

Allocator Scaling

The frame allocator is behind a single spinlock with an O(n) bitmap scan. Under SMP, this becomes a contention bottleneck.

Options (in order of complexity):

  1. Per-CPU free list cache – each CPU maintains a small cache of free frames (e.g., 64 frames). Refill from the global allocator when empty, return a batch when full. With a 64-frame cache this reduces global lock acquisitions by up to ~64x.
  2. Region partitioning – divide physical memory into per-CPU regions. Each CPU owns a bitmap partition. Cross-CPU allocation falls back to a global lock. More complex, better NUMA behavior (future).

Option 1 is recommended for initial SMP; it is estimated at roughly 50-100 lines of code.
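Option 1 can be sketched as follows. This is a host-side model, not the capOS allocator: `GlobalFrames`, `CpuFrameCache`, and `lock_acquisitions` are hypothetical names, a `Vec` free list stands in for the bitmap, and returning half the cache when full is an assumed policy (the text only says a batch is returned). A kernel version would use a fixed-size array rather than `Vec`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Cache size from the example in the text; the real value is a tuning choice.
const CACHE_FRAMES: usize = 64;

/// Stand-in for the global bitmap allocator behind a single lock.
pub struct GlobalFrames {
    free: Mutex<Vec<u64>>,
    lock_acquisitions: AtomicUsize,
}

impl GlobalFrames {
    pub fn new(frame_count: u64) -> Self {
        Self {
            free: Mutex::new((0..frame_count).rev().collect()),
            lock_acquisitions: AtomicUsize::new(0),
        }
    }

    /// Move up to `max` frames into `out` under one lock acquisition.
    fn take_batch(&self, out: &mut Vec<u64>, max: usize) {
        self.lock_acquisitions.fetch_add(1, Ordering::Relaxed);
        let mut free = self.free.lock().unwrap();
        let start = free.len() - max.min(free.len());
        out.extend(free.drain(start..));
    }

    /// Move up to `count` frames from `frames` back under one lock acquisition.
    fn return_batch(&self, frames: &mut Vec<u64>, count: usize) {
        self.lock_acquisitions.fetch_add(1, Ordering::Relaxed);
        let mut free = self.free.lock().unwrap();
        let start = frames.len() - count.min(frames.len());
        free.extend(frames.drain(start..));
    }

    pub fn lock_acquisitions(&self) -> usize {
        self.lock_acquisitions.load(Ordering::Relaxed)
    }
}

/// Per-CPU cache; owned by one CPU, so it needs no lock of its own.
pub struct CpuFrameCache {
    frames: Vec<u64>,
}

impl CpuFrameCache {
    pub fn new() -> Self {
        Self { frames: Vec::with_capacity(CACHE_FRAMES) }
    }

    /// Allocate from the local cache, refilling in one batch when empty.
    pub fn alloc(&mut self, global: &GlobalFrames) -> Option<u64> {
        if self.frames.is_empty() {
            global.take_batch(&mut self.frames, CACHE_FRAMES);
        }
        self.frames.pop()
    }

    /// Free to the local cache, returning half of it when full (assumed policy).
    pub fn free(&mut self, global: &GlobalFrames, frame: u64) {
        if self.frames.len() == CACHE_FRAMES {
            global.return_batch(&mut self.frames, CACHE_FRAMES / 2);
        }
        self.frames.push(frame);
    }
}
```

In this model, 128 allocations from one CPU take the global lock twice instead of 128 times. Returning only half the cache on overflow avoids ping-ponging with the global allocator when a CPU alternates between allocating and freeing near the full mark.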

The heap allocator (linked_list_allocator) is also behind a single lock. For a research OS this is acceptable initially – heap allocations in the kernel should be infrequent compared to frame allocations.


Cap’n Proto Schema Additions

SMP introduces a kernel-internal CpuManager capability for inspecting and controlling CPU state. This is not exposed to userspace initially but follows the “everything is a capability” principle.

interface CpuManager {
    # Number of online CPUs.
    cpuCount @0 () -> (count :UInt32);

    # Per-CPU info.
    cpuInfo @1 (cpuId :UInt32) -> (lapicId :UInt32, online :Bool);
}

This capability would be held by init (or a system monitor process) for diagnostics. It’s additive and can be deferred until the mechanism is useful.


Estimated Scope

| Phase | New/Changed Code | Depends On |
| --- | --- | --- |
| Phase A: BSP per-CPU foundation | Done (BSP PerCpu, syscall-stack storage, scheduler mirror, stack-update hook) | Stage 5 |
| Phase B: AP startup | Done (MpRequest, AP records/stacks, AP CR3/RSP handoff, parked idle) | Phase A |
| Phase C: Multi-CPU scheduling | In progress (GS/swapgs migration, LAPIC timer/IPI with EOI, shootdown-aware VM mutation wrappers, pending TLB generation completion, per-CPU current slots, temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and bounded idle-to-runnable wake targeting are implemented; shared scheduler lock reduction, temporary pinning replacement, scheduler-driven AP idle policy, broader workload evidence, and higher-thread-count evidence remain open) | Phase B |
| Ring v2 for full SMP | TBD (per-thread rings, completion routing, SQPOLL ownership) | Phase C plus threading/park |
| Total | TBD after Phase C hardware/scheduler audit | |

Milestones

  • M1: Per-CPU data on BSP – BSP PerCpu syscall-stack/current-thread state, BSP per-CPU kernel-entry stack hook, and single-CPU QEMU proofs. Done.
  • M2: APs running – secondary CPUs reach idle_loop(). BSP prints “N CPUs online”. make run still runs init on BSP. Done.
  • M3: TLB shootdown – page table modifications are safe across CPUs. Process exit on one CPU doesn’t leave stale mappings on others. Done for address-space resident masks and AP cpu=1 residency marking.
  • M4: Multi-CPU scheduling – processes can run on any CPU. The existing boot-manifest service set still works, but the scheduler distributes work across CPUs once runnable processes are available (runtime spawning still depends on ProcessSpawner). Temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and bounded idle-to-runnable wake targeting are implemented; shared scheduler lock reduction, temporary pinning replacement, scheduler-driven AP idle policy, broader workload evidence, and higher-thread-count evidence remain open.
  • M5: Ring v2 completion ownership – every live thread can own a ring endpoint; endpoint, timer, park, process-wait, and thread-join completions route by ThreadRef. This is the target for full SMP where sibling threads in one process wait independently on different CPUs.

Open Questions

  1. x2APIC backend. Phase C currently has an xAPIC MMIO LAPIC foundation. A later x2APIC MSR backend is still needed for newer/high-core systems and firmware states where xAPIC is unavailable or locked out; it should not block TLB shootdown on the current implementation path.

  2. Idle strategy. hlt is the simplest idle. mwait (paired with monitor) can be more power-efficient and wakes on a write to the monitored address range. Overkill for QEMU, but worth noting for future hardware targets.

  3. CPU hotplug. Limine starts all CPUs at boot. Runtime CPU online/offline is a future concern, not needed initially.

  4. NUMA awareness. Multi-socket systems have non-uniform memory access. Per-CPU frame allocator regions could be NUMA-aware. Deferred – QEMU emulates flat memory by default.

  5. Scheduler policy. The current multi-CPU scheduler uses per-CPU WFQ runnable queues ordered by virtual_finish_ns under the shared scheduler lock, with bounded stealing from sibling queues when a CPU has no local runnable entry. Scheduler Evolution Phase D (per-CPU WFQ and bounded stealing, closed 2026-05-10) and Phase E (SchedulingContext bind/revoke, budget, donation/return, depletion notification) are closed against this substrate; Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work placement, the bounded SQPOLL ring mode, the clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake progress; the first automatic nohz activation increment and SQPOLL-driven auto-nohz activation are both closed (see docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md and docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md). The older round-robin/global-overflow starting point is historical, not the current baseline. Future refinements are shared-lock reduction, temporary pinning replacement, stronger CPU-affinity/admission policy, broader workload-class evidence, higher-thread-count evidence, and the Phase F.5 full-SMP 16/32-core scalability proof.


References

Specifications

Limine

Virtualization

Prior Art

  • Redox SMP – per-CPU contexts, LAPIC timer, IPI-based TLB shootdown
  • xv6-riscv SMP – minimal multi-core OS, clean per-CPU implementation
  • Hermit SMP – Rust unikernel with SMP support via per-core data and APIC
  • BlogOS – educational x86_64 Rust OS (single-CPU, but good APIC coverage)