
Proposal: Symmetric Multi-Processing (SMP)

How capOS goes from single-CPU execution to utilizing all available processors.

This document has three phases: a per-CPU foundation (prerequisite plumbing), AP startup (bringing secondary CPUs online), and SMP correctness (making shared state safe under concurrency).

Depends on: Stage 5 (Scheduling) – needs a working timer, context switch, and run queue on the BSP before adding more CPUs.

Can proceed in parallel with: Stage 6 (IPC and Capability Transfer).


Current State

Everything is single-CPU. Specific assumptions that SMP breaks:

Component | File | Assumption
Syscall stack switching | kernel/src/arch/x86_64/syscall.rs | Global SYSCALL_KERNEL_RSP / SYSCALL_USER_RSP statics
GDT, TSS, kernel stacks | kernel/src/arch/x86_64/gdt.rs | One static GDT, one TSS, one kernel stack, one double-fault stack
IDT | kernel/src/arch/x86_64/idt.rs | Single static IDT (shareable – the IDT can be the same across CPUs)
SYSCALL MSRs | kernel/src/arch/x86_64/syscall.rs | STAR/LSTAR/SFMASK/EFER set once on BSP only
Current process | kernel/src/sched.rs | SCHEDULER with BTreeMap<Pid, Process> + current: Option<Pid> — single global behind Mutex
Frame allocator | kernel/src/mem/frame.rs | Single global ALLOCATOR behind one spinlock
Heap allocator | kernel/src/mem/heap.rs | linked_list_allocator behind one spinlock

The comment in syscall.rs:12 already anticipates the fix: “Will be replaced by per-CPU data (swapgs) for SMP.”


Phase A: Per-CPU Foundation

Establish per-CPU data structures on the BSP. No APs are started yet – this phase makes the BSP’s own code SMP-ready so Phase B is a clean addition.

Per-CPU Data Region

Each CPU needs a private data area accessible via the GS segment base. On x86_64, swapgs switches between user-mode GS (usually zero) and kernel-mode GS (pointing to per-CPU data). The kernel sets KernelGSBase MSR on each CPU during init.

/// Per-CPU data, one instance per processor.
/// Accessed via GS-relative addressing after swapgs.
#[repr(C)]
struct PerCpu {
    /// Self-pointer for accessing the struct from GS:0.
    self_ptr: *const PerCpu,
    /// Kernel stack pointer for syscall entry (replaces SYSCALL_KERNEL_RSP).
    kernel_rsp: u64,
    /// Saved user RSP during syscall (replaces SYSCALL_USER_RSP).
    user_rsp: u64,
    /// CPU index (0 = BSP).
    cpu_id: u32,
    /// LAPIC ID (from Limine SMP info or CPUID).
    lapic_id: u32,
    /// Pointer to the currently running process on this CPU.
    current_process: *mut Process,
}

The syscall entry stub changes from:

movq %rsp, SYSCALL_USER_RSP(%rip)
movq SYSCALL_KERNEL_RSP(%rip), %rsp

to:

swapgs
movq %rsp, %gs:16          # PerCpu.user_rsp
movq %gs:8, %rsp           # PerCpu.kernel_rsp

And symmetrically on return:

movq %gs:16, %rsp          # restore user RSP
swapgs
sysretq
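Because the stub hard-codes %gs:8 and %gs:16, the PerCpu field offsets must never drift. A small host-side sketch (the struct trimmed to the fields the stub touches) can pin the layout down with `core::mem::offset_of!`:

```rust
use core::mem::offset_of;

// Trimmed copy of PerCpu with just the fields the entry stub touches.
#[repr(C)]
struct PerCpu {
    self_ptr: *const PerCpu, // GS:0
    kernel_rsp: u64,         // GS:8
    user_rsp: u64,           // GS:16
    cpu_id: u32,
    lapic_id: u32,
}

fn main() {
    // These must match the %gs:8 / %gs:16 operands in the syscall stub.
    assert_eq!(offset_of!(PerCpu, kernel_rsp), 8);
    assert_eq!(offset_of!(PerCpu, user_rsp), 16);
    println!("offsets ok");
}
```

A `const` assertion of the same form in the kernel would catch any accidental reordering of the struct at compile time.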

Per-CPU GDT, TSS, and Stacks

Each CPU needs its own:

  • GDT – the TSS descriptor encodes a physical pointer to the CPU’s TSS, so each CPU needs a GDT with its own TSS entry. The segment layout (kernel CS/DS, user CS/DS) is identical across CPUs.
  • TSS – privilege_stack_table[0] (kernel stack for interrupts from Ring 3) and IST entries (double-fault stack) must be per-CPU.
  • Kernel stack – each CPU needs its own stack for syscall/interrupt handling. Current size: 16 KB (4 pages). Same size per CPU.
  • Double-fault stack – each CPU needs its own IST stack. Current size: 20 KB (5 pages).
/// Allocate and initialize per-CPU structures for one CPU.
fn init_per_cpu(cpu_id: u32, lapic_id: u32) -> &'static PerCpu {
    // Allocate kernel stack (4 pages) and double-fault stack (5 pages)
    let kernel_stack = alloc_stack(4);
    let df_stack = alloc_stack(5);

    // Create TSS with per-CPU stacks. Leaked to 'static: the GDT's TSS
    // descriptor must point at memory that outlives this function.
    let tss = Box::leak(Box::new(TaskStateSegment::new()));
    tss.privilege_stack_table[0] = kernel_stack.top();
    tss.interrupt_stack_table[DOUBLE_FAULT_IST_INDEX] = df_stack.top();

    // Create a GDT with this CPU's TSS (kept alongside the PerCpu so
    // APs can load it in ap_entry)
    let (gdt, selectors) = create_gdt(tss);

    // Allocate and populate the PerCpu struct
    let per_cpu = Box::leak(Box::new(PerCpu {
        self_ptr: core::ptr::null(),  // filled below
        kernel_rsp: kernel_stack.top().as_u64(),
        user_rsp: 0,
        cpu_id,
        lapic_id,
        current_process: core::ptr::null_mut(),
    }));
    per_cpu.self_ptr = per_cpu as *const PerCpu;
    per_cpu
}

LAPIC Initialization

Stage 5 uses the 8254 PIT (100 Hz) and 8259A PIC (IRQ0 → vector 32) for preemption on the BSP. The PIT is a single shared device and cannot provide per-CPU timer interrupts, so Phase A migrates the BSP from the PIT to the LAPIC timer before any APs come online. The full LAPIC setup is needed for:

  • Per-CPU timer – replace PIT with LAPIC timer (required for SMP)
  • IPI – inter-processor interrupts for TLB shootdown and AP startup
  • Spurious interrupt vector – must be configured per-CPU
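For the per-CPU timer, the programming step reduces to simple arithmetic: after calibrating the LAPIC timer frequency against a known clock (e.g., one PIT interval), the initial-count register value follows from the target tick rate. A sketch, with an illustrative helper name rather than an existing capOS API:

```rust
/// Compute the LAPIC timer initial-count register value for a desired
/// tick rate, given the measured timer frequency and divide configuration.
/// (Illustrative helper; not an existing capOS API.)
fn lapic_initial_count(timer_freq_hz: u64, divider: u64, target_hz: u64) -> u64 {
    timer_freq_hz / divider / target_hz
}

fn main() {
    // e.g., a 1 GHz timer clock, divide-by-16, 100 Hz tick
    // (matching Stage 5's PIT rate)
    let count = lapic_initial_count(1_000_000_000, 16, 100);
    assert_eq!(count, 625_000);
    println!("initial count: {count}");
}
```

Each CPU calibrates and programs its own LAPIC timer during init, so all CPUs get independent preemption ticks.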

Crate Dependencies

Crate | Purpose | no_std
x2apic or manual MMIO | LAPIC/IOAPIC access | yes

The x86_64 crate (already a dependency) provides MSR access. LAPIC register access can use the existing HHDM mapping for MMIO, or the x2apic crate for the MSR-based interface.

Migration Path

Phase A is a refactor of existing single-CPU code, not an addition:

  1. Add PerCpu struct, allocate one instance for BSP
  2. Set BSP’s KernelGSBase MSR, add swapgs to syscall entry/exit
  3. Replace SYSCALL_KERNEL_RSP/SYSCALL_USER_RSP globals with GS-relative accesses
  4. Replace scheduler’s global SCHEDULER.current with PerCpu.current_process
  5. Move GDT/TSS creation into init_per_cpu(), call it for BSP
  6. Migrate BSP from PIT to LAPIC timer (PIT initialized in Stage 5)

After Phase A, the kernel still runs on one CPU but the per-CPU plumbing is in place. Existing tests (make run) continue to pass.


Phase B: AP Startup

Bring Application Processors (APs) online. Each AP runs the same kernel code with its own per-CPU state.

Limine SMP Request

Limine provides an SMP response with per-CPU records. Each record contains the LAPIC ID and a goto_address field – writing a function pointer there starts the AP at that address.

use limine::request::SmpRequest;

#[used]
#[unsafe(link_section = ".requests")]
static SMP_REQUEST: SmpRequest = SmpRequest::new();

fn start_aps() {
    let smp = SMP_REQUEST.get_response().expect("no SMP response");
    for cpu in smp.cpus() {
        if cpu.lapic_id == smp.bsp_lapic_id {
            continue; // skip BSP
        }
        // Stash the result (e.g., in a static table keyed by LAPIC ID)
        // so ap_entry can retrieve it on the AP.
        let _per_cpu = init_per_cpu(cpu.id, cpu.lapic_id);
        // Limine starts the AP at ap_entry with the cpu info pointer
        cpu.goto_address.write(ap_entry);
    }
}

AP Entry

Each AP must:

  1. Load its per-CPU GDT and TSS
  2. Load the shared IDT
  3. Set KernelGSBase MSR to its PerCpu pointer
  4. Configure SYSCALL MSRs (STAR, LSTAR, SFMASK, EFER.SCE)
  5. Initialize its LAPIC (enable, set timer, set spurious vector)
  6. Signal “ready” to BSP (atomic flag or counter)
  7. Enter the scheduler idle loop
/// AP entry point. Called by Limine with the SMP info pointer.
unsafe extern "C" fn ap_entry(info: &limine::smp::Cpu) -> ! {
    let per_cpu = /* retrieve PerCpu for this LAPIC ID (info.lapic_id) */;

    // Load this CPU's GDT + TSS (assumes the PerCpu also carries
    // references to its GDT and selectors, set up in init_per_cpu)
    per_cpu.gdt.load();
    unsafe { load_tss(per_cpu.selectors.tss); }

    // Shared IDT (same across all CPUs)
    idt::load();

    // Set GS base for swapgs
    unsafe { wrmsr(IA32_KERNEL_GS_BASE, per_cpu as *const _ as u64); }

    // Configure syscall MSRs (same values as BSP)
    syscall::init_msrs();

    // Initialize local APIC
    lapic::init_local();

    // Signal ready
    AP_READY_COUNT.fetch_add(1, Ordering::Release);

    // Enter scheduler
    scheduler::idle_loop();
}
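Step 4 programs the same MSR values on every CPU. As one concrete piece, the STAR layout can be computed and sanity-checked on the host; the selector values below are illustrative, not capOS's actual GDT layout:

```rust
/// Pack the STAR MSR: bits 47:32 select the kernel CS loaded on SYSCALL
/// (SS = CS + 8); bits 63:48 are the base SYSRET uses (64-bit CS = base + 16,
/// SS = base + 8).
fn star_value(syscall_cs: u16, sysret_base: u16) -> u64 {
    ((sysret_base as u64) << 48) | ((syscall_cs as u64) << 32)
}

fn main() {
    // Illustrative selectors: kernel CS = 0x08, sysret base = 0x10
    let star = star_value(0x08, 0x10);
    assert_eq!(star, 0x0010_0008_0000_0000);
    println!("STAR = {star:#018x}");
}
```

Because APs run the same `syscall::init_msrs()` as the BSP, a mismatch here would show up as a fault only on whichever CPU first services a syscall, so sharing one init path is worth the refactor.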

Scheduler Integration

Stage 5 establishes a single run queue. Phase B extends it:

  • Per-CPU run queues – each CPU pulls work from its local queue. Avoids global lock contention on the scheduler hot path.
  • Global overflow queue – when a CPU’s local queue is empty, it steals from the global queue (or from other CPUs’ queues).
  • CPU affinity – optional, not needed initially. All processes are eligible to run on any CPU.
  • Idle loop – when no work is available, hlt until the next timer interrupt or IPI.

The Process struct gains a cpu field indicating which CPU it’s currently running on (or None if queued).
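The queueing policy above can be modeled with plain collections; a host-side sketch (not the kernel's actual scheduler types):

```rust
use std::collections::VecDeque;

/// Host-side model of the per-CPU queue + global overflow policy.
struct RunQueues {
    local: Vec<VecDeque<u32>>, // one PID queue per CPU
    global: VecDeque<u32>,     // overflow / newly runnable work
}

impl RunQueues {
    /// Pick the next PID for `cpu`: local queue first, then steal
    /// from the global queue.
    fn pick_next(&mut self, cpu: usize) -> Option<u32> {
        self.local[cpu].pop_front().or_else(|| self.global.pop_front())
    }
}

fn main() {
    let mut rq = RunQueues {
        local: vec![VecDeque::from([1]), VecDeque::new()],
        global: VecDeque::from([7]),
    };
    assert_eq!(rq.pick_next(0), Some(1)); // CPU 0 uses its local queue
    assert_eq!(rq.pick_next(1), Some(7)); // CPU 1 steals from global
    assert_eq!(rq.pick_next(1), None);    // nothing left: CPU 1 idles (hlt)
    println!("ok");
}
```

In the kernel each local queue sits behind its own lock (or is lock-free for the owning CPU), so the hot path never touches the global lock while work is available locally.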

Boot Sequence

BSP: kernel init (GDT, IDT, memory, heap, caps, scheduler)
BSP: init_per_cpu(0, bsp_lapic_id)
BSP: start_aps()
  AP1: ap_entry() → init GDT/TSS/LAPIC → idle_loop()
  AP2: ap_entry() → init GDT/TSS/LAPIC → idle_loop()
  ...
BSP: wait for all APs ready
BSP: load init process, schedule it
BSP: enter scheduler

Phase C: SMP Correctness

With multiple CPUs running, shared mutable state needs careful handling.

TLB Shootdown

When the kernel modifies page tables that other CPUs may have cached in their TLBs, it must send an IPI to those CPUs to invalidate the affected entries.

Scenarios requiring shootdown:

  • Process exit – unmapping user pages. Only the CPU running the process has the mapping cached, but if the process migrated recently, stale TLB entries may exist on the old CPU.
  • Shared kernel mappings – changes to the kernel half of page tables (e.g., heap growth, MMIO mappings) require all-CPU shootdown.
  • Capability-granted shared memory – if future stages allow shared memory regions between processes, modifications require targeted shootdown.

Implementation: IPI vector + bitmap of target CPUs + invlpg on each target. Linux uses a more sophisticated batching scheme, but a simple broadcast IPI with single-page invlpg is sufficient initially.

/// Flush a TLB entry on all CPUs except the caller.
/// (A lock must serialize initiators, and SHOOTDOWN_ACK must be reset
/// before each broadcast — omitted here for brevity.)
fn tlb_shootdown(addr: VirtAddr) {
    // Record the address to flush
    SHOOTDOWN_ADDR.store(addr.as_u64(), Ordering::Release);

    // Send IPI to all other CPUs
    lapic::send_ipi_all_excluding_self(TLB_SHOOTDOWN_VECTOR);

    // Wait for all CPUs to acknowledge
    wait_for_shootdown_ack();
}

/// IPI handler on the receiving CPU.
fn handle_tlb_shootdown_ipi() {
    let addr = VirtAddr::new(SHOOTDOWN_ADDR.load(Ordering::Acquire));
    x86_64::instructions::tlb::flush(addr);
    SHOOTDOWN_ACK.fetch_add(1, Ordering::Release);
}

Lock Audit

Existing spinlocks need review for SMP safety:

Lock | Current Use | SMP Concern
SERIAL | COM1 output | Safe but high contention if many CPUs print. Acceptable for debug output.
ALLOCATOR | Frame bitmap | Hot path. Holding the lock during a full bitmap scan is O(n). Consider per-CPU free lists.
KERNEL_CAPS | Kernel cap table | Low contention (init only). Safe.
SCHEDULER.current | Single global running-process slot | Split into PerCpu.current_process in Phase A.

Interrupt + spinlock deadlock: if CPU A holds a spinlock and takes an interrupt whose handler tries to acquire the same lock, it deadlocks. This is already noted in REVIEW.md. Fix: disable interrupts while holding locks that interrupt handlers may need (frame allocator, serial), either via an interrupt-safe mutex wrapper that saves and restores RFLAGS.IF around the critical section, or via manual cli/sti wrappers.
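The cli/sti-wrapper option is the usual RAII guard pattern. A sketch with the interrupt flag mocked as a thread-local so it runs on the host (IrqGuard and the mock flag are illustrative, not existing capOS types):

```rust
use std::cell::Cell;

// Mock of RFLAGS.IF so the guard pattern is host-testable.
thread_local! { static IF: Cell<bool> = Cell::new(true); }

/// Disables interrupts on construction, restores the prior state on drop.
struct IrqGuard { was_enabled: bool }

impl IrqGuard {
    fn new() -> Self {
        // kernel version: read RFLAGS.IF, then cli
        IrqGuard { was_enabled: IF.with(|f| f.replace(false)) }
    }
}

impl Drop for IrqGuard {
    fn drop(&mut self) {
        // kernel version: sti only if interrupts were enabled before
        if self.was_enabled { IF.with(|f| f.set(true)); }
    }
}

fn main() {
    assert!(IF.with(|f| f.get()));
    {
        let _guard = IrqGuard::new();
        // critical section: an interrupt handler cannot preempt us here
        assert!(!IF.with(|f| f.get()));
    }
    assert!(IF.with(|f| f.get())); // restored on drop
    println!("irq guard ok");
}
```

Saving the prior state (rather than unconditionally re-enabling) makes the guard safe to nest, which matters once the frame allocator is called from paths that already hold an IRQ-disabling lock.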

Allocator Scaling

The frame allocator is behind a single spinlock with O(n) bitmap scan. Under SMP, this becomes a contention bottleneck.

Options (in order of complexity):

  1. Per-CPU free list cache – each CPU maintains a small cache of free frames (e.g., 64 frames). Refill from the global allocator when empty, return batch when full. Reduces lock acquisitions by ~64x.
  2. Region partitioning – divide physical memory into per-CPU regions. Each CPU owns a bitmap partition. Cross-CPU allocation falls back to a global lock. More complex, better NUMA behavior (future).

Option 1 is recommended for initial SMP. ~50-100 lines.
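Option 1's refill/return logic fits in a few lines; a host-side sketch with the global allocator mocked as a Vec (all names illustrative):

```rust
/// Host-side model of a per-CPU frame cache (Option 1).
struct FrameCache {
    frames: Vec<u64>, // cached free frame numbers
    batch: usize,     // refill/return granularity, e.g. 64
}

impl FrameCache {
    /// Allocate from the cache, refilling a whole batch from the global
    /// allocator (one lock acquisition) when the cache is empty.
    fn alloc(&mut self, global: &mut Vec<u64>) -> Option<u64> {
        if self.frames.is_empty() {
            let take = self.batch.min(global.len());
            self.frames.extend(global.drain(..take));
        }
        self.frames.pop()
    }

    /// Free into the cache, returning a batch to the global allocator
    /// once the cache grows past two batches.
    fn free(&mut self, frame: u64, global: &mut Vec<u64>) {
        self.frames.push(frame);
        if self.frames.len() > 2 * self.batch {
            global.extend(self.frames.drain(..self.batch));
        }
    }
}

fn main() {
    let mut global: Vec<u64> = (0..128).collect();
    let mut cache = FrameCache { frames: Vec::new(), batch: 64 };
    let f = cache.alloc(&mut global).unwrap();
    assert_eq!(global.len(), 64);       // one refill drained a whole batch
    assert_eq!(cache.frames.len(), 63); // the rest stays cached locally
    cache.free(f, &mut global);
    println!("ok");
}
```

In the kernel, `global` becomes the existing spinlocked bitmap allocator with batch alloc/free entry points; the cache itself needs no lock because it lives in per-CPU data.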

The heap allocator (linked_list_allocator) is also behind a single lock. For a research OS this is acceptable initially – heap allocations in the kernel should be infrequent compared to frame allocations.


Cap’n Proto Schema Additions

SMP introduces a kernel-internal CpuManager capability for inspecting and controlling CPU state. This is not exposed to userspace initially but follows the “everything is a capability” principle.

interface CpuManager {
    # Number of online CPUs.
    cpuCount @0 () -> (count :UInt32);

    # Per-CPU info.
    cpuInfo @1 (cpuId :UInt32) -> (lapicId :UInt32, online :Bool);
}

This capability would be held by init (or a system monitor process) for diagnostics. It’s additive and can be deferred until the mechanism is useful.


Estimated Scope

Phase | New/Changed Code | Depends On
Phase A: Per-CPU foundation | ~300-400 lines (PerCpu struct, swapgs migration, per-CPU GDT/TSS) | Stage 5
Phase B: AP startup | ~200-300 lines (SmpRequest, ap_entry, scheduler extension) | Phase A
Phase C: SMP correctness | ~200-300 lines (TLB shootdown, allocator cache, lock audit) | Phase B
Total | ~700-1000 lines |

Milestones

  • M1: Per-CPU data on BSP – swapgs-based syscall entry, per-CPU GDT/TSS, global current-process state split. Single CPU still. make run passes.
  • M2: APs running – secondary CPUs reach idle_loop(). BSP prints “N CPUs online”. make run still runs init on BSP.
  • M3: Multi-CPU scheduling – processes can run on any CPU. The existing boot-manifest service set still works, but the scheduler distributes work across CPUs once runnable processes are available (runtime spawning still depends on ProcessSpawner).
  • M4: TLB shootdown – page table modifications are safe across CPUs. Process exit on one CPU doesn’t leave stale mappings on others.

Open Questions

  1. LAPIC vs x2APIC. Modern hardware supports x2APIC (MSR-based, no MMIO). Should we require x2APIC, support both, or start with xAPIC? QEMU supports both. x2APIC is simpler (no MMIO mapping needed).

  2. Idle strategy. hlt is the simplest idle. mwait is more power-efficient and can be used to wake on memory writes. Overkill for QEMU, but worth noting for future hardware targets.

  3. CPU hotplug. Limine starts all CPUs at boot. Runtime CPU online/offline is a future concern, not needed initially.

  4. NUMA awareness. Multi-socket systems have non-uniform memory access. Per-CPU frame allocator regions could be NUMA-aware. Deferred – QEMU emulates flat memory by default.

  5. Scheduler policy. Round-robin per-CPU queues with global overflow is the simplest starting point. Work stealing, priority scheduling, and CPU affinity are future refinements.


References

Specifications

  • Limine boot protocol (SMP request/response)

Prior Art

  • Redox SMP – per-CPU contexts, LAPIC timer, IPI-based TLB shootdown
  • xv6-riscv SMP – minimal multi-core OS, clean per-CPU implementation
  • Hermit SMP – Rust unikernel with SMP support via per-core data and APIC
  • BlogOS – educational x86_64 Rust OS (single-CPU, but good APIC coverage)