
Proposal: Ring v2 For Full SMP

How capOS should evolve the capability ring once multiple threads from one process can run concurrently on multiple CPUs.

The current ring design is intentionally process-wide: one ring page per process, one SQ, one CQ, and one blocked cap_enter waiter admitted per process. That was the right first threading milestone because it preserved the existing transport while moving scheduler identity from process ids to generation-checked ThreadRef values.

That design can support an initial multi-CPU scheduler proof if the runtime continues to serialize process-ring consumption. It should not be the endpoint for full SMP where sibling threads from one process run and wait on different CPUs. A single process CQ forces those sibling threads to coordinate completion consumption in userspace and keeps the kernel from knowing which thread should block for which CQ stream. The full-SMP target is per-thread ring ownership.

Design Grounding

The local research files checked before this design were:

  • docs/research/completion-ring-threading.md;
  • docs/research/out-of-kernel-scheduling.md;
  • docs/research/llvm-target.md;
  • docs/research/sel4.md;
  • docs/research/zircon.md.

The relevant result is that efficient shared rings want clear producer/consumer ownership. Linux io_uring uses user_data to identify requests, but its aggregate wait model does not by itself solve multiple user consumers waiting on one raw CQ. Futexes provide the right user-runtime parking primitive for compatibility demux. Windows IOCP is a shared completion packet queue model, which is useful as a runtime abstraction but should not be confused with letting several kernel-blocked threads wait on the same circular CQ storage.

Target Model

Each live process thread owns one capability ring endpoint. A ring endpoint is a complete SQ/CQ pair with one userspace-visible identity; it may be mapped as one page per thread or as a lane in a larger ring bundle, but a lane is not just a CQ attached to a shared process SQ.

Each endpoint has:

  • one userspace SQ/CQ pair;
  • one kernel RingScratch or equivalent dispatch scratch owned by that thread or by the ring endpoint;
  • one blocked cap_enter waiter for that thread’s CQ;
  • one ring address passed to the thread at startup.
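The four endpoint components above can be sketched as a single kernel-side record. This is an illustrative sketch only: SqRing, CqRing, and ThreadRingEndpoint are assumed names, not existing capOS types, and RingScratch stands in for whatever dispatch scratch the kernel actually keeps.

```rust
// Illustrative per-thread ring endpoint record; all type names here are
// assumptions, not current kernel definitions.
struct SqRing { head: u32, tail: u32 }
struct CqRing { head: u32, tail: u32 }
struct RingScratch { buf: [u8; 64] }

struct ThreadRingEndpoint {
    sq: SqRing,           // userspace-visible submission ring
    cq: CqRing,           // userspace-visible completion ring
    scratch: RingScratch, // kernel dispatch scratch owned by this endpoint
    waiter_parked: bool,  // at most one blocked cap_enter waiter per CQ
    ring_addr: u64,       // user virtual address passed at thread start
}
```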

The process remains the authority boundary. Address space, cap table, CapSet, and resource accounting stay process-owned. Result-cap transfers still install capabilities into the process cap table. Per-thread rings only split transport progress and completion ownership.

cap_enter(min_complete, timeout_ns) keeps its current syscall shape, but the meaning becomes:

Process pending SQEs for the current thread’s ring, then block the current thread until at least min_complete CQEs are available on that same thread’s CQ, or until timeout.

Userspace still matches individual requests by user_data within the current thread’s CQ. The kernel does not add slot-specific waits; CQ slots are storage, not durable request identities.
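The per-thread cap_enter contract can be modeled with an in-memory stand-in for the ring. This is a toy sketch, not kernel code: a VecDeque replaces the shared circular buffer, every SQE completes immediately instead of going through real dispatch, and "timeout" is a plain None return instead of blocking.

```rust
use std::collections::VecDeque;

// Toy model of the per-thread cap_enter contract: drain pending SQEs from
// this thread's ring, then report whether at least `min_complete` CQEs are
// available on this same thread's CQ.
struct ThreadRing {
    pending_sqes: VecDeque<u64>, // user_data values awaiting submission
    cq: VecDeque<u64>,           // completed user_data values
}

impl ThreadRing {
    // Returns the number of available CQEs, or None where a real kernel
    // would block the calling thread until timeout_ns elapses.
    fn cap_enter(&mut self, min_complete: usize) -> Option<usize> {
        // Submission step: in this toy model every SQE completes at once.
        while let Some(user_data) = self.pending_sqes.pop_front() {
            self.cq.push_back(user_data);
        }
        if self.cq.len() >= min_complete {
            Some(self.cq.len())
        } else {
            None
        }
    }
}
```

Userspace then walks the returned CQEs and matches them to requests by user_data, exactly as on the process-wide ring today.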

Thread Creation And Bootstrap

The initial thread may keep the legacy fixed RING_VADDR mapping during the transition. Additional threads need unique ring mappings because all threads share one address space.

ThreadSpawner.create should grow in one of two ways, both reviewed:

  1. kernel chooses a free ring virtual address and passes it in the child start registers; or
  2. runtime reserves a user virtual address range and supplies the desired ring address to ThreadSpawner.create.

The first option is simpler for early SMP. The second option gives language runtimes tighter arena control and can follow once VirtualMemory reservation semantics are richer.

The child thread entry contract should continue to pass bootstrap register values equivalent to:

  • RDI = arg;
  • RSI = tid;
  • RDX = pid;
  • RCX = thread_ring_addr;
  • R8 = CAPSET_VADDR, or zero if absent.
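Because the System V AMD64 ABI passes the first five integer arguments in RDI, RSI, RDX, RCX, and R8, the register contract above maps onto an ordinary five-argument extern "C" entry. The signature below is illustrative, not the kernel's actual entry type; a real entry would hand these values to the runtime and never return, so it returns them here only to make the contract visible.

```rust
// Hypothetical bootstrap-argument record; #[repr(C)] keeps the extern "C"
// return FFI-safe for this demonstration.
#[repr(C)]
#[derive(Debug, PartialEq)]
struct ThreadBootArgs { arg: u64, tid: u64, pid: u64, ring_addr: u64, capset: u64 }

extern "C" fn thread_entry(
    arg: u64,              // RDI
    tid: u64,              // RSI
    pid: u64,              // RDX
    thread_ring_addr: u64, // RCX
    capset_vaddr: u64,     // R8, zero if absent
) -> ThreadBootArgs {
    ThreadBootArgs { arg, tid, pid, ring_addr: thread_ring_addr, capset: capset_vaddr }
}
```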

For the initial process thread, _start keeps receiving the ring address from the loader ABI. Once every userspace binary uses the runtime-provided ring address instead of assuming RING_VADDR, the fixed mapping can become a bootstrap-only compatibility detail.

When Ring v2 introduces versioned SQE/CQE layouts, the register-level ring address handoff becomes one field of the negotiated runtime boot record:

#[repr(C)]
struct RuntimeBootInfo {
    /// User virtual address of this thread's ring mapping.
    ring_addr: u64,
    /// Negotiated ring ABI version.
    ring_abi_version: u32,
    /// SQE/CQE entry sizes that cap_enter validates.
    sqe_size: u16,
    cqe_size: u16,
}

RuntimeBootInfo, ring ABI version constants, and fixed SQE/CQE layouts live in capos-config/src/ring.rs. Kernel code and capos-rt must import the shared definition instead of maintaining parallel boot-ABI structs.

The Tickless/Realtime proposal owns the first CapSqeV2 use case (deadline_ns, qos_flags, and sched_ctx_id), but Ring v2 owns the transport rule: every thread ring handoff must carry or imply the same ABI version and entry sizes that cap_enter validates. A runtime must not infer CapSqeV2 from the address alone.
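The validation rule can be sketched against the RuntimeBootInfo fields above. The version and size constants here are assumptions for illustration; the real values would come from capos-config/src/ring.rs.

```rust
// Assumed constants; the real definitions live in capos-config/src/ring.rs.
const RING_ABI_V2: u32 = 2;
const SQE_V2_SIZE: u16 = 64;
const CQE_V2_SIZE: u16 = 32;

struct RuntimeBootInfo { ring_addr: u64, ring_abi_version: u32, sqe_size: u16, cqe_size: u16 }

// cap_enter-side check: the handoff must carry the expected ABI version and
// entry sizes explicitly; the ring address alone implies nothing.
fn validate_ring_abi(info: &RuntimeBootInfo) -> Result<(), &'static str> {
    if info.ring_abi_version != RING_ABI_V2 {
        return Err("ring ABI version mismatch");
    }
    if info.sqe_size != SQE_V2_SIZE || info.cqe_size != CQE_V2_SIZE {
        return Err("SQE/CQE size mismatch");
    }
    Ok(())
}
```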

Completion Routing

Any kernel record that can later post a CQE must store a target ThreadRef and post to that thread’s ring after generation validation:

  • ordinary CALL completions target the submitting thread;
  • endpoint RECV completions target the receiver thread;
  • endpoint RETURN completions target the original caller thread;
  • Timer.sleep completions target the sleeping thread;
  • ProcessHandle.wait completions target the waiting thread;
  • ThreadHandle.join completions target the joining thread;
  • ParkSpace wait wake/timeout completions target the waiting thread;
  • deferred endpoint cancellation completions target the thread that posted the cancelable operation.
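The generation-validation step shared by all of these paths can be sketched as follows. ThreadSlot and the slot-array shape are assumptions; the point is only that a completion carries the generation its target had when the operation was armed, and posts nothing if the slot has since been reused.

```rust
// A ThreadRef names a thread slot plus the generation observed when the
// operation was armed.
#[derive(Clone, Copy, PartialEq)]
struct ThreadRef { slot: usize, generation: u64 }

// Illustrative slot record; `live` is false once the thread has exited.
struct ThreadSlot { generation: u64, cq: Vec<u64>, live: bool }

// Post a CQE to the target thread's ring only if the generation still
// matches; stale completions for exited or reused slots are discarded.
fn post_cqe(slots: &mut [ThreadSlot], target: ThreadRef, user_data: u64) -> bool {
    let s = &mut slots[target.slot];
    if s.live && s.generation == target.generation {
        s.cq.push(user_data);
        true
    } else {
        false
    }
}
```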

Process exit cancels every ring owned by the process. Thread exit cancels that thread’s own ring operations and wakes/drops waiters that name its ThreadRef. If a thread exits with outstanding operations that can still complete, the kernel must either cancel them before releasing the ring or hold the ring record until all generation-checked completion paths drain.

Normative lifetime invariant: a ring record cannot be freed while any CPU, waiter, endpoint call, timer waiter, park waiter, cancellation path, deferred completion path, or SQPOLL worker can still post to it. Thread exit either cancels every such record first or keeps the ring record alive until all generation-checked completion paths have drained.
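One way to picture the invariant is reference counting, sketched here with Arc as a stand-in for a kernel refcount: every path that may still post holds a reference, and the ring record's storage is only released when the last such reference drops. This is an analogy, not the kernel's actual mechanism.

```rust
use std::sync::{Arc, Mutex};

// Stand-in ring record; Arc's strong count models "paths that can still
// post to this ring".
struct RingRecord { cq: Mutex<Vec<u64>> }

// Arming a completion path takes a reference that keeps the record alive.
fn arm_completion(ring: &Arc<RingRecord>) -> Arc<RingRecord> {
    Arc::clone(ring)
}

// Posting consumes the path's reference; the record can only be freed
// once every armed path has posted or been cancelled.
fn post_and_drop(holder: Arc<RingRecord>, user_data: u64) {
    holder.cq.lock().unwrap().push(user_data);
}
```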

SQPOLL And Kernel Consumers

Each thread ring must have exactly one kernel SQ consumer at a time:

  • syscall mode: the owner thread’s cap_enter drains its own SQ;
  • SQPOLL mode: a kernel worker drains that ring’s SQ, and cap_enter waits for CQ availability and returns counts. Userspace remains the CQ consumer.

Mode changes require quiescing the ring so cap_enter and SQPOLL do not both consume the same SQ. SQPOLL workers should be bound through scheduler policy or future CPU grants after APs run kernel idle loops and per-CPU scheduling exists.

Timer interrupt polling may continue to process bounded interrupt-safe work for the current thread’s ring in syscall mode, but it must not become a second SQ consumer for an SQPOLL-owned ring.
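The single-consumer rule and the quiescing requirement can be sketched as a mode field guarded by an in-flight drain count. The names and the counter are illustrative assumptions, not existing kernel state.

```rust
// Exactly one kernel SQ consumer per ring at a time.
#[derive(PartialEq, Debug)]
enum SqConsumer { OwnerSyscall, SqpollWorker }

// Illustrative ring mode record; `in_flight_drains` counts drains that
// still own the SQ.
struct RingMode { consumer: SqConsumer, in_flight_drains: u32 }

// A mode change requires a quiesced ring so cap_enter and an SQPOLL
// worker never consume the same SQ concurrently.
fn switch_mode(ring: &mut RingMode, next: SqConsumer) -> Result<(), &'static str> {
    if ring.in_flight_drains != 0 {
        return Err("ring not quiesced");
    }
    ring.consumer = next;
    Ok(())
}
```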

Full-nohz for SQPOLL is a later CPU-isolation contract, not part of initial Ring v2. A poller CPU may suppress the periodic scheduler tick only when:

  • a housekeeping CPU remains online;
  • the SQPOLL worker is the only runnable entity on that CPU;
  • no timer-side SQ polling or transitional network scheduler polling is pinned there;
  • CPU accounting is boundary/counter driven rather than tick-driven.

The broader staging is in Tickless and Realtime Scheduling.

Scheduler And SMP Requirements

Per-thread rings are not sufficient for full SMP by themselves. Multi-CPU userspace scheduling also requires:

  • per-CPU current-thread state as the scheduler authority, not only a BSP mirror;
  • per-CPU run queues plus a migration/work-stealing protocol;
  • a current-CPU field for runnable/running threads plus an address-space active-CPU mask, or equivalent target set, for TLB shootdown;
  • TLB shootdown before a thread can migrate or two threads in one address space can run on different CPUs while mappings change;
  • cap-table locking or finer object locks that tolerate concurrent calls from sibling threads;
  • address-space locking rules for concurrent VirtualMemory operations, process exit, and user-buffer copy paths;
  • process and thread ring cleanup that cannot free a ring while another CPU is posting a completion to it.
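The address-space active-CPU mask in the list above can be sketched with an atomic bitmask: each CPU sets its bit while running a thread of the address space and clears it when it switches away, giving the TLB-shootdown path its IPI target set. This is an illustrative sketch, not the proposed kernel data structure.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative address-space record holding the active-CPU mask.
struct AddressSpace { active_cpus: AtomicU64 }

impl AddressSpace {
    // Called when a CPU starts running a thread of this address space.
    fn enter(&self, cpu: u32) {
        self.active_cpus.fetch_or(1u64 << cpu, Ordering::AcqRel);
    }
    // Called when the CPU switches to a thread of another address space.
    fn leave(&self, cpu: u32) {
        self.active_cpus.fetch_and(!(1u64 << cpu), Ordering::AcqRel);
    }
    // Shootdown target set: every active CPU except the one already
    // flushing its own TLB locally.
    fn shootdown_targets(&self, current_cpu: u32) -> u64 {
        self.active_cpus.load(Ordering::Acquire) & !(1u64 << current_cpu)
    }
}
```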

The first Phase C multi-CPU scheduler smoke may keep the current process ring if the runtime still serializes process-ring consumption. A later full-SMP smoke that runs sibling threads from one process concurrently on different CPUs should wait for per-thread ring completion routing and TLB shootdown review.

Compatibility Bridge

Before Ring v2, capos-rt can support multithreaded programs on the current process ring with a runtime reactor:

  • one runtime-owned waiter drains the process CQ;
  • ordinary client threads block on runtime wait records using ParkSpace;
  • the reactor matches CQEs by user_data and unparks the waiting thread.

This is a bridge, not the final SMP ABI. It is useful for validating runtime logic and higher-level language support before kernel per-thread rings land.
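The reactor's demux step can be sketched in userspace terms, with Condvar standing in for the kernel ParkSpace primitive. The Reactor and WaitRecord names are illustrative, not capos-rt API.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

// One wait record per outstanding request; Condvar models ParkSpace.
struct WaitRecord { done: Mutex<Option<i64>>, cv: Condvar }

// The single runtime-owned CQ drainer routes completions by user_data.
struct Reactor { waiters: Mutex<HashMap<u64, Arc<WaitRecord>>> }

impl Reactor {
    // A client thread registers before submitting, then parks on the record.
    fn register(&self, user_data: u64) -> Arc<WaitRecord> {
        let rec = Arc::new(WaitRecord { done: Mutex::new(None), cv: Condvar::new() });
        self.waiters.lock().unwrap().insert(user_data, rec.clone());
        rec
    }
    // Called by the CQ-draining thread for each CQE it consumes.
    fn complete(&self, user_data: u64, result: i64) {
        if let Some(rec) = self.waiters.lock().unwrap().remove(&user_data) {
            *rec.done.lock().unwrap() = Some(result);
            rec.cv.notify_one(); // unpark the waiting client thread
        }
    }
}

// Client-side park: block until the reactor posts this record's result.
fn wait_on(rec: &WaitRecord) -> i64 {
    let mut g = rec.done.lock().unwrap();
    while g.is_none() {
        g = rec.cv.wait(g).unwrap();
    }
    g.unwrap()
}
```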

Rejected Direction: Slot-Specific cap_enter

Do not extend cap_enter to wait for raw CQ slots. Slots are circular-buffer storage and can be reused after cq_head advances. A correct specific-wait design would need stable request ids or completion tokens, at which point per-thread ring endpoints solve the same ownership problem with less special-case kernel state.

Roadmap

  1. Runtime reactor bridge on the current process ring.
  2. Ring allocation/accounting moved from process-only state to thread-owned ring records, while preserving the initial fixed bootstrap ring.
  3. ThreadSpawner.create allocates/maps a per-thread ring and passes its user address to the child.
  4. Scheduler waiters and endpoint/timer/park/process/thread completion paths post by target ThreadRef to that thread’s ring.
  5. cap_enter operates on the current thread’s ring; remove the one-process-ring waiter rule.
  6. Add SQPOLL mode only after per-CPU scheduler state exists.
  7. Add SQPOLL nohz only after CPU isolation leases, housekeeping placement, non-tick CPU accounting, and network polling placement are reviewed.
  8. Run full-SMP sibling-thread workloads that wait independently on different CPUs only after per-thread ring routing, TLB shootdown, and cross-CPU cleanup rules are reviewed.