Proposal: Ring v2 For Full SMP
How capOS should evolve the capability ring once multiple threads from one process can run concurrently on multiple CPUs.
The current ring design is intentionally process-wide: one ring page per
process, one SQ, one CQ, and one blocked cap_enter waiter admitted per
process. That was the right first threading milestone because it preserved the
existing transport while moving scheduler identity from process ids to
generation-checked ThreadRef values.
That design can support an initial multi-CPU scheduler proof if the runtime continues to serialize process-ring consumption. It should not be the endpoint for full SMP where sibling threads from one process run and wait on different CPUs. A single process CQ forces those sibling threads to coordinate completion consumption in userspace and keeps the kernel from knowing which thread should block for which CQ stream. The full-SMP target is per-thread ring ownership.
Design Grounding
The local research files checked before this design were:
- docs/research/completion-ring-threading.md
- docs/research/out-of-kernel-scheduling.md
- docs/research/llvm-target.md
- docs/research/sel4.md
- docs/research/zircon.md
The relevant result is that efficient shared rings want clear producer/consumer
ownership. Linux io_uring uses user_data to identify requests, but its
aggregate wait model does not by itself solve multiple user consumers waiting
on one raw CQ. Futexes provide the right user-runtime parking primitive for
compatibility demux. Windows IOCP is a shared completion packet queue model,
which is useful as a runtime abstraction but should not be confused with
letting several kernel-blocked threads wait on the same circular CQ storage.
Target Model
Each live process thread owns one capability ring endpoint. A ring endpoint is a complete SQ/CQ pair with one userspace-visible identity; it may be mapped as one page per thread or as a lane in a larger ring bundle, but a lane is not just a CQ attached to a shared process SQ.
Each endpoint has:
- one userspace SQ/CQ pair;
- one kernel RingScratch or equivalent dispatch scratch, owned by that thread or by the ring endpoint;
- one blocked cap_enter waiter for that thread’s CQ;
- one ring address passed to the thread at startup.
The process remains the authority boundary. Address space, cap table, CapSet, and resource accounting stay process-owned. Result-cap transfers still install capabilities into the process cap table. Per-thread rings only split transport progress and completion ownership.
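The split can be sketched as kernel-side records; every type and field name below is hypothetical, chosen only to illustrate which state stays process-owned and which moves to the thread:

```rust
/// Generation-checked thread identity, as used by the scheduler.
/// All names in this sketch are illustrative, not capOS's real types.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct ThreadRef {
    id: u32,
    generation: u32,
}

/// Per-thread transport state: one SQ/CQ pair plus dispatch scratch.
struct RingEndpoint {
    owner: ThreadRef,
    ring_vaddr: u64, // userspace mapping of this thread's ring page
    sq_head: u32,
    sq_tail: u32,
    cq_head: u32,
    cq_tail: u32,
    // the per-thread dispatch scratch would live here as well
}

/// Process-owned authority state; no transport progress lives here.
struct Process {
    // address space, cap table, CapSet, resource accounting ...
    rings: Vec<RingEndpoint>, // one endpoint per live thread
}
```

The point of the sketch is the boundary: result-cap installs still go through `Process`, while SQ/CQ head and tail progress is private to one `RingEndpoint`.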
cap_enter(min_complete, timeout_ns) keeps its current syscall shape, but the
meaning becomes:
Process pending SQEs for the current thread’s ring, then block the current thread until at least min_complete CQEs are available on that same thread’s CQ, or until the timeout expires.
Userspace still matches individual requests by user_data within the current
thread’s CQ. The kernel does not add slot-specific waits; CQ slots are storage,
not durable request identities.
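The user_data matching rule can be modeled minimally; the Cqe shape and the pending-request map below are illustrative, not the real ring layout:

```rust
use std::collections::HashMap;

/// Minimal model of a completion entry: slot storage carries only a
/// user_data tag and a result; the slot index itself is not an identity.
#[derive(Clone, Copy)]
struct Cqe {
    user_data: u64,
    result: i64,
}

/// Match drained CQEs against pending requests by user_data, returning
/// (request tag, result) pairs in completion order. Slots can be reused
/// as soon as they are drained; only user_data identifies the request.
fn match_completions(
    pending: &mut HashMap<u64, &'static str>,
    cqes: &[Cqe],
) -> Vec<(&'static str, i64)> {
    cqes.iter()
        .filter_map(|c| pending.remove(&c.user_data).map(|tag| (tag, c.result)))
        .collect()
}
```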
Thread Creation And Bootstrap
The initial thread may keep the legacy fixed RING_VADDR mapping during the
transition. Additional threads need unique ring mappings because all threads
share one address space.
ThreadSpawner.create should gain ring support in one of two reviewed ways:
- kernel chooses a free ring virtual address and passes it in the child start registers; or
- runtime reserves a user virtual address range and supplies the desired ring
address to
ThreadSpawner.create.
The first option is simpler for early SMP. The second option gives language
runtimes tighter arena control and can follow once VirtualMemory reservation
semantics are richer.
The child thread entry contract should continue to pass bootstrap register values equivalent to:
- RDI = arg
- RSI = tid
- RDX = pid
- RCX = thread_ring_addr
- R8 = CAPSET_VADDR, or zero if absent
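Under the System V AMD64 calling convention, the first five integer arguments arrive in exactly RDI, RSI, RDX, RCX, and R8, so a runtime entry stub can receive this contract as a plain extern "C" function. The function below is a hypothetical sketch, not the real capos-rt entry:

```rust
/// Hypothetical child-thread entry matching the bootstrap contract:
/// RDI = arg, RSI = tid, RDX = pid, RCX = thread_ring_addr,
/// R8 = capset_vaddr (0 if absent). System V AMD64 maps these
/// registers to the first five integer parameters, in this order.
extern "C" fn thread_entry(arg: u64, tid: u64, pid: u64, ring_addr: u64, capset_vaddr: u64) -> u64 {
    // Zero means "no CapSet mapped" per the contract above.
    let _capset = if capset_vaddr == 0 { None } else { Some(capset_vaddr) };
    // A real entry would record tid/pid and initialize thread-local
    // ring state from ring_addr before running the closure named by arg.
    let _ = (arg, tid, pid);
    ring_addr // returned here only so the sketch is testable
}
```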
For the initial process thread, _start keeps receiving the ring address from
the loader ABI. Once every userspace binary uses the runtime-provided ring
address instead of assuming RING_VADDR, the fixed mapping can become a
bootstrap-only compatibility detail.
When Ring v2 introduces versioned SQE/CQE layouts, the register-level ring address handoff becomes one field of the negotiated runtime boot record:
```rust
struct RuntimeBootInfo {
    ring_addr: u64,
    ring_abi_version: u32,
    sqe_size: u16,
    cqe_size: u16,
}
```
RuntimeBootInfo, ring ABI version constants, and fixed SQE/CQE layouts live in
capos-config/src/ring.rs. Kernel code and capos-rt must import the shared
definition instead of maintaining parallel boot-ABI structs.
The Tickless/Realtime proposal owns the first CapSqeV2 use case
(deadline_ns, qos_flags, and sched_ctx_id), but Ring v2 owns the transport
rule: every thread ring handoff must carry or imply the same ABI version and
entry sizes that cap_enter validates. A runtime must not infer CapSqeV2
from the address alone.
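The validation rule can be sketched as a check cap_enter performs against compiled-in constants before touching the ring. The constant values and the error strings below are placeholders, not capos-config's actual numbers:

```rust
/// Hypothetical compiled-in ring ABI constants, as capos-config
/// would export them; the values here are placeholders.
const RING_ABI_VERSION: u32 = 2;
const SQE_SIZE: u16 = 64;
const CQE_SIZE: u16 = 32;

struct RuntimeBootInfo {
    ring_addr: u64,
    ring_abi_version: u32,
    sqe_size: u16,
    cqe_size: u16,
}

/// Reject a ring whose negotiated layout does not match this kernel's
/// compiled layout; the ring address alone never implies a layout.
fn validate_ring_abi(info: &RuntimeBootInfo) -> Result<(), &'static str> {
    if info.ring_abi_version != RING_ABI_VERSION {
        return Err("ring ABI version mismatch");
    }
    if info.sqe_size != SQE_SIZE || info.cqe_size != CQE_SIZE {
        return Err("SQE/CQE size mismatch");
    }
    Ok(())
}
```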
Completion Routing
Any kernel record that can later post a CQE must store a target ThreadRef and
post to that thread’s ring after generation validation:
- ordinary CALL completions target the submitting thread;
- endpoint RECV completions target the receiver thread;
- endpoint RETURN completions target the original caller thread;
- Timer.sleep completions target the sleeping thread;
- ProcessHandle.wait completions target the waiting thread;
- ThreadHandle.join completions target the joining thread;
- ParkSpace wait wake/timeout completions target the waiting thread;
- deferred endpoint cancellation completions target the thread that posted the cancelable operation.
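The rule all of these paths share can be sketched as a generation-checked post; the thread table and CQ representation below are illustrative:

```rust
#[derive(Clone, Copy)]
struct ThreadRef {
    id: usize,
    generation: u32,
}

/// Illustrative per-thread slot: slots may be reused after thread exit,
/// with the generation bumped on each reuse.
struct ThreadSlot {
    generation: u32,
    live: bool,
    cq: Vec<u64>, // stand-in for the thread's CQ storage
}

/// Post a completion to the target thread's ring only if the stored
/// ThreadRef still names a live thread of the same generation;
/// otherwise drop it, since the slot may now belong to another thread.
fn post_cqe(table: &mut [ThreadSlot], target: ThreadRef, user_data: u64) -> bool {
    let slot = &mut table[target.id];
    if !slot.live || slot.generation != target.generation {
        return false; // stale ThreadRef: never post into a reused slot
    }
    slot.cq.push(user_data);
    true
}
```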
Process exit cancels every ring owned by the process. Thread exit cancels that
thread’s own ring operations and wakes/drops waiters that name its ThreadRef.
If a thread exits with outstanding operations that can still complete, the
kernel must either cancel them before releasing the ring or hold the ring
record until all generation-checked completion paths drain.
Normative lifetime invariant: a ring record cannot be freed while any CPU, waiter, endpoint call, timer waiter, park waiter, cancellation path, deferred completion path, or SQPOLL worker can still post to it. Thread exit either cancels every such record first or keeps the ring record alive until all generation-checked completion paths have drained.
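One way to enforce this invariant is a poster count on the ring record: every path that may still post holds a registration, and thread exit reclaims the record only once the count drains to zero. The sketch below uses an atomic counter and hypothetical names; capOS's actual cleanup mechanism may differ:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Ring record guarded by a count of paths that may still post to it:
/// blocked waiters, endpoint calls, timer and park waiters, deferred
/// cancellations, SQPOLL workers. All names here are illustrative.
struct RingRecord {
    posters: AtomicUsize,
}

impl RingRecord {
    /// A completion path registers itself before it may post.
    fn acquire_poster(&self) {
        // A real kernel would also refuse here once the ring is marked dying.
        self.posters.fetch_add(1, Ordering::AcqRel);
    }

    /// A completion path drops its registration once it can no longer post.
    fn release_poster(&self) {
        self.posters.fetch_sub(1, Ordering::AcqRel);
    }

    /// Thread exit may free the record only once no poster remains.
    fn can_free(&self) -> bool {
        self.posters.load(Ordering::Acquire) == 0
    }
}
```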
SQPOLL And Kernel Consumers
Each thread ring must have exactly one kernel SQ consumer at a time:
- syscall mode: the owner thread’s cap_enter drains its own SQ;
- SQPOLL mode: a kernel worker drains that ring’s SQ, and cap_enter waits for CQ availability and returns counts. Userspace remains the CQ consumer.
Mode changes require quiescing the ring so cap_enter and SQPOLL do not both
consume the same SQ. SQPOLL workers should be bound through scheduler policy or
future CPU grants after APs run kernel idle loops and per-CPU scheduling exists.
Timer interrupt polling may continue to process bounded interrupt-safe work for the current thread’s ring in syscall mode, but it must not become a second SQ consumer for an SQPOLL-owned ring.
Full-nohz for SQPOLL is a later CPU-isolation contract, not part of initial Ring v2. A poller CPU may suppress the periodic scheduler tick only when a housekeeping CPU remains online, the SQPOLL worker is the only runnable entity on that CPU, no timer-side SQ polling or transitional network scheduler polling is pinned there, and CPU accounting is boundary/counter driven rather than tick-driven. The broader staging is in Tickless and Realtime Scheduling.
Scheduler And SMP Requirements
Per-thread rings are not sufficient for full SMP by themselves. Multi-CPU userspace scheduling also requires:
- per-CPU current-thread state as the scheduler authority, not only a BSP mirror;
- per-CPU run queues plus a migration/work-stealing protocol;
- a current-CPU field for runnable/running threads plus an address-space active-CPU mask, or equivalent target set, for TLB shootdown;
- TLB shootdown before a thread can migrate or two threads in one address space can run on different CPUs while mappings change;
- cap-table locking or finer object locks that tolerate concurrent calls from sibling threads;
- address-space locking rules for concurrent VirtualMemory operations, process exit, and user-buffer copy paths;
- process and thread ring cleanup that cannot free a ring while another CPU is posting a completion to it.
The first Phase C multi-CPU scheduler smoke may keep the current process ring if the runtime still serializes process-ring consumption. A later full-SMP smoke that runs sibling threads from one process concurrently on different CPUs should wait for per-thread ring completion routing and TLB shootdown review.
Compatibility Bridge
Before Ring v2, capos-rt can support multithreaded programs on the current
process ring with a runtime reactor:
- one runtime-owned waiter drains the process CQ;
- ordinary client threads block on runtime wait records using ParkSpace;
- the reactor matches CQEs by user_data and unparks the waiting thread.
This is a bridge, not the final SMP ABI. It is useful for validating runtime logic and higher-level language support before kernel per-thread rings land.
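A minimal model of the bridge, using std::sync::Condvar as a stand-in for ParkSpace parking and an mpsc channel as a stand-in for the process CQ (all names are illustrative):

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Per-request wait record; Condvar stands in for a ParkSpace wait.
struct WaitRecord {
    result: Mutex<Option<i64>>,
    cv: Condvar,
}

/// One reactor drains the shared "CQ" (a channel here), matches by
/// user_data, and unparks the waiting client; returns the client's result.
fn run_bridge_demo() -> i64 {
    let pending: Arc<Mutex<HashMap<u64, Arc<WaitRecord>>>> =
        Arc::new(Mutex::new(HashMap::new()));
    let (cq_tx, cq_rx) = mpsc::channel::<(u64, i64)>();

    // Reactor: the single runtime-owned CQ consumer.
    let reactor_pending = Arc::clone(&pending);
    let reactor = thread::spawn(move || {
        for (user_data, result) in cq_rx {
            if let Some(rec) = reactor_pending.lock().unwrap().remove(&user_data) {
                *rec.result.lock().unwrap() = Some(result);
                rec.cv.notify_one(); // unpark the waiting client thread
            }
        }
    });

    // Client: register a wait record keyed by user_data, then block on it.
    let rec = Arc::new(WaitRecord { result: Mutex::new(None), cv: Condvar::new() });
    pending.lock().unwrap().insert(7, Arc::clone(&rec));
    cq_tx.send((7, 42)).unwrap(); // a completion "arrives" on the CQ
    drop(cq_tx); // close the CQ so the reactor exits after draining

    let mut guard = rec.result.lock().unwrap();
    while guard.is_none() {
        guard = rec.cv.wait(guard).unwrap(); // Condvar waits can wake spuriously
    }
    let out = guard.unwrap();
    drop(guard);
    reactor.join().unwrap();
    out
}
```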
Rejected Direction: Slot-Specific cap_enter
Do not extend cap_enter to wait for raw CQ slots. Slots are circular-buffer
storage and can be reused after cq_head advances. A correct specific-wait
design would need stable request ids or completion tokens, at which point
per-thread ring endpoints solve the same ownership problem with less
special-case kernel state.
Roadmap
- Runtime reactor bridge on the current process ring.
- Ring allocation/accounting moved from process-only state to thread-owned ring records, while preserving the initial fixed bootstrap ring.
- ThreadSpawner.create allocates/maps a per-thread ring and passes its user address to the child.
- Scheduler waiters and endpoint/timer/park/process/thread completion paths post by target ThreadRef to that thread’s ring.
- cap_enter operates on the current thread’s ring; remove the one-process-ring waiter rule.
- Add SQPOLL mode only after per-CPU scheduler state exists.
- Add SQPOLL nohz only after CPU isolation leases, housekeeping placement, non-tick CPU accounting, and network polling placement are reviewed.
- Run full-SMP sibling-thread workloads that wait independently on different CPUs only after per-thread ring routing, TLB shootdown, and cross-CPU cleanup rules are reviewed.