Proposal: Ring v2 For Full SMP
How capOS should evolve the capability ring once multiple threads from one process can run concurrently on multiple CPUs.
The current ring design is intentionally process-wide: one ring page per
process, one SQ, one CQ, and one blocked cap_enter waiter admitted per
process. That was the right first threading milestone because it preserved the
existing transport while moving scheduler identity from process ids to
generation-checked ThreadRef values.
That design can support an initial multi-CPU scheduler proof if the runtime continues to serialize process-ring consumption. It should not be the endpoint for full SMP where sibling threads from one process run and wait on different CPUs. A single process CQ forces those sibling threads to coordinate completion consumption in userspace and keeps the kernel from knowing which thread should block for which CQ stream. The full-SMP target is per-thread ring ownership.
Design Grounding
The local research files checked before this design were:
- docs/research/completion-ring-threading.md;
- docs/research/out-of-kernel-scheduling.md;
- docs/research/llvm-target.md;
- docs/research/sel4.md;
- docs/research/zircon.md.
The relevant result is that efficient shared rings want clear producer/consumer
ownership. Linux io_uring uses user_data to identify requests, but its
aggregate wait model does not by itself solve multiple user consumers waiting
on one raw CQ. Futexes provide the right user-runtime parking primitive for
compatibility demux. Windows IOCP is a shared completion packet queue model,
which is useful as a runtime abstraction but should not be confused with
letting several kernel-blocked threads wait on the same circular CQ storage.
Target Model
Each live process thread owns one capability ring endpoint. A ring endpoint is a complete SQ/CQ pair with one userspace-visible identity; it may be mapped as one page per thread or as a lane in a larger ring bundle, but a lane is not just a CQ attached to a shared process SQ.
Each endpoint has:
- one userspace SQ/CQ pair;
- one kernel RingScratch or equivalent dispatch scratch owned by that thread or by the ring endpoint;
- one blocked cap_enter waiter for that thread’s CQ;
- one ring address passed to the thread at startup.
The process remains the authority boundary. Address space, cap table, CapSet, and resource accounting stay process-owned. Result-cap transfers still install capabilities into the process cap table. Per-thread rings only split transport progress and completion ownership.
cap_enter(min_complete, timeout_ns) keeps its current syscall shape, but the
meaning becomes:
Process pending SQEs for the current thread’s ring, then block the current thread until at least min_complete CQEs are available on that same thread’s CQ, or until timeout.
Userspace still matches individual requests by user_data within the current
thread’s CQ. The kernel does not add slot-specific waits; CQ slots are storage,
not durable request identities.
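The wait rule above can be sketched as a toy model. The names ThreadCq, Cqe, and wait_satisfied are hypothetical, and the entry layout is not the capOS CQE ABI. The sketch only illustrates that the wake condition counts entries on the calling thread's own CQ and that userspace identifies a request by user_data rather than by slot position.

```rust
/// Hypothetical completion entry; not the real capOS CQE layout.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Cqe {
    pub user_data: u64,
    pub result: i32,
}

/// Toy model of one thread's CQ: a power-of-two circular buffer with
/// free-running head (consumer) and tail (producer) counters.
pub struct ThreadCq {
    slots: Vec<Option<Cqe>>,
    head: u32,
    tail: u32,
}

impl ThreadCq {
    pub fn new(capacity: u32) -> Self {
        assert!(capacity.is_power_of_two());
        Self { slots: vec![None; capacity as usize], head: 0, tail: 0 }
    }

    fn mask(&self) -> u32 {
        self.slots.len() as u32 - 1
    }

    /// Completions posted but not yet consumed.
    pub fn available(&self) -> u32 {
        self.tail.wrapping_sub(self.head)
    }

    /// Kernel side: post one completion; returns false when the CQ is full.
    pub fn post(&mut self, cqe: Cqe) -> bool {
        if self.available() as usize == self.slots.len() {
            return false;
        }
        let idx = (self.tail & self.mask()) as usize;
        self.slots[idx] = Some(cqe);
        self.tail = self.tail.wrapping_add(1);
        true
    }

    /// Wake condition for cap_enter(min_complete, ..) in this model:
    /// only the current thread's CQ is inspected.
    pub fn wait_satisfied(&self, min_complete: u32) -> bool {
        self.available() >= min_complete
    }

    /// Userspace side: consume the oldest completion and match it by user_data.
    pub fn pop(&mut self) -> Option<Cqe> {
        if self.available() == 0 {
            return None;
        }
        let idx = (self.head & self.mask()) as usize;
        self.head = self.head.wrapping_add(1);
        self.slots[idx].take()
    }
}
```

A blocked thread in this model becomes runnable once wait_satisfied(min_complete) holds for its own ThreadCq or the timeout fires; a completion posted to a sibling's CQ never changes the result.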
Thread Creation And Bootstrap
The initial thread may keep the legacy fixed RING_VADDR mapping during the
transition. Additional threads need unique ring mappings because all threads
share one address space.
The initial accepted contract is kernel-chosen ring mapping. ThreadSpawner
does not accept a caller-supplied ring address for the first Ring v2 slice.
The kernel allocates a ring record, maps that ring at a collision-free user
virtual address in the caller’s address space, charges it to the process
ledger, stores the address on the child ThreadRef, and passes the address in
the child start registers. If no ring mapping or record can be allocated, thread
creation fails before the child thread becomes runnable and rolls back all
thread and ring reservations.
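The all-or-nothing creation rule can be illustrated with a small model. Everything here is hypothetical: ProcessModel, its counters, the bump-allocated ring arena, and the 4096-byte ring size are assumptions made for the sketch, not the kernel's data structures. The point is only that each failure path undoes the reservations taken before it, so a failed spawn leaves process state unchanged.

```rust
#[derive(Debug, PartialEq)]
pub enum SpawnError {
    OutOfRingRecords,
    NoVirtualRange,
    LedgerExhausted,
}

/// Hypothetical per-process bookkeeping for the sketch.
#[derive(Clone, Debug, PartialEq)]
pub struct ProcessModel {
    pub ring_records_free: u32,
    pub ledger_pages_free: u32,
    pub next_ring_vaddr: u64,
    pub ring_arena_end: u64,
    pub live_threads: u32,
}

/// Assumed ring size: one page per thread.
pub const RING_BYTES: u64 = 4096;

/// Returns the kernel-chosen ring address for the new thread, or rolls back.
pub fn spawn_thread(p: &mut ProcessModel) -> Result<u64, SpawnError> {
    // Step 1: reserve a ring record.
    if p.ring_records_free == 0 {
        return Err(SpawnError::OutOfRingRecords);
    }
    p.ring_records_free -= 1;

    // Step 2: pick a collision-free address (a bump allocator in this model).
    let addr = p.next_ring_vaddr;
    let end = match addr.checked_add(RING_BYTES) {
        Some(end) if end <= p.ring_arena_end => end,
        _ => {
            p.ring_records_free += 1; // undo step 1
            return Err(SpawnError::NoVirtualRange);
        }
    };

    // Step 3: charge the process ledger.
    if p.ledger_pages_free == 0 {
        p.ring_records_free += 1; // undo step 1; step 2 committed nothing yet
        return Err(SpawnError::LedgerExhausted);
    }
    p.ledger_pages_free -= 1;

    // Commit: only now does the child become visible as a live thread.
    p.next_ring_vaddr = end;
    p.live_threads += 1;
    Ok(addr)
}
```

The address returned on success corresponds to the value that would be stored on the child ThreadRef and passed in the child start registers.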
Runtime-supplied ring address ranges remain a later extension. They need
reviewed VirtualMemory reservation semantics so the runtime can reserve a
ring arena without racing normal user mappings. Until that extension lands,
Ring v2 implementation branches must not add a ThreadSpawner.create parameter
for a caller-selected ring address.
The child thread entry contract should continue to pass bootstrap register values equivalent to:
- RDI = arg;
- RSI = tid;
- RDX = pid;
- RCX = thread_ring_addr;
- R8 = CAPSET_VADDR, or zero if absent.
For the initial process thread, _start keeps receiving the ring address from
the loader ABI. Once every userspace binary uses the runtime-provided ring
address instead of assuming RING_VADDR, the fixed mapping can become a
bootstrap-only compatibility detail.
When Ring v2 introduces versioned SQE/CQE layouts, the register-level ring address handoff becomes one field of the negotiated runtime boot record:
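```rust
struct RuntimeBootInfo {
    // User virtual address of this thread's ring.
    ring_addr: u64,
    // Negotiated ring ABI version.
    ring_abi_version: u32,
    // Entry sizes, in bytes, that cap_enter validates.
    sqe_size: u16,
    cqe_size: u16,
}
```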
RuntimeBootInfo, ring ABI version constants, and fixed SQE/CQE layouts live in
capos-config/src/ring.rs. Kernel code and capos-rt must import the shared
definition instead of maintaining parallel boot-ABI structs.
The first implementation may continue using the existing fixed SQE/CQE layout
and RING_VADDR for the initial thread. It still needs a shared ring-endpoint
descriptor in the kernel so initial-thread and child-thread rings use the same
lifetime, waiter, and completion-routing rules. The fixed initial mapping is a
compatibility special case, not a separate process-wide ring once Ring v2 is
enabled for a process.
The Tickless/Realtime proposal owns the first CapSqeV2 use case
(deadline_ns, qos_flags, and sched_ctx_id), but Ring v2 owns the transport
rule: every thread ring handoff must carry or imply the same ABI version and
entry sizes that cap_enter validates. A runtime must not infer CapSqeV2
from the address alone.
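A runtime-side check for this rule might look like the sketch below. The version numbers and entry sizes in SUPPORTED_ABIS are placeholders invented for illustration, not the capOS constants, and validate_boot_info and BootError are hypothetical names. The struct is repeated only to keep the example self-contained.

```rust
pub struct RuntimeBootInfo {
    pub ring_addr: u64,
    pub ring_abi_version: u32,
    pub sqe_size: u16,
    pub cqe_size: u16,
}

/// One ABI the runtime was built against (placeholder values below).
pub struct RingAbi {
    pub version: u32,
    pub sqe_size: u16,
    pub cqe_size: u16,
}

/// Hypothetical table; the real constants would come from the shared
/// ring definition, not from this sketch.
pub static SUPPORTED_ABIS: [RingAbi; 2] = [
    RingAbi { version: 1, sqe_size: 64, cqe_size: 16 },
    RingAbi { version: 2, sqe_size: 128, cqe_size: 32 },
];

#[derive(Debug, PartialEq)]
pub enum BootError {
    UnknownVersion(u32),
    SizeMismatch,
}

/// Accept a ring only when the handed-off version is known and the entry
/// sizes agree with it; never guess the layout from ring_addr.
pub fn validate_boot_info(info: &RuntimeBootInfo) -> Result<u32, BootError> {
    let abi = SUPPORTED_ABIS
        .iter()
        .find(|abi| abi.version == info.ring_abi_version)
        .ok_or(BootError::UnknownVersion(info.ring_abi_version))?;
    if abi.sqe_size != info.sqe_size || abi.cqe_size != info.cqe_size {
        return Err(BootError::SizeMismatch);
    }
    Ok(abi.version)
}
```

Failing closed on a mismatch keeps a runtime built for one layout from writing entries of the wrong size into a ring the kernel interprets differently.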
Completion Routing
Any kernel record that can later post a CQE must store a target ThreadRef and
post to that thread’s ring after generation validation:
- ordinary CALL completions target the submitting thread;
- endpoint RECV completions target the receiver thread;
- endpoint RETURN completions target the original caller thread;
- Timer.sleep completions target the sleeping thread;
- ProcessHandle.wait completions target the waiting thread;
- ThreadHandle.join completions target the joining thread;
- ParkSpace wait wake/timeout completions target the waiting thread;
- deferred endpoint cancellation completions target the thread that posted the cancelable operation.
Process exit cancels every ring owned by the process. Thread exit cancels that
thread’s own ring operations and wakes/drops waiters that name its ThreadRef.
If a thread exits with outstanding operations that can still complete, the
kernel must either cancel them before releasing the ring or hold the ring
record until all generation-checked completion paths drain.
Normative lifetime invariant: a ring record cannot be freed while any CPU, waiter, endpoint call, timer waiter, park waiter, cancellation path, deferred completion path, or SQPOLL worker can still post to it. Thread exit either cancels every such record first or keeps the ring record alive until all generation-checked completion paths have drained.
The implementation contract for completion routing is:
- scheduler state resolves ThreadRef -> RingEndpoint immediately before posting a CQE;
- a missing process, stale process generation, missing thread, stale thread generation, or closed ring endpoint turns the completion into a stale completion and must not write userspace memory;
- a ring endpoint stays pinned while a completion writer owns its reference;
- result-cap installation still targets the shared process cap table, but the CQE that names the installed result-cap slot is written only to the target thread’s CQ;
- cap_enter drains and waits on the current thread’s ring only; it never drains a sibling thread’s SQ and never waits on a process-wide CQ;
- same-process thread scaling remains unclaimable until endpoint, timer, park, process-wait, thread-join, deferred-cancel, and direct IPC completion paths all follow this ThreadRef -> RingEndpoint rule.
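The resolution and pinning rules in the contract above can be sketched with ordinary reference counting. This is a user-space model under stated assumptions: the field layout of ThreadRef, the Scheduler tables, and the use of Arc and Mutex are all hypothetical stand-ins for whatever the kernel uses. A cloned Arc plays the role of the pin a completion writer holds, and a generation mismatch or closed endpoint yields a stale completion that writes nothing.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical field layout for a generation-checked thread reference.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef {
    pub pid: u32,
    pub pid_gen: u32,
    pub tid: u32,
    pub tid_gen: u32,
}

/// Model endpoint: the CQ is just a list of posted user_data values.
pub struct RingEndpoint {
    pub closed: bool,
    pub cq: Vec<u64>,
}

pub struct ThreadSlot {
    pub generation: u32,
    pub ring: Arc<Mutex<RingEndpoint>>,
}

pub struct ProcessSlot {
    pub generation: u32,
    pub threads: HashMap<u32, ThreadSlot>,
}

pub struct Scheduler {
    pub procs: HashMap<u32, ProcessSlot>,
}

#[derive(Debug, PartialEq)]
pub enum Post {
    Delivered,
    Stale,
}

impl Scheduler {
    /// Resolve ThreadRef -> RingEndpoint; the cloned Arc pins the record
    /// for as long as the caller holds it.
    pub fn resolve(&self, t: ThreadRef) -> Option<Arc<Mutex<RingEndpoint>>> {
        let process = self.procs.get(&t.pid)?;
        if process.generation != t.pid_gen {
            return None;
        }
        let thread = process.threads.get(&t.tid)?;
        if thread.generation != t.tid_gen {
            return None;
        }
        Some(Arc::clone(&thread.ring))
    }

    /// Post a completion; any failed check becomes a stale completion
    /// and leaves the CQ untouched.
    pub fn post(&self, t: ThreadRef, user_data: u64) -> Post {
        let ring = match self.resolve(t) {
            Some(ring) => ring,
            None => return Post::Stale,
        };
        let mut endpoint = ring.lock().unwrap();
        if endpoint.closed {
            return Post::Stale;
        }
        endpoint.cq.push(user_data);
        Post::Delivered
    }
}
```

Because the writer holds its own reference, removing the thread from the scheduler table in this model cannot free the endpoint underneath an in-flight post, which mirrors the lifetime invariant stated earlier.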
SQPOLL And Kernel Consumers
Each thread ring must have exactly one kernel SQ consumer at a time:
- syscall mode: the owner thread’s cap_enter drains its own SQ;
- SQPOLL mode: a kernel worker drains that ring’s SQ, and cap_enter waits for CQ availability and returns counts. Userspace remains the CQ consumer.
The Phase F prerequisite now makes this an explicit kernel-side lease for the
current per-thread ring endpoints. Syscall-mode dispatch has a
generation-checked owner covering both caller-driven cap_enter and bounded
timer-side current-thread ring service; a stale owner cannot advance SQ head,
and a duplicate future SQPOLL owner is rejected while the syscall owner is
live. This does not enable SQPOLL mode, nohz, or CPU isolation.
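The lease behaviour described above can be modelled in a few lines. The names SqLease, Lease, Consumer, and LeaseError are hypothetical, and the model is single-threaded for clarity; it is a sketch of the ownership rule, not the kernel implementation. It shows the two properties named in the text: a duplicate owner is rejected while one is live, and a stale owner cannot advance the SQ head.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Consumer {
    Syscall,
    Sqpoll,
}

/// Proof of ownership handed to the one active SQ consumer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Lease {
    pub mode: Consumer,
    pub generation: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LeaseError {
    Busy,
    Stale,
}

pub struct SqLease {
    owner: Option<Lease>,
    next_generation: u64,
    sq_head: u32,
}

impl SqLease {
    pub fn new() -> Self {
        Self { owner: None, next_generation: 0, sq_head: 0 }
    }

    /// Grant the lease only when no consumer currently owns the SQ.
    pub fn acquire(&mut self, mode: Consumer) -> Result<Lease, LeaseError> {
        if self.owner.is_some() {
            return Err(LeaseError::Busy);
        }
        let lease = Lease { mode, generation: self.next_generation };
        self.next_generation += 1;
        self.owner = Some(lease);
        Ok(lease)
    }

    /// Quiesce: the current owner gives up the SQ before a mode change.
    pub fn release(&mut self, lease: Lease) -> Result<(), LeaseError> {
        if self.owner != Some(lease) {
            return Err(LeaseError::Stale);
        }
        self.owner = None;
        Ok(())
    }

    /// Only the live, generation-matching owner may advance the SQ head.
    pub fn advance_head(&mut self, lease: Lease, consumed: u32) -> Result<u32, LeaseError> {
        if self.owner != Some(lease) {
            return Err(LeaseError::Stale);
        }
        self.sq_head = self.sq_head.wrapping_add(consumed);
        Ok(self.sq_head)
    }
}
```

In this model, switching a ring from syscall mode to SQPOLL mode is a release followed by an acquire, so there is no instant at which two consumers hold a valid lease.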
Mode changes require quiescing the ring so cap_enter and SQPOLL do not both
consume the same SQ. SQPOLL workers should be bound through scheduler policy or
future CPU grants after APs run kernel idle loops and per-CPU scheduling exists.
Timer interrupt polling may continue to process bounded interrupt-safe work for the current thread’s ring in syscall mode, but it must not become a second SQ consumer for an SQPOLL-owned ring.
Full-nohz for SQPOLL is a later CPU-isolation contract, not part of initial Ring v2. A poller CPU may suppress the periodic scheduler tick only when a housekeeping CPU remains online, the SQPOLL worker is the only runnable entity on that CPU, no timer-side SQ polling or transitional network scheduler polling is pinned there, and CPU accounting is boundary/counter driven rather than tick-driven. Phase F now reports explicit housekeeping/deferred-work placement or rejection for those prerequisites while keeping syscall-mode SQ ownership, periodic ticks, and SQPOLL disabled. The broader staging is in Tickless and Realtime Scheduling.
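The tick-suppression conditions listed above amount to a conjunction, which the following sketch spells out. PollerCpuState and may_suppress_tick are hypothetical names, and the fields are an assumed encoding of the prose conditions rather than real scheduler state.

```rust
/// Assumed snapshot of one candidate poller CPU.
#[derive(Clone, Copy, Debug)]
pub struct PollerCpuState {
    pub housekeeping_cpu_online: bool,
    pub runnable_entities: u32,
    pub sqpoll_worker_runnable: bool,
    pub timer_side_sq_polling_pinned: bool,
    pub network_polling_pinned: bool,
    pub accounting_is_tick_driven: bool,
}

/// True only when every prerequisite from the text holds at once.
pub fn may_suppress_tick(cpu: &PollerCpuState) -> bool {
    cpu.housekeeping_cpu_online
        // The SQPOLL worker must be the only runnable entity on this CPU.
        && cpu.sqpoll_worker_runnable
        && cpu.runnable_entities == 1
        // No tick-dependent polling may be pinned here.
        && !cpu.timer_side_sq_polling_pinned
        && !cpu.network_polling_pinned
        // Accounting must be boundary/counter driven, not tick-driven.
        && !cpu.accounting_is_tick_driven
}
```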
Scheduler And SMP Requirements
Per-thread rings are not sufficient for full SMP by themselves. Multi-CPU userspace scheduling also requires:
- per-CPU current-thread state as the scheduler authority, not only a BSP mirror;
- per-CPU run queues plus a migration/work-stealing protocol;
- a current-CPU field for runnable/running threads plus an address-space active-CPU mask, or equivalent target set, for TLB shootdown;
- TLB shootdown before a thread can migrate or two threads in one address space can run on different CPUs while mappings change;
- cap-table locking or finer object locks that tolerate concurrent calls from sibling threads;
- address-space locking rules for concurrent VirtualMemory operations, process exit, and user-buffer copy paths;
- process and thread ring cleanup that cannot free a ring while another CPU is posting a completion to it.
The first Phase C multi-CPU scheduler smoke may keep the current process ring if the runtime still serializes process-ring consumption. A later full-SMP smoke that runs sibling threads from one process concurrently on different CPUs should wait for per-thread ring completion routing and TLB shootdown review.
Compatibility Bridge
Before Ring v2, capos-rt can support multithreaded programs on the current
process ring with a runtime reactor:
- one runtime-owned waiter drains the process CQ;
- ordinary client threads block on runtime wait records using ParkSpace;
- the reactor matches CQEs by user_data and unparks the waiting thread.
This is a bridge, not the final SMP ABI. It is useful for validating runtime logic and higher-level language support before kernel per-thread rings land.
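A minimal sketch of the reactor demux follows. It uses std Mutex and Condvar as a stand-in for ParkSpace, which is an assumption made so the example runs on a host; Reactor, Ticket, and WaitRecord are hypothetical names and not the capos-rt API. The reactor thread would call complete once per CQE it drains from the process CQ.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

/// Per-request wait record; Condvar stands in for a ParkSpace wait.
#[derive(Default)]
struct WaitRecord {
    result: Mutex<Option<i32>>,
    cv: Condvar,
}

/// Owned by the runtime; maps user_data to the thread waiting on it.
#[derive(Default)]
pub struct Reactor {
    pending: Mutex<HashMap<u64, Arc<WaitRecord>>>,
}

/// Handle a client thread blocks on after submitting its SQE.
pub struct Ticket(Arc<WaitRecord>);

impl Reactor {
    /// Client side: register before submitting the SQE tagged user_data.
    pub fn register(&self, user_data: u64) -> Ticket {
        let record = Arc::new(WaitRecord::default());
        self.pending.lock().unwrap().insert(user_data, Arc::clone(&record));
        Ticket(record)
    }

    /// Reactor side: route one drained CQE to its waiter.
    /// Returns false when no thread is waiting on this user_data.
    pub fn complete(&self, user_data: u64, result: i32) -> bool {
        let record = self.pending.lock().unwrap().remove(&user_data);
        match record {
            Some(record) => {
                *record.result.lock().unwrap() = Some(result);
                record.cv.notify_one();
                true
            }
            None => false,
        }
    }
}

impl Ticket {
    /// Block until the reactor delivers the matching completion.
    pub fn wait(self) -> i32 {
        let mut guard = self.0.result.lock().unwrap();
        loop {
            if let Some(result) = *guard {
                return result;
            }
            guard = self.0.cv.wait(guard).unwrap();
        }
    }
}
```

Registering before submission matters in this design: if the completion arrived before the wait record existed, the reactor would have nothing to route it to.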
Rejected Direction: Slot-Specific cap_enter
Do not extend cap_enter to wait for raw CQ slots. Slots are circular-buffer
storage and can be reused after cq_head advances. A correct specific-wait
design would need stable request ids or completion tokens, at which point
per-thread ring endpoints solve the same ownership problem with less
special-case kernel state.
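The slot-reuse hazard is easy to reproduce in a toy ring. SlotCq is a hypothetical four-entry model, not the capOS CQ; it records which physical slot each completion lands in so the reuse is visible.

```rust
/// Toy four-entry CQ that stores only user_data.
pub struct SlotCq {
    slots: [u64; 4],
    head: u32,
    tail: u32,
}

impl SlotCq {
    pub fn new() -> Self {
        Self { slots: [0; 4], head: 0, tail: 0 }
    }

    /// Post a completion and report the physical slot it occupied.
    pub fn post(&mut self, user_data: u64) -> Option<usize> {
        if self.tail.wrapping_sub(self.head) == 4 {
            return None; // full
        }
        let slot = (self.tail & 3) as usize;
        self.slots[slot] = user_data;
        self.tail = self.tail.wrapping_add(1);
        Some(slot)
    }

    /// Consume the oldest completion, advancing cq_head.
    pub fn consume(&mut self) -> Option<u64> {
        if self.tail == self.head {
            return None; // empty
        }
        let slot = (self.head & 3) as usize;
        self.head = self.head.wrapping_add(1);
        Some(self.slots[slot])
    }

    /// What a slot-specific waiter would observe.
    pub fn peek_slot(&self, slot: usize) -> u64 {
        self.slots[slot]
    }
}
```

After the first two completions are consumed and three more are posted, slot 0 holds an unrelated request's user_data, so a waiter keyed on "slot 0" has no way to tell its own completion from a later one.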
Roadmap
- Runtime reactor bridge on the current process ring.
- Add the shared RingEndpoint kernel record and make the initial fixed bootstrap ring use it without changing userspace behavior.
- Move ring allocation/accounting from process-only state to thread-owned ring records.
- ThreadSpawner.create allocates/maps a kernel-chosen per-thread ring and passes its user address to the child.
- Scheduler waiters and endpoint/timer/park/process/thread completion paths post by target ThreadRef to that thread’s ring.
- cap_enter operates on the current thread’s ring; remove the one-process-ring waiter rule.
- Add SQPOLL mode only after per-CPU scheduler state exists.
- Add SQPOLL nohz only after CPU isolation leases, housekeeping placement, non-tick CPU accounting, and network polling placement are reviewed.
- Run full-SMP sibling-thread workloads that wait independently on different CPUs only after per-thread ring routing, TLB shootdown, and cross-CPU cleanup rules are reviewed.