Proposal: Ring v2 For Full SMP

How capOS should evolve the capability ring once multiple threads from one process can run concurrently on multiple CPUs.

The current ring design is intentionally process-wide: one ring page per process, one SQ, one CQ, and one blocked cap_enter waiter admitted per process. That was the right first threading milestone because it preserved the existing transport while moving scheduler identity from process ids to generation-checked ThreadRef values.

That design can support an initial multi-CPU scheduler proof if the runtime continues to serialize process-ring consumption. It should not be the endpoint for full SMP where sibling threads from one process run and wait on different CPUs. A single process CQ forces those sibling threads to coordinate completion consumption in userspace and keeps the kernel from knowing which thread should block for which CQ stream. The full-SMP target is per-thread ring ownership.

Design Grounding

The following local research files were reviewed before writing this design:

  • docs/research/completion-ring-threading.md;
  • docs/research/out-of-kernel-scheduling.md;
  • docs/research/llvm-target.md;
  • docs/research/sel4.md;
  • docs/research/zircon.md.

The relevant result is that efficient shared rings want clear producer/consumer ownership. Linux io_uring uses user_data to identify requests, but its aggregate wait model does not by itself solve multiple user consumers waiting on one raw CQ. Futexes provide the right user-runtime parking primitive for compatibility demux. Windows IOCP is a shared completion packet queue model, which is useful as a runtime abstraction but should not be confused with letting several kernel-blocked threads wait on the same circular CQ storage.

Target Model

Each live process thread owns one capability ring endpoint. A ring endpoint is a complete SQ/CQ pair with one userspace-visible identity; it may be mapped as one page per thread or as a lane in a larger ring bundle, but a lane is not just a CQ attached to a shared process SQ.

Each endpoint has:

  • one userspace SQ/CQ pair;
  • one kernel RingScratch or equivalent dispatch scratch owned by that thread or by the ring endpoint;
  • one blocked cap_enter waiter for that thread’s CQ;
  • one ring address passed to the thread at startup.

The process remains the authority boundary. Address space, cap table, CapSet, and resource accounting stay process-owned. Result-cap transfers still install capabilities into the process cap table. Per-thread rings only split transport progress and completion ownership.

cap_enter(min_complete, timeout_ns) keeps its current syscall shape, but the meaning becomes:

Process pending SQEs for the current thread’s ring, then block the current thread until at least min_complete CQEs are available on that same thread’s CQ, or until timeout.

Userspace still matches individual requests by user_data within the current thread’s CQ. The kernel does not add slot-specific waits; CQ slots are storage, not durable request identities.
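
A minimal sketch of that contract, assuming a simplified two-field CQE and a single-threaded model of one thread's CQ. `ThreadCq`, `Cqe`, and their methods are hypothetical names for illustration, not capOS definitions. `satisfies` models the wake condition for cap_enter(min_complete, ...), and `pop` models the userspace consumer, which identifies requests by user_data rather than by slot.

```rust
/// Hypothetical, simplified CQE: only the fields this sketch needs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Cqe {
    pub user_data: u64,
    pub result: i32,
}

/// Model of one thread's CQ. Slots are plain storage and are reused as the
/// head advances, so a request is identified by `user_data`, never by slot.
pub struct ThreadCq {
    slots: Vec<Option<Cqe>>,
    head: u32, // consumer index, advanced by userspace
    tail: u32, // producer index, advanced by the kernel
}

impl ThreadCq {
    /// Assumption: capacity is a power of two so indices can be masked.
    pub fn new(capacity: u32) -> Self {
        assert!(capacity.is_power_of_two());
        Self { slots: vec![None; capacity as usize], head: 0, tail: 0 }
    }

    fn mask(&self) -> u32 {
        self.slots.len() as u32 - 1
    }

    /// Number of CQEs posted but not yet consumed.
    pub fn available(&self) -> u32 {
        self.tail.wrapping_sub(self.head)
    }

    /// Kernel side: post a completion. Returns false if the CQ is full.
    pub fn post(&mut self, cqe: Cqe) -> bool {
        if self.available() == self.slots.len() as u32 {
            return false;
        }
        let idx = (self.tail & self.mask()) as usize;
        self.slots[idx] = Some(cqe);
        self.tail = self.tail.wrapping_add(1);
        true
    }

    /// Wake condition: at least `min_complete` CQEs on this thread's CQ.
    pub fn satisfies(&self, min_complete: u32) -> bool {
        self.available() >= min_complete
    }

    /// Userspace side: consume the oldest CQE, if any.
    pub fn pop(&mut self) -> Option<Cqe> {
        if self.available() == 0 {
            return None;
        }
        let idx = (self.head & self.mask()) as usize;
        let cqe = self.slots[idx].take();
        self.head = self.head.wrapping_add(1);
        cqe
    }
}
```

The sketch leaves out timeouts and memory ordering; a real shared ring would need atomic head/tail accesses with acquire/release ordering.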

Thread Creation And Bootstrap

The initial thread may keep the legacy fixed RING_VADDR mapping during the transition. Additional threads need unique ring mappings because all threads share one address space.

The initial accepted contract is kernel-chosen ring mapping. ThreadSpawner does not accept a caller-supplied ring address for the first Ring v2 slice. The kernel allocates a ring record, maps that ring at a collision-free user virtual address in the caller’s address space, charges it to the process ledger, stores the address on the child ThreadRef, and passes the address in the child start registers. If no ring mapping or record can be allocated, thread creation fails before the child thread becomes runnable and rolls back all thread and ring reservations.
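
The all-or-nothing shape of that contract can be sketched as follows. This is an illustration under stated assumptions, not the kernel's code: `ProcessState`, `SpawnError`, the counters, and the bump-style address choice are all hypothetical stand-ins for the real ledger, ring-record allocator, and virtual-address search.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SpawnError {
    LedgerExhausted,
    NoRingRecord,
    NoVirtualRange,
}

pub struct ChildThread {
    pub tid: u32,
    pub ring_addr: u64, // stored on the child's ThreadRef in the real design
}

/// Hypothetical process-side state touched by thread creation.
pub struct ProcessState {
    pub ring_pages_left: u32,              // process ledger budget
    pub ring_records_left: u32,            // kernel ring-record pool
    pub next_free_ring_vaddr: Option<u64>, // kernel-chosen, collision-free
    pub threads: Vec<ChildThread>,
}

const RING_PAGE_SIZE: u64 = 4096; // assumption: one page per ring

/// Returns the ring address handed to the child, or fails with every
/// earlier reservation rolled back and no thread made runnable.
pub fn spawn_thread(p: &mut ProcessState, tid: u32) -> Result<u64, SpawnError> {
    // Step 1: charge the process ledger.
    if p.ring_pages_left == 0 {
        return Err(SpawnError::LedgerExhausted);
    }
    p.ring_pages_left -= 1;

    // Step 2: reserve a ring record; undo step 1 on failure.
    if p.ring_records_left == 0 {
        p.ring_pages_left += 1;
        return Err(SpawnError::NoRingRecord);
    }
    p.ring_records_left -= 1;

    // Step 3: the kernel, not the caller, picks the mapping address.
    let Some(vaddr) = p.next_free_ring_vaddr else {
        p.ring_records_left += 1;
        p.ring_pages_left += 1;
        return Err(SpawnError::NoVirtualRange);
    };
    p.next_free_ring_vaddr = vaddr.checked_add(RING_PAGE_SIZE);

    // Step 4: only now does the child exist with its ring address recorded.
    p.threads.push(ChildThread { tid, ring_addr: vaddr });
    Ok(vaddr)
}
```

Note that `spawn_thread` takes no ring-address parameter, which mirrors the rule that the first Ring v2 slice does not accept a caller-supplied address.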

Runtime-supplied ring address ranges remain a later extension. They need reviewed VirtualMemory reservation semantics so the runtime can reserve a ring arena without racing normal user mappings. Until that extension lands, Ring v2 implementation branches must not add a ThreadSpawner.create parameter for a caller-selected ring address.

The child thread entry contract should continue to pass bootstrap register values equivalent to:

  • RDI = arg;
  • RSI = tid;
  • RDX = pid;
  • RCX = thread_ring_addr;
  • R8 = CAPSET_VADDR, or zero if absent.
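
The register contract above can be written down as a small helper. `ChildStartRegs` and `child_start_regs` are hypothetical names used only for this sketch; the register assignments themselves come from the list.

```rust
/// Hypothetical container for the child's initial register values.
#[derive(Debug, PartialEq, Eq)]
pub struct ChildStartRegs {
    pub rdi: u64, // arg
    pub rsi: u64, // tid
    pub rdx: u64, // pid
    pub rcx: u64, // thread_ring_addr
    pub r8: u64,  // CAPSET_VADDR, or zero if absent
}

/// Build the bootstrap registers for a child thread.
pub fn child_start_regs(
    arg: u64,
    tid: u64,
    pid: u64,
    thread_ring_addr: u64,
    capset_vaddr: Option<u64>,
) -> ChildStartRegs {
    ChildStartRegs {
        rdi: arg,
        rsi: tid,
        rdx: pid,
        rcx: thread_ring_addr,
        // An absent CapSet mapping is encoded as zero.
        r8: capset_vaddr.unwrap_or(0),
    }
}
```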

For the initial process thread, _start keeps receiving the ring address from the loader ABI. Once every userspace binary uses the runtime-provided ring address instead of assuming RING_VADDR, the fixed mapping can become a bootstrap-only compatibility detail.

When Ring v2 introduces versioned SQE/CQE layouts, the register-level ring address handoff becomes one field of the negotiated runtime boot record:

struct RuntimeBootInfo {
    ring_addr: u64,
    ring_abi_version: u32,
    sqe_size: u16,
    cqe_size: u16,
}

RuntimeBootInfo, ring ABI version constants, and fixed SQE/CQE layouts live in capos-config/src/ring.rs. Kernel code and capos-rt must import the shared definition instead of maintaining parallel boot-ABI structs.

The first implementation may continue using the existing fixed SQE/CQE layout and RING_VADDR for the initial thread. It still needs a shared ring-endpoint descriptor in the kernel so initial-thread and child-thread rings use the same lifetime, waiter, and completion-routing rules. The fixed initial mapping is a compatibility special case, not a separate process-wide ring once Ring v2 is enabled for a process.

The Tickless/Realtime proposal owns the first CapSqeV2 use case (deadline_ns, qos_flags, and sched_ctx_id), but Ring v2 owns the transport rule: every thread ring handoff must carry or imply the same ABI version and entry sizes that cap_enter validates. A runtime must not infer CapSqeV2 from the address alone.
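
One way a runtime could honor that rule is to select its SQE/CQE layout only from the explicit version and size fields. The sketch below redeclares `RuntimeBootInfo` locally so it is self-contained; the version constants and the entry sizes are placeholder values chosen for illustration, not the real capOS ABI numbers, and `select_layout` is a hypothetical helper.

```rust
#[derive(Clone, Copy, Debug)]
pub struct RuntimeBootInfo {
    pub ring_addr: u64,
    pub ring_abi_version: u32,
    pub sqe_size: u16,
    pub cqe_size: u16,
}

// Placeholder values: the real constants live in capos-config/src/ring.rs.
pub const RING_ABI_V1: u32 = 1;
pub const RING_ABI_V2: u32 = 2;
const V1_SIZES: (u16, u16) = (64, 32); // (sqe_size, cqe_size), assumed
const V2_SIZES: (u16, u16) = (96, 32); // (sqe_size, cqe_size), assumed

#[derive(Debug, PartialEq, Eq)]
pub enum Layout {
    V1,
    V2,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BootInfoError {
    NullRing,
    UnknownVersion(u32),
    SizeMismatch,
}

/// Pick a layout from the declared version and sizes only.
/// The ring address is checked for presence but never used to guess a layout.
pub fn select_layout(info: &RuntimeBootInfo) -> Result<Layout, BootInfoError> {
    if info.ring_addr == 0 {
        return Err(BootInfoError::NullRing);
    }
    let (layout, expected) = match info.ring_abi_version {
        RING_ABI_V1 => (Layout::V1, V1_SIZES),
        RING_ABI_V2 => (Layout::V2, V2_SIZES),
        other => return Err(BootInfoError::UnknownVersion(other)),
    };
    if (info.sqe_size, info.cqe_size) != expected {
        return Err(BootInfoError::SizeMismatch);
    }
    Ok(layout)
}
```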

Completion Routing

Any kernel record that can later post a CQE must store a target ThreadRef and post to that thread’s ring after generation validation:

  • ordinary CALL completions target the submitting thread;
  • endpoint RECV completions target the receiver thread;
  • endpoint RETURN completions target the original caller thread;
  • Timer.sleep completions target the sleeping thread;
  • ProcessHandle.wait completions target the waiting thread;
  • ThreadHandle.join completions target the joining thread;
  • ParkSpace wait wake/timeout completions target the waiting thread;
  • deferred endpoint cancellation completions target the thread that posted the cancelable operation.
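
The routing table above maps naturally onto an enum in which every pending record carries its target. The following is a sketch only: the `ThreadRef` fields and the `PendingCompletion` variants are hypothetical names that mirror the bullets, not the kernel's actual types.

```rust
/// Hypothetical generation-checked thread reference.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef {
    pub pid: u32,
    pub pid_gen: u32,
    pub tid: u32,
    pub tid_gen: u32,
}

/// One variant per kind of kernel record that can later post a CQE.
pub enum PendingCompletion {
    Call { submitter: ThreadRef },
    EndpointRecv { receiver: ThreadRef },
    EndpointReturn { original_caller: ThreadRef },
    TimerSleep { sleeper: ThreadRef },
    ProcessWait { waiter: ThreadRef },
    ThreadJoin { joiner: ThreadRef },
    ParkWait { waiter: ThreadRef },
    DeferredCancel { poster: ThreadRef },
}

impl PendingCompletion {
    /// The thread whose CQ receives the completion. The match is exhaustive,
    /// so adding a new record kind forces a routing decision at compile time.
    pub fn target(&self) -> ThreadRef {
        match *self {
            PendingCompletion::Call { submitter } => submitter,
            PendingCompletion::EndpointRecv { receiver } => receiver,
            PendingCompletion::EndpointReturn { original_caller } => original_caller,
            PendingCompletion::TimerSleep { sleeper } => sleeper,
            PendingCompletion::ProcessWait { waiter } => waiter,
            PendingCompletion::ThreadJoin { joiner } => joiner,
            PendingCompletion::ParkWait { waiter } => waiter,
            PendingCompletion::DeferredCancel { poster } => poster,
        }
    }
}
```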

Process exit cancels every ring owned by the process. Thread exit cancels that thread’s own ring operations and wakes/drops waiters that name its ThreadRef. If a thread exits with outstanding operations that can still complete, the kernel must either cancel them before releasing the ring or hold the ring record until all generation-checked completion paths drain.

Normative lifetime invariant: a ring record cannot be freed while any CPU, waiter, endpoint call, timer waiter, park waiter, cancellation path, deferred completion path, or SQPOLL worker can still post to it. Thread exit either cancels every such record first or keeps the ring record alive until all generation-checked completion paths have drained.

The implementation contract for completion routing is:

  • scheduler state resolves ThreadRef -> RingEndpoint immediately before posting a CQE;
  • a missing process, stale process generation, missing thread, stale thread generation, or closed ring endpoint turns the completion into a stale completion and must not write userspace memory;
  • a ring endpoint stays pinned while a completion writer owns its reference;
  • result-cap installation still targets the shared process cap table, but the CQE that names the installed result-cap slot is written only to the target thread’s CQ;
  • cap_enter drains and waits on the current thread’s ring only; it never drains a sibling thread’s SQ and never waits on a process-wide CQ;
  • same-process thread scaling must not be claimed as supported until endpoint, timer, park, process-wait, thread-join, deferred-cancel, and direct IPC completion paths all follow this ThreadRef -> RingEndpoint rule.
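
A minimal sketch of the resolve-then-post rule, assuming a map-based scheduler table and using `Arc` as a stand-in for whatever pinning mechanism the kernel adopts. All type names here (`SchedulerState`, `ProcessSlot`, `ThreadSlot`, `PostOutcome`) and the `ThreadRef` field layout are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef {
    pub pid: u32,
    pub pid_gen: u32,
    pub tid: u32,
    pub tid_gen: u32,
}

/// Stand-in for a ring endpoint; the CQ is modeled as a list of user_data.
#[derive(Default)]
pub struct RingEndpoint {
    pub closed: bool,
    pub cq_user_data: Vec<u64>,
}

pub struct ThreadSlot {
    pub generation: u32,
    pub ring: Arc<Mutex<RingEndpoint>>,
}

pub struct ProcessSlot {
    pub generation: u32,
    pub threads: HashMap<u32, ThreadSlot>,
}

#[derive(Default)]
pub struct SchedulerState {
    pub processes: HashMap<u32, ProcessSlot>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PostOutcome {
    Posted,
    Stale, // nothing was written to userspace-visible memory
}

impl SchedulerState {
    /// ThreadRef -> RingEndpoint, checking both generations.
    fn resolve(&self, t: ThreadRef) -> Option<Arc<Mutex<RingEndpoint>>> {
        let process = self.processes.get(&t.pid)?;
        if process.generation != t.pid_gen {
            return None;
        }
        let thread = process.threads.get(&t.tid)?;
        if thread.generation != t.tid_gen {
            return None;
        }
        // Cloning the Arc pins the endpoint while the writer holds it.
        Some(Arc::clone(&thread.ring))
    }

    /// Resolve immediately before posting; any failure is a stale completion.
    pub fn post(&self, target: ThreadRef, user_data: u64) -> PostOutcome {
        let Some(pinned) = self.resolve(target) else {
            return PostOutcome::Stale;
        };
        let mut ring = pinned.lock().unwrap();
        if ring.closed {
            return PostOutcome::Stale;
        }
        ring.cq_user_data.push(user_data);
        PostOutcome::Posted
    }
}
```

Because the writer holds its own reference, removing the thread from the table while a post is in flight cannot free the endpoint underneath it, which is the property the lifetime invariant asks for.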

SQPOLL And Kernel Consumers

Each thread ring must have exactly one kernel SQ consumer at a time:

  • syscall mode: the owner thread’s cap_enter drains its own SQ;
  • SQPOLL mode: a kernel worker drains that ring’s SQ, and cap_enter waits for CQ availability and returns counts. Userspace remains the CQ consumer.

The Phase F prerequisite now makes this an explicit kernel-side lease for the current per-thread ring endpoints. Syscall-mode dispatch has a generation-checked owner covering both caller-driven cap_enter and bounded timer-side current-thread ring service; a stale owner cannot advance SQ head, and a duplicate future SQPOLL owner is rejected while the syscall owner is live. This does not enable SQPOLL mode, nohz, or CPU isolation.

Mode changes require quiescing the ring so cap_enter and SQPOLL do not both consume the same SQ. SQPOLL workers should be bound through scheduler policy or future CPU grants after APs run kernel idle loops and per-CPU scheduling exists.
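
The single-consumer rule and the quiesce-before-mode-change rule can be modeled as a generation-checked lease. This is a sketch under stated assumptions: `SqOwnership`, `SqLease`, and `LeaseError` are hypothetical names, and the real kernel lease may carry more state than a mode and a counter.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SqMode {
    Syscall,
    SqPoll,
}

/// Proof of current SQ ownership; the generation makes old leases stale.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SqLease {
    pub mode: SqMode,
    pub generation: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LeaseError {
    Busy,  // another consumer still owns the SQ
    Stale, // the presented lease is not the current one
}

#[derive(Default)]
pub struct SqOwnership {
    current: Option<SqLease>,
    next_generation: u64,
    sq_head: u32,
}

impl SqOwnership {
    /// At most one consumer at a time: a second owner is rejected.
    pub fn acquire(&mut self, mode: SqMode) -> Result<SqLease, LeaseError> {
        if self.current.is_some() {
            return Err(LeaseError::Busy);
        }
        self.next_generation += 1;
        let lease = SqLease { mode, generation: self.next_generation };
        self.current = Some(lease);
        Ok(lease)
    }

    /// Only the live owner may advance the SQ head. Returns the new head.
    pub fn advance_head(&mut self, lease: SqLease, consumed: u32) -> Result<u32, LeaseError> {
        if self.current != Some(lease) {
            return Err(LeaseError::Stale);
        }
        self.sq_head = self.sq_head.wrapping_add(consumed);
        Ok(self.sq_head)
    }

    /// Quiesce: the owner gives up the SQ so another mode can take over.
    pub fn release(&mut self, lease: SqLease) -> Result<(), LeaseError> {
        if self.current != Some(lease) {
            return Err(LeaseError::Stale);
        }
        self.current = None;
        Ok(())
    }
}
```

A mode change in this model is a release followed by an acquire, so there is no window in which both consumers hold a valid lease.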

Timer interrupt polling may continue to process bounded interrupt-safe work for the current thread’s ring in syscall mode, but it must not become a second SQ consumer for an SQPOLL-owned ring.

Full-nohz for SQPOLL is a later CPU-isolation contract, not part of initial Ring v2. A poller CPU may suppress the periodic scheduler tick only when a housekeeping CPU remains online, the SQPOLL worker is the only runnable entity on that CPU, no timer-side SQ polling or transitional network scheduler polling is pinned there, and CPU accounting is boundary/counter driven rather than tick-driven. Phase F now reports explicit housekeeping/deferred-work placement or rejection for those prerequisites while keeping syscall-mode SQ ownership, periodic ticks, and SQPOLL disabled. The broader staging is in Tickless and Realtime Scheduling.
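
The tick-suppression conditions in that paragraph amount to a conjunction, which a sketch makes explicit. `PollerCpu` and its field names are hypothetical; how the kernel would actually observe each condition is outside this proposal.

```rust
/// Hypothetical snapshot of the facts needed to decide tick suppression.
#[derive(Clone, Copy, Debug)]
pub struct PollerCpu {
    pub housekeeping_cpu_online: bool,
    pub sqpoll_worker_runnable: bool,
    pub runnable_entities: u32,
    pub timer_side_sq_polling_pinned: bool,
    pub network_polling_pinned: bool,
    pub accounting_tick_driven: bool,
}

/// True only when every listed prerequisite holds at once.
pub fn may_suppress_tick(cpu: &PollerCpu) -> bool {
    cpu.housekeeping_cpu_online
        // The SQPOLL worker must be the only runnable entity on this CPU.
        && cpu.sqpoll_worker_runnable
        && cpu.runnable_entities == 1
        // No tick-dependent polling may be pinned here.
        && !cpu.timer_side_sq_polling_pinned
        && !cpu.network_polling_pinned
        // Accounting must be boundary/counter driven, not tick-driven.
        && !cpu.accounting_tick_driven
}
```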

Scheduler And SMP Requirements

Per-thread rings are not sufficient for full SMP by themselves. Multi-CPU userspace scheduling also requires:

  • per-CPU current-thread state as the scheduler authority, not only a BSP mirror;
  • per-CPU run queues plus a migration/work-stealing protocol;
  • a current-CPU field for runnable/running threads plus an address-space active-CPU mask, or equivalent target set, for TLB shootdown;
  • TLB shootdown before a thread can migrate or two threads in one address space can run on different CPUs while mappings change;
  • cap-table locking or finer object locks that tolerate concurrent calls from sibling threads;
  • address-space locking rules for concurrent VirtualMemory operations, process exit, and user-buffer copy paths;
  • process and thread ring cleanup that cannot free a ring while another CPU is posting a completion to it.
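
To illustrate the active-CPU mask requirement from the list above, here is a small sketch of how a shootdown target set could be derived. It assumes at most 64 CPUs and a plain `u64` bitmask; `AddressSpace` and its methods are hypothetical, and a real kernel would need atomics and an IPI acknowledgement protocol on top.

```rust
/// Hypothetical per-address-space record of CPUs running its threads.
#[derive(Default)]
pub struct AddressSpace {
    active_cpus: u64, // bit n set => CPU n may hold TLB entries for us
}

impl AddressSpace {
    /// Called when a thread of this address space starts running on `cpu`.
    pub fn thread_scheduled_on(&mut self, cpu: u32) {
        assert!(cpu < 64);
        self.active_cpus |= 1u64 << cpu;
    }

    /// Called when `cpu` switches away from this address space.
    pub fn thread_descheduled_from(&mut self, cpu: u32) {
        assert!(cpu < 64);
        self.active_cpus &= !(1u64 << cpu);
    }

    /// CPUs that must be interrupted before a mapping change is visible.
    /// The initiating CPU flushes locally, so it is excluded.
    pub fn shootdown_targets(&self, initiating_cpu: u32) -> Vec<u32> {
        (0..64u32)
            .filter(|&cpu| cpu != initiating_cpu)
            .filter(|&cpu| (self.active_cpus & (1u64 << cpu)) != 0)
            .collect()
    }
}
```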

The first Phase C multi-CPU scheduler smoke test may keep the current process ring if the runtime still serializes process-ring consumption. A later full-SMP smoke test that runs sibling threads from one process concurrently on different CPUs should wait until per-thread ring completion routing and TLB shootdown have been reviewed.

Compatibility Bridge

Before Ring v2, capos-rt can support multithreaded programs on the current process ring with a runtime reactor:

  • one runtime-owned waiter drains the process CQ;
  • ordinary client threads block on runtime wait records using ParkSpace;
  • the reactor matches CQEs by user_data and unparks the waiting thread.

This is a bridge, not the final SMP ABI. It is useful for validating runtime logic and higher-level language support before kernel per-thread rings land.
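
A minimal sketch of the reactor demux, using `std::sync::Mutex` and `Condvar` as stand-ins for ParkSpace so the example runs on a host. `Reactor`, `WaitRecord`, and their methods are hypothetical names, not the capos-rt API; `dispatch` represents what the runtime-owned waiter would do for each CQE it drains.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

/// One parked client request. The Condvar stands in for a ParkSpace wait.
pub struct WaitRecord {
    result: Mutex<Option<i32>>,
    cv: Condvar,
}

impl WaitRecord {
    /// Client side: block until the reactor delivers this request's result.
    pub fn wait(&self) -> i32 {
        let mut slot = self.result.lock().unwrap();
        loop {
            if let Some(r) = slot.take() {
                return r;
            }
            slot = self.cv.wait(slot).unwrap();
        }
    }
}

/// Demultiplexer keyed by user_data.
#[derive(Default)]
pub struct Reactor {
    waiters: Mutex<HashMap<u64, Arc<WaitRecord>>>,
}

impl Reactor {
    /// Client side: register before submitting the SQE with this user_data.
    pub fn register(&self, user_data: u64) -> Arc<WaitRecord> {
        let rec = Arc::new(WaitRecord { result: Mutex::new(None), cv: Condvar::new() });
        self.waiters.lock().unwrap().insert(user_data, Arc::clone(&rec));
        rec
    }

    /// Reactor side: called once per CQE drained from the process CQ.
    /// Returns false if no thread was waiting for this user_data.
    pub fn dispatch(&self, user_data: u64, result: i32) -> bool {
        let Some(rec) = self.waiters.lock().unwrap().remove(&user_data) else {
            return false;
        };
        *rec.result.lock().unwrap() = Some(result);
        rec.cv.notify_one();
        true
    }
}
```

Storing the result under the same mutex the waiter checks means a completion that arrives before the client parks is not lost.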

Rejected Direction: Slot-Specific cap_enter

Do not extend cap_enter to wait for raw CQ slots. Slots are circular-buffer storage and can be reused after cq_head advances. A correct specific-wait design would need stable request ids or completion tokens, at which point per-thread ring endpoints solve the same ownership problem with less special-case kernel state.
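
The slot-reuse problem is easy to see with index arithmetic. The sketch assumes a power-of-two CQ capacity and a masked sequence number, which is a common ring layout but is an assumption here; `cq_slot` is a hypothetical helper.

```rust
/// Map a monotonically increasing completion sequence number to a slot.
/// Assumption: `capacity` is a power of two.
pub fn cq_slot(seq: u32, capacity: u32) -> u32 {
    assert!(capacity.is_power_of_two());
    seq & (capacity - 1)
}

/// Two distinct completions collide on a slot whenever their sequence
/// numbers differ by a multiple of the capacity.
pub fn same_slot(seq_a: u32, seq_b: u32, capacity: u32) -> bool {
    cq_slot(seq_a, capacity) == cq_slot(seq_b, capacity)
}
```

With a capacity of 4, completions 1 and 5 land in the same slot, so a thread waiting on "slot 1" could not tell which of the two requests finished once cq_head has advanced past the first.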

Roadmap

  1. Land the runtime reactor bridge on the current process ring.
  2. Add the shared RingEndpoint kernel record and make the initial fixed bootstrap ring use it without changing userspace behavior.
  3. Move ring allocation/accounting from process-only state to thread-owned ring records.
  4. ThreadSpawner.create allocates/maps a kernel-chosen per-thread ring and passes its user address to the child.
  5. Scheduler waiters and endpoint/timer/park/process/thread completion paths post by target ThreadRef to that thread’s ring.
  6. cap_enter operates on the current thread’s ring; remove the one-process-ring waiter rule.
  7. Add SQPOLL mode only after per-CPU scheduler state exists.
  8. Add SQPOLL nohz only after CPU isolation leases, housekeeping placement, non-tick CPU accounting, and network polling placement are reviewed.
  9. Run full-SMP sibling-thread workloads that wait independently on different CPUs only after per-thread ring routing, TLB shootdown, and cross-CPU cleanup rules are reviewed.