Capability Ring
The capability ring is the userspace-to-kernel transport for capability invocation. It avoids one syscall per operation while preserving a typed Cap’n Proto method boundary and explicit completion reporting.
The current error model is documented in Error Handling. Ring CQE status values report transport failures; typed capability exceptions and ordinary schema result unions sit above that transport layer.
Current Behavior
Each non-idle process gets one 4 KiB ring page mapped at RING_VADDR. The page
contains a volatile header, a 16-entry submission queue, and a 32-entry
completion queue. Userspace writes CapSqe records, advances sq_tail, and
uses cap_enter(min_complete, timeout_ns) to make ordinary calls progress.
sequenceDiagram
participant U as Userspace runtime
participant R as Ring page
participant K as Kernel ring dispatcher
participant C as Capability object
U->>R: write CapSqe and advance sq_tail
U->>K: cap_enter(min_complete, timeout_ns)
K->>R: read sq_head..sq_tail
K->>K: validate SQE fields and lock AddressSpace for user buffers
K->>C: call method or endpoint operation
C-->>K: completion, pending, or error
K->>R: write CapCqe and advance cq_tail
K-->>U: return available CQE count
U->>R: read matching CapCqe
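The flow in the diagram can be sketched with a toy in-memory model. This is an illustration only, not the capOS ABI: the `Sqe`, `Cqe`, and `Ring` types, their fields, and the free-running masked indices are assumptions made for the example. Only the queue sizes (16 and 32) come from the text. A real shared-memory ring would also need atomic loads and stores with acquire/release ordering on the head and tail words.

```rust
// Toy model of submit -> cap_enter drain -> CQE reap. All names are hypothetical.
const SQ_ENTRIES: u32 = 16; // from the text
const CQ_ENTRIES: u32 = 32; // from the text

#[derive(Clone, Copy, Default)]
struct Sqe { user_data: u64, method_id: u16 } // hypothetical subset of CapSqe

#[derive(Clone, Copy, Default, Debug, PartialEq)]
struct Cqe { user_data: u64, result: i32 } // hypothetical subset of CapCqe

struct Ring {
    sq: [Sqe; SQ_ENTRIES as usize],
    cq: [Cqe; CQ_ENTRIES as usize],
    sq_head: u32, // free-running indices, masked on access (assumption)
    sq_tail: u32,
    cq_head: u32,
    cq_tail: u32,
}

impl Ring {
    fn new() -> Self {
        Ring {
            sq: [Sqe::default(); SQ_ENTRIES as usize],
            cq: [Cqe::default(); CQ_ENTRIES as usize],
            sq_head: 0, sq_tail: 0, cq_head: 0, cq_tail: 0,
        }
    }

    /// Userspace side: publish one SQE by writing the slot and advancing sq_tail.
    fn submit(&mut self, sqe: Sqe) -> bool {
        if self.sq_tail.wrapping_sub(self.sq_head) == SQ_ENTRIES {
            return false; // submission queue is full
        }
        self.sq[(self.sq_tail & (SQ_ENTRIES - 1)) as usize] = sqe;
        self.sq_tail = self.sq_tail.wrapping_add(1);
        true
    }

    /// Stand-in for the kernel side of cap_enter: drain sq_head..sq_tail,
    /// post one CQE per SQE, and return the available CQE count.
    fn cap_enter(&mut self) -> u32 {
        while self.sq_head != self.sq_tail {
            if self.cq_tail.wrapping_sub(self.cq_head) == CQ_ENTRIES {
                break; // no completion space left; stop draining
            }
            let sqe = self.sq[(self.sq_head & (SQ_ENTRIES - 1)) as usize];
            self.sq_head = self.sq_head.wrapping_add(1);
            // Pretend the method wrote `method_id` result bytes.
            let cqe = Cqe { user_data: sqe.user_data, result: sqe.method_id as i32 };
            self.cq[(self.cq_tail & (CQ_ENTRIES - 1)) as usize] = cqe;
            self.cq_tail = self.cq_tail.wrapping_add(1);
        }
        self.cq_tail.wrapping_sub(self.cq_head)
    }

    /// Userspace side: consume the oldest CQE, if any.
    fn reap(&mut self) -> Option<Cqe> {
        if self.cq_head == self.cq_tail {
            return None;
        }
        let cqe = self.cq[(self.cq_head & (CQ_ENTRIES - 1)) as usize];
        self.cq_head = self.cq_head.wrapping_add(1);
        Some(cqe)
    }
}
```

Because both sizes are powers of two, masking a free-running index with `ENTRIES - 1` selects the slot without a modulo, and `tail - head` gives the occupancy even after the counters wrap.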
Timer polling also processes each current process’s ring before preemption, but
only non-CALL operations and CALL targets that explicitly allow interrupt
dispatch may run there. Ordinary CALLs wait for cap_enter.
Why ordinary CALL waits for cap_enter: Submitting a CALL SQE is only a shared-memory write. The kernel still needs a safe execution point to drain the ring and run capability code. Timer polling runs in interrupt context, so it must not execute arbitrary capability methods that may allocate, block on locks, mutate page tables, spawn processes, parse Cap’n Proto messages, or perform IPC side effects. cap_enter is the normal process-context drain point: it processes pending SQEs, posts CQEs, and then either returns the available completion count or blocks until enough completions arrive. The design keeps SQE publication syscall-free and batchable, keeps the syscall ABI limited to exit and cap_enter, and avoids turning the timer interrupt into a general capability executor. A future SQPOLL-style path can remove the explicit syscall from the hot path only by running dispatch in a worker context, not from arbitrary timer interrupt execution.
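The context rule can be summarized as a small predicate. The sketch below is a hypothetical model, not the kernel's dispatcher: the `Context` and `Op` enums and the `target_allows_interrupt` flag are names invented for the example. It also folds in the rule, stated in the Design section, that PARK and UNPARK run only from syscall-context dispatch.

```rust
// Hypothetical model of which operations may run in which dispatch context.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Context {
    TimerPoll, // interrupt context, before preemption
    CapEnter,  // process context, inside the cap_enter syscall
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Op {
    /// CALL; the flag models a target that explicitly allows interrupt dispatch.
    Call { target_allows_interrupt: bool },
    Park,
    Unpark,
    /// Any other non-CALL operation (for example NOP or RELEASE).
    Other,
}

fn may_dispatch(op: Op, ctx: Context) -> bool {
    match (ctx, op) {
        // cap_enter is the normal drain point: everything may run here.
        (Context::CapEnter, _) => true,
        // Timer polling runs a CALL only if the target opted in.
        (Context::TimerPoll, Op::Call { target_allows_interrupt }) => target_allows_interrupt,
        // PARK/UNPARK mutate scheduler state and stay out of timer polling.
        (Context::TimerPoll, Op::Park | Op::Unpark) => false,
        // Remaining non-CALL operations may run from timer polling.
        (Context::TimerPoll, Op::Other) => true,
    }
}
```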
Design
CapSqe is a fixed 64-byte ABI record. CAP_OP_CALL names a local cap-table
slot and method ID plus parameter/result buffers. CAP_OP_RECV and
CAP_OP_RETURN implement endpoint IPC. CAP_OP_RETURN normally returns
successful result bytes to the original caller; with
CAP_SQE_RETURN_APPLICATION_EXCEPTION, its payload is a serialized
CapException and the original caller completes with
CAP_ERR_APPLICATION_EXCEPTION or the truncated application-exception code.
CAP_OP_RELEASE removes a local cap-table slot through the transport.
CAP_OP_CANCEL (opcode 6) cancels a pending endpoint receive posted by the
same process on the same endpoint cap; pipeline_dep carries the receive
SQE’s user_data. CAP_OP_NOP measures the fixed ring path.
CAP_OP_PARK_BENCH (opcode 7) is a measurement-only compact opcode dispatched
only by kernels built with the measure feature; normal kernels reject it as
malformed. CAP_OP_FINISH is ABI-reserved and currently returns
CAP_ERR_UNSUPPORTED_OPCODE.
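A minimal sketch of the opcode admission rules in this paragraph, assuming the opcode has already been decoded into an enum. The `Opcode` and `Admission` types are hypothetical, and no numeric opcode or error values are modeled. The text does not say which error an unknown opcode produces, only that it fails closed, so the sketch gives it its own variant.

```rust
// Hypothetical admission check; not the kernel's validation code.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Opcode {
    Call,
    Recv,
    Return,
    Release,
    Cancel,
    Nop,
    ParkBench,
    Finish,
    Unknown(u16),
}

#[derive(Debug, PartialEq)]
enum Admission {
    Dispatch,
    /// Rejected as malformed.
    Malformed,
    /// Corresponds to CAP_ERR_UNSUPPORTED_OPCODE in the text.
    UnsupportedOpcode,
    /// Fail-closed rejection; the exact error code is not specified in the text.
    UnknownRejected,
}

fn admit(op: Opcode, measure_build: bool) -> Admission {
    match op {
        // Measurement-only opcode: normal kernels reject it as malformed.
        Opcode::ParkBench if !measure_build => Admission::Malformed,
        // ABI-reserved, not silently accepted.
        Opcode::Finish => Admission::UnsupportedOpcode,
        Opcode::Unknown(_) => Admission::UnknownRejected,
        _ => Admission::Dispatch,
    }
}
```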
CAP_OP_RELEASE is deliberately scoped to local transport cleanup. It removes
one holder’s cap-table slot after the SQE is processed, or as part of process
exit cleanup; it does not revoke peer-held caps, cancel delegated authority, or
stand in for an application close method. Services that need security-visible
invalidation must use an explicit control path such as CapabilityManager.revoke,
session expiry, object epochs, or a service-specific close/revoke protocol.
Reviewers should treat claims based only on handle drop, RAII, GC finalizers, or
queued release flushing as local-cleanup claims, not revocation claims.
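The difference between local cleanup and revocation can be shown with a toy model in which two holders reference the same object. The `CapTable` and `Object` types below are invented for the illustration and do not reflect the kernel's cap-table representation.

```rust
use std::collections::HashMap;
use std::rc::Rc;

// Hypothetical stand-in for a kernel object reachable through capabilities.
struct Object;

// Hypothetical per-holder cap table: slot number -> shared object reference.
#[derive(Default)]
struct CapTable {
    slots: HashMap<u32, Rc<Object>>,
}

impl CapTable {
    /// Models CAP_OP_RELEASE: drop this holder's slot and nothing else.
    fn release(&mut self, slot: u32) -> bool {
        self.slots.remove(&slot).is_some()
    }

    fn holds(&self, slot: u32) -> bool {
        self.slots.contains_key(&slot)
    }
}
```

Releasing a slot in one table leaves every other table untouched, which is why a release cannot serve as evidence that a peer has lost access.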
Opcode boundary: Ring opcodes are kernel ABI, not a loophole around the syscall surface. cap_enter and exit remain the CPU trap entrypoints, but every accepted authority-bearing or resource-mutating CAP_OP_* still adds distinct kernel semantics that must pass the capability method / ring opcode / syscall decision graph. No-authority diagnostics such as CAP_OP_NOP are still kernel ABI and must stay side-effect-free and review-visible, but they are not resource authority paths. CAP_OP_PARK and CAP_OP_UNPARK are justified because blocking wait mutates scheduler state, must be thread-owned on the process ring, reserves completion credit for later wake/timeout delivery, and needs compact capability-authorized hot-path framing. They are not a precedent for moving ordinary object methods into the opcode table for convenience.
CAP_OP_CALL may set CAP_SQE_THREAD_OWNED with call_id equal to the owning
thread id. If another thread drains the shared process ring first, the kernel
leaves that SQE at the head instead of consuming it and returns a distinct
owner-head cap_enter result instead of blocking the non-owner behind it. This
is limited to context-sensitive self-thread operations such as
ThreadControl.exitThread; ordinary runtime submissions leave call_id = 0.
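The owner-head behavior can be sketched as a drain loop that stops at the first SQE owned by a different thread. The `Sqe` fields and the `DrainResult` enum are hypothetical; the real cap_enter return encoding is not described here.

```rust
use std::collections::VecDeque;

// Hypothetical subset of an SQE: the thread-owned flag and its owner id.
#[derive(Clone, Copy)]
struct Sqe {
    thread_owned: bool, // models CAP_SQE_THREAD_OWNED
    call_id: u32,       // owning thread id when thread_owned, else 0
    user_data: u64,
}

#[derive(Debug, PartialEq)]
enum DrainResult {
    /// The queue was drained; the payload is the number of SQEs consumed.
    Drained(usize),
    /// The head SQE belongs to another thread and was left in place.
    OwnerHead { owner: u32, drained: usize },
}

fn drain(sq: &mut VecDeque<Sqe>, current_thread: u32) -> DrainResult {
    let mut drained = 0;
    while let Some(head) = sq.front().copied() {
        if head.thread_owned && head.call_id != current_thread {
            // Do not consume: report a distinct result instead of blocking.
            return DrainResult::OwnerHead { owner: head.call_id, drained };
        }
        sq.pop_front();
        drained += 1;
    }
    DrainResult::Drained(drained)
}
```

When the owning thread later enters the kernel, the same loop consumes the head SQE and continues with whatever is queued behind it.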
CAP_OP_PARK and CAP_OP_UNPARK are compact capability-authorized
operations for process-local ParkSpace. Wait SQEs must set
CAP_SQE_THREAD_OWNED with call_id equal to the owning thread id; a
non-owner cap_enter leaves the SQE at the head just like a thread-owned CALL.
They reject promise-pipeline fields and run only from syscall-context ring
dispatch, not timer polling. A blocking wait consumes the SQE but posts no
caller CQE immediately; instead it reserves one waiter CQE credit, parks the
current thread, and later completes with a non-negative park status. Ordinary
CQE posting treats reserved park credits as unavailable so wake and timeout
delivery cannot lose waiter completions.
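The credit accounting can be modeled with counters. This is a sketch under the assumption that reservation is a simple count against CQ capacity; the `CompletionQueue` type and its method names are hypothetical and ignore the actual ring slots.

```rust
// Hypothetical counter model of CQ space with reserved park credits.
struct CompletionQueue {
    capacity: u32,
    queued: u32,        // CQEs posted and not yet consumed
    reserved_park: u32, // credits held for parked waiters
}

impl CompletionQueue {
    /// Space an ordinary completion may use: reserved credits are excluded.
    fn free_for_ordinary(&self) -> u32 {
        self.capacity - self.queued - self.reserved_park
    }

    /// A blocking wait reserves one credit before parking the thread.
    fn reserve_park_credit(&mut self) -> bool {
        if self.free_for_ordinary() == 0 {
            return false;
        }
        self.reserved_park += 1;
        true
    }

    /// Ordinary CQE posting must not consume reserved credits.
    fn post_ordinary(&mut self) -> bool {
        if self.free_for_ordinary() == 0 {
            return false;
        }
        self.queued += 1;
        true
    }

    /// Wake or timeout: convert a reserved credit into a posted CQE.
    fn complete_park(&mut self) -> bool {
        if self.reserved_park == 0 {
            return false;
        }
        self.reserved_park -= 1;
        self.queued += 1;
        true
    }
}
```

Since `queued + reserved_park` never exceeds `capacity`, a wake or timeout always finds room for the waiter's completion.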
The kernel copies user params into preallocated per-process scratch, dispatches
capability methods, copies serialized results into caller-provided result
buffers, and posts CapCqe. Current-process user copies and transfer-descriptor
loads hold the caller’s AddressSpace mutex across permission validation and
the actual HHDM-backed copy/read. A successful method returns non-negative bytes
written. Transport failures are negative CAP_ERR_* codes. Application
exceptions are serialized CapException payloads with
CAP_ERR_APPLICATION_EXCEPTION. Ordinary capability implementation errors and
live endpoint CALL/RETURN target errors use this application-exception path
once a valid target cap or accepted endpoint relationship has been identified;
malformed ring metadata, bad user buffers, lookup failures, and endpoint
rollback/transfer failures stay in the transport namespace.
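On the userspace side, the three outcomes could be separated along the following lines. The signed 32-bit result type and the numeric value of `CAP_ERR_APPLICATION_EXCEPTION` below are placeholders chosen for the example, not the real ABI values, and the truncated application-exception code is not modeled.

```rust
// PLACEHOLDER value for illustration only; not the real ABI constant.
const CAP_ERR_APPLICATION_EXCEPTION: i32 = -1000;

#[derive(Debug, PartialEq)]
enum Completion {
    /// Non-negative result: number of bytes written to the result buffer.
    Success { bytes_written: u32 },
    /// The result buffer holds a serialized CapException payload.
    ApplicationException,
    /// Any other negative CAP_ERR_* code is a transport failure.
    TransportError(i32),
}

fn classify(result: i32) -> Completion {
    if result >= 0 {
        Completion::Success { bytes_written: result as u32 }
    } else if result == CAP_ERR_APPLICATION_EXCEPTION {
        Completion::ApplicationException
    } else {
        Completion::TransportError(result)
    }
}
```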
Transfer-bearing CALL and RETURN SQEs pack CapTransferDescriptor records
after the params/result payload. Successful result-cap transfers append
CapTransferResult records after normal result bytes.
Promise-pipelined CALLs remain rejected by current kernels. Once the pipelining
flag is accepted, pipeline_dep names a process-local promised-answer identifier, and
pipeline_field selects a zero-based CapTransferResult record from that
answer’s completion. It is not a Cap’n Proto schema field number or payload
path. The kernel resolves dependencies only through the sideband result-cap
records it already owns; normal result bytes stay opaque to the transport.
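The intended lookup is an index into sideband records, not a walk over message contents. A minimal sketch, assuming the kernel keeps a per-process map from answer identifier to that answer's result-cap records; the `Answers` type and the `cap_id` field are hypothetical.

```rust
use std::collections::HashMap;

// Hypothetical shape of one sideband result-cap record.
#[derive(Clone, Copy, Debug, PartialEq)]
struct CapTransferResult {
    cap_id: u32,
}

// Hypothetical per-process table: promised-answer id -> its result-cap records.
#[derive(Default)]
struct Answers {
    by_id: HashMap<u64, Vec<CapTransferResult>>,
}

impl Answers {
    /// pipeline_dep selects the answer; pipeline_field is a zero-based ordinal
    /// into that answer's CapTransferResult records. Result bytes are never read.
    fn resolve(&self, pipeline_dep: u64, pipeline_field: u16) -> Option<CapTransferResult> {
        self.by_id
            .get(&pipeline_dep)?
            .get(pipeline_field as usize)
            .copied()
    }
}
```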
Future behavior should use the reserved SQE fields for system transport features, not ad hoc per-interface extensions.
Choosing A Capability Method, Ring Opcode, Or Syscall
New kernel functionality should default to a normal typed capability method. The small syscall surface is only the trap surface; the ring opcode table is also a reviewed kernel ABI and must stay narrow. The decision tree below is a full-page reference in the PDF because the branches are easier to read at diagram scale than as compressed prose.
flowchart TD
Start[New kernel-visible operation] --> Ambient{Must it run without any held capability?}
Ambient -- yes --> Trap{Is it process lifecycle or kernel-entry control?}
Trap -- yes --> Syscall[Consider a syscall]
Trap -- no --> RejectAmbient[Reject or redesign around explicit authority]
Ambient -- no --> CapMethod{Can it be expressed as a typed object method?}
CapMethod -- no --> Redesign[Redesign the authority object or transport contract]
CapMethod -- yes --> Hot{Is generic Cap'n Proto CALL materially wrong?}
Hot -- no --> Method[Use CAP_OP_CALL to a capability method]
Hot -- yes --> RingSpecific{Does it need ring/scheduler-specific semantics?}
RingSpecific -- no --> Method
RingSpecific -- yes --> Stable{Is the compact SQE/CQE ABI stable and capability-authorized?}
Stable -- no --> MethodOrDesign[Keep a capability method or write a reviewed design first]
Stable -- yes --> Opcode[Consider a new CAP_OP_* opcode]
Use a normal capability method when the operation is control plane, policy driven, service-specific, infrequent, or naturally represented by Cap’n Proto params/results. Process spawning, credential checks, storage naming, shell or network policy, virtual-memory control-plane calls, and most device-specific commands belong here unless measurement and design review prove otherwise.
Consider a compact ring opcode only when all of these are true:
- The operation is a hot path or scheduler path where generic Cap’n Proto framing is materially wrong.
- The operation has a small, stable field layout that fits the existing SQE/CQE model without per-interface ad hoc extensions.
- It needs ring-specific behavior such as thread ownership, reserved completion credit, CQ ordering/backpressure, asynchronous completion delivery, or interaction with the process ring head.
- It remains authorized by a held capability in cap_id, not by ambient process identity or guessed kernel object names.
- It cannot be handled as a normal capability method plus a future generated fast client without losing an essential scheduler or transport invariant.
Consider a new syscall only when the operation is about entering or leaving the kernel execution context itself and cannot sensibly be authorized by a capability already available to the process. That bar is intentionally higher than the opcode bar. Ordinary resource operations should not become syscalls just because they are common.
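The branches of the decision flowchart can be written as a single function, which may help when checking a proposal against it. The `Proposal` fields and `Decision` variants are names invented for this sketch; each field corresponds to one question in the flowchart.

```rust
// One boolean per question in the flowchart; all names are hypothetical.
#[derive(Clone, Copy)]
struct Proposal {
    needs_no_capability: bool,
    lifecycle_or_entry_control: bool,
    typed_method_possible: bool,
    generic_call_materially_wrong: bool,
    needs_ring_semantics: bool,
    compact_abi_stable_and_authorized: bool,
}

#[derive(Debug, PartialEq)]
enum Decision {
    ConsiderSyscall,
    RejectAmbient,
    Redesign,
    CapabilityMethod,
    MethodOrDesignFirst,
    ConsiderOpcode,
}

fn decide(p: Proposal) -> Decision {
    if p.needs_no_capability {
        // Ambient operations are only acceptable as kernel-entry control.
        return if p.lifecycle_or_entry_control {
            Decision::ConsiderSyscall
        } else {
            Decision::RejectAmbient
        };
    }
    if !p.typed_method_possible {
        return Decision::Redesign;
    }
    // Default: a typed capability method invoked through CAP_OP_CALL.
    if !p.generic_call_materially_wrong || !p.needs_ring_semantics {
        return Decision::CapabilityMethod;
    }
    if p.compact_abi_stable_and_authorized {
        Decision::ConsiderOpcode
    } else {
        Decision::MethodOrDesignFirst
    }
}
```

Note that four of the six outcomes do not add kernel ABI at all, which matches the stated default of a normal typed capability method.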
Full-SMP Direction
The current process-wide ring is not the target ABI for full SMP. Once sibling threads in one process can run on different CPUs, a shared process CQ would force userspace to serialize completion consumption or the kernel to invent specific-wait state on top of circular-buffer slots.
The selected future direction is per-thread ring ownership, documented in
Ring v2 For Full SMP. In that model,
cap_enter(min_complete, timeout_ns) keeps its current aggregate wait shape,
but the aggregate is the current thread’s CQ. Completion paths post by
generation-checked ThreadRef, while result-cap transfers and authority still
belong to the process cap table.
The first Ring v2 implementation should use kernel-chosen child-thread ring
mappings. The initial fixed RING_VADDR mapping becomes a compatibility
special case backed by the same RingEndpoint lifetime and waiter rules as
child-thread rings. Runtime-supplied ring address ranges are deferred until
VirtualMemory can reserve a ring arena without racing ordinary mappings.
The initial Phase C multi-CPU scheduler proof may continue to use the current process-wide ring as long as userspace serializes ring consumption. Ring v2 is the target for full SMP with sibling threads from one process running and waiting independently on different CPUs.
A runtime reactor can bridge the current process-wide ring for multithreaded
runtimes before Ring v2: one runtime-owned drainer consumes the process CQ,
matches completions by user_data, and wakes waiting threads through
ParkSpace. That bridge is not the full-SMP kernel ABI.
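A rough shape for such a bridge is sketched below, with a standard-library channel standing in for the ParkSpace wake. The `Reactor` type, its methods, and the `Cqe` fields are hypothetical; a real runtime would read CQEs from the ring page and wake threads through ParkSpace rather than through `std::sync::mpsc`.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

// Hypothetical subset of a CQE.
struct Cqe {
    user_data: u64,
    result: i32,
}

// Single runtime-owned drainer: routes each completion to its waiter.
#[derive(Default)]
struct Reactor {
    waiters: HashMap<u64, Sender<i32>>,
}

impl Reactor {
    /// A submitting thread registers interest before it waits.
    fn register(&mut self, user_data: u64) -> Receiver<i32> {
        let (tx, rx) = channel();
        self.waiters.insert(user_data, tx);
        rx
    }

    /// Consume completions, match them by user_data, and wake the waiters.
    /// Returns how many waiters were woken.
    fn drain(&mut self, cq: impl IntoIterator<Item = Cqe>) -> usize {
        let mut woken = 0;
        for cqe in cq {
            if let Some(tx) = self.waiters.remove(&cqe.user_data) {
                // The channel send stands in for a ParkSpace unpark.
                if tx.send(cqe.result).is_ok() {
                    woken += 1;
                }
            }
        }
        woken
    }
}
```

Only the drainer touches the completion queue, so consumption stays serialized while any number of threads wait on their own receivers.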
Invariants
- SQ and CQ sizes are powers of two and fixed by the ABI.
- Unknown opcodes fail closed; FINISH is reserved, not silently accepted.
- Reserved fields must be zero for currently implemented opcodes, except CAP_SQE_THREAD_OWNED CALL and PARK SQEs may carry the owning thread id in call_id.
- Park PARK/UNPARK SQEs must keep unsupported fields zero and must not be dispatched from timer context.
- cap_enter rejects min_complete > CQ_ENTRIES.
- User-buffer validation and copy/read must hold the owning process AddressSpace mutex for CALL params/results, RECV result buffers, RETURN payloads, transfer descriptors, and deferred same-process completions.
- Timer dispatch must not run capabilities that allocate, block on locks, or mutate page tables unless the cap explicitly opts in.
- Per-dispatch SQE processing is bounded by SQ_ENTRIES.
- Transfer descriptors must be aligned, valid, and bounded by MAX_TRANSFER_DESCRIPTORS.
- Promise-pipelined dependency resolution must use sideband CapTransferResult ordinals, never general Cap’n Proto result traversal in the kernel.
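A few of these invariants are simple enough to express as checks. The sketch below takes the queue sizes from the Current Behavior section. The descriptor size, alignment, and maximum count are passed as parameters because their real values are not given here, and treating alignment as a property of the offset within the buffer is an assumption.

```rust
const SQ_ENTRIES: u32 = 16; // from the text
const CQ_ENTRIES: u32 = 32; // from the text

// Compile-time check that both queue sizes are powers of two.
const _: () = assert!(SQ_ENTRIES.is_power_of_two() && CQ_ENTRIES.is_power_of_two());

/// cap_enter rejects min_complete > CQ_ENTRIES.
fn min_complete_ok(min_complete: u32) -> bool {
    min_complete <= CQ_ENTRIES
}

/// Hypothetical transfer-region check: descriptors must be aligned, bounded in
/// count, and lie entirely inside the user buffer. All sizes are parameters.
fn transfer_region_ok(
    offset: usize,
    count: usize,
    buf_len: usize,
    desc_size: usize,
    desc_align: usize,
    max_descriptors: usize,
) -> bool {
    if count > max_descriptors || desc_align == 0 || offset % desc_align != 0 {
        return false;
    }
    // Overflow-checked end offset of the descriptor array.
    match count.checked_mul(desc_size).and_then(|bytes| offset.checked_add(bytes)) {
        Some(end) => end <= buf_len,
        None => false,
    }
}
```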
Code Map
- capos-config/src/ring.rs - shared ring ABI, opcodes, errors, SQE/CQE structs, endpoint message headers, transfer records.
- kernel/src/cap/ring.rs - kernel dispatcher, SQE validation, CQE posting, cap calls, endpoint CALL/RECV/RETURN, release, transfer framing.
- kernel/src/arch/x86_64/syscall.rs - cap_enter syscall.
- kernel/src/sched.rs - timer polling, cap-enter blocking, direct IPC wake.
- kernel/src/process.rs - ring page allocation and mapping.
- capos-rt/src/ring.rs - runtime ring client, pending calls, transfer packing, result-cap parsing.
- capos-rt/src/entry.rs - single-owner runtime ring client token and release queue flushing.
- capos-config/tests/ring_loom.rs - bounded producer/consumer model.
Validation
- cargo test-ring-loom validates SQ/CQ producer-consumer behavior, capacity, FIFO, CQ overflow/drop behavior, and corrupted SQ recovery.
- make run exercises Console CALLs, reserved opcode rejection, ring corruption recovery, NOP, fairness, transfers, and endpoint IPC.
- make run-measure exercises measurement-only counters, dispatch segment cycle summaries, the NullCap baseline, the ParkBench compact-versus-generic comparison, and the real ParkSpace blocked/resume timing path.
- cargo test-config covers shared ring layout and helper invariants.
- make capos-rt-check checks userspace runtime ring code under the bare-metal target.
Open Work
- Implement CAP_OP_FINISH as part of the system Cap’n Proto transport.
- Implement promise pipelining using the reserved pipeline_dep answer ID and pipeline_field result-cap ordinal mapping.
- Define LINK, DRAIN, and MULTISHOT semantics before accepting those flags.
- Add runtime-level ParkSpace wrappers and completion demultiplexing on top of the compact opcodes.
- Add the runtime reactor bridge for multithreaded use of the current process ring, then replace it as the kernel fast path with per-thread Ring v2 completion ownership.
- Add SQPOLL after SMP gives the kernel a spare execution context.