Capability Ring

The capability ring is the userspace-to-kernel transport for capability invocation. It avoids one syscall per operation while preserving a typed Cap’n Proto method boundary and explicit completion reporting.

The current error model is documented in Error Handling. Ring CQE status values report transport failures; typed capability exceptions and ordinary schema result unions sit above that transport layer.

Current Behavior

Each non-idle process gets one 4 KiB ring page mapped at RING_VADDR. The page contains a volatile header, a 16-entry submission queue, and a 32-entry completion queue. Userspace writes CapSqe records, advances sq_tail, and uses cap_enter(min_complete, timeout_ns) to make ordinary calls progress.

sequenceDiagram
    participant U as Userspace runtime
    participant R as Ring page
    participant K as Kernel ring dispatcher
    participant C as Capability object
    U->>R: write CapSqe and advance sq_tail
    U->>K: cap_enter(min_complete, timeout_ns)
    K->>R: read sq_head..sq_tail
    K->>K: validate SQE fields and lock AddressSpace for user buffers
    K->>C: call method or endpoint operation
    C-->>K: completion, pending, or error
    K->>R: write CapCqe and advance cq_tail
    K-->>U: return available CQE count
    U->>R: read matching CapCqe
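The submit, drain, and reap steps in the sequence above can be modeled in ordinary Rust. This is a minimal sketch, not the real capos-config layout: the `Ring`, `Sqe`, and `Cqe` types, the `cq_head` field, the `i32` result width, the free-running masked indices, and the "stop draining when the CQ is full" policy are all assumptions made for illustration. Only the 16/32 entry counts and the `sq_head`, `sq_tail`, `cq_tail`, and `user_data` names come from the text.

```rust
// Toy model of the ring page; not the real capos-config layout.
const SQ_ENTRIES: u32 = 16; // from the text: 16-entry submission queue
const CQ_ENTRIES: u32 = 32; // from the text: 32-entry completion queue

#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Sqe {
    user_data: u64, // the real CapSqe is a 64-byte record with more fields
}

#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Cqe {
    user_data: u64,
    result: i32, // assumed width
}

struct Ring {
    sq: [Sqe; SQ_ENTRIES as usize],
    cq: [Cqe; CQ_ENTRIES as usize],
    // Assumed free-running indices, masked on access.
    sq_head: u32,
    sq_tail: u32,
    cq_head: u32,
    cq_tail: u32,
}

impl Ring {
    fn new() -> Self {
        Ring {
            sq: [Sqe::default(); SQ_ENTRIES as usize],
            cq: [Cqe::default(); CQ_ENTRIES as usize],
            sq_head: 0,
            sq_tail: 0,
            cq_head: 0,
            cq_tail: 0,
        }
    }

    /// Userspace side: publish one SQE with a plain memory write.
    fn submit(&mut self, sqe: Sqe) -> bool {
        if self.sq_tail.wrapping_sub(self.sq_head) == SQ_ENTRIES {
            return false; // submission queue full
        }
        self.sq[(self.sq_tail & (SQ_ENTRIES - 1)) as usize] = sqe;
        self.sq_tail = self.sq_tail.wrapping_add(1);
        true
    }

    /// Stand-in for the kernel half of cap_enter: drain sq_head..sq_tail,
    /// post one CQE per SQE, and return the available CQE count.
    fn drain(&mut self) -> u32 {
        while self.sq_head != self.sq_tail {
            if self.cq_tail.wrapping_sub(self.cq_head) == CQ_ENTRIES {
                break; // assumed policy: stop when the CQ is full
            }
            let sqe = self.sq[(self.sq_head & (SQ_ENTRIES - 1)) as usize];
            let cqe = Cqe { user_data: sqe.user_data, result: 0 };
            self.cq[(self.cq_tail & (CQ_ENTRIES - 1)) as usize] = cqe;
            self.sq_head = self.sq_head.wrapping_add(1);
            self.cq_tail = self.cq_tail.wrapping_add(1);
        }
        self.cq_tail.wrapping_sub(self.cq_head)
    }

    /// Userspace side: consume the oldest completion, if any.
    fn reap(&mut self) -> Option<Cqe> {
        if self.cq_head == self.cq_tail {
            return None;
        }
        let cqe = self.cq[(self.cq_head & (CQ_ENTRIES - 1)) as usize];
        self.cq_head = self.cq_head.wrapping_add(1);
        Some(cqe)
    }
}
```

Because both sizes are powers of two, masking with `ENTRIES - 1` replaces a modulo, and the head/tail difference stays correct across index wraparound.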

Timer polling also processes each current process’s ring before preemption, but only non-CALL operations and CALL targets that explicitly allow interrupt dispatch may run there. Ordinary CALLs wait for cap_enter.

Why ordinary CALL waits for cap_enter: Submitting a CALL SQE is only a shared-memory write. The kernel still needs a safe execution point to drain the ring and run capability code. Timer polling runs in interrupt context, so it must not execute arbitrary capability methods that may allocate, block on locks, mutate page tables, spawn processes, parse Cap’n Proto messages, or perform IPC side effects. cap_enter is the normal process-context drain point: it processes pending SQEs, posts CQEs, and then either returns the available completion count or blocks until enough completions arrive. The design keeps SQE publication syscall-free and batchable, keeps the syscall ABI limited to exit and cap_enter, and avoids turning the timer interrupt into a general capability executor. A future SQPOLL-style path can remove the explicit syscall from the hot path only by running dispatch in a worker context, not from arbitrary timer interrupt execution.
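The gating rule can be sketched as a pure predicate. The `Op` and `DispatchContext` types, the `target_allows_interrupt` flag, and the function name are hypothetical, and only a subset of opcodes is modeled. Treating NOP as timer-safe is an assumption read from "only non-CALL operations ... may run there"; PARK, UNPARK, and MINT_SERVICE_FACET are excluded because this page states elsewhere that they are not dispatched from timer context.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum DispatchContext {
    Timer,    // interrupt context: timer polling before preemption
    CapEnter, // process context: the normal drain point
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Op {
    /// The flag models a CALL target that explicitly allows interrupt dispatch.
    Call { target_allows_interrupt: bool },
    Nop,
    Park,
    Unpark,
    MintServiceFacet,
}

/// Returns true if this sketch would let `op` run in `ctx`.
fn may_dispatch(op: Op, ctx: DispatchContext) -> bool {
    match ctx {
        // cap_enter is the general executor for every modeled opcode.
        DispatchContext::CapEnter => true,
        DispatchContext::Timer => match op {
            Op::Call { target_allows_interrupt } => target_allows_interrupt,
            // Documented as syscall-context only.
            Op::Park | Op::Unpark | Op::MintServiceFacet => false,
            // Assumed timer-safe: no capability code runs.
            Op::Nop => true,
        },
    }
}
```

An ordinary CALL that returns `false` here is not an error; it stays queued until the process reaches `cap_enter`.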

Design

CapSqe is a fixed 64-byte ABI record. Its opcodes are:

CAP_OP_CALL names a local cap-table slot and method ID plus parameter/result buffers.
CAP_OP_RECV and CAP_OP_RETURN implement endpoint IPC. CAP_OP_RETURN normally returns successful result bytes to the original caller; with CAP_SQE_RETURN_APPLICATION_EXCEPTION, its payload is a serialized CapException and the original caller completes with CAP_ERR_APPLICATION_EXCEPTION or the truncated application-exception code.
CAP_OP_RELEASE removes a local cap-table slot through the transport.
CAP_OP_CANCEL (opcode 6) cancels a pending endpoint receive posted by the same process on the same endpoint cap; pipeline_dep carries the receive SQE’s user_data.
CAP_OP_NOP measures the fixed ring path.
CAP_OP_PARK_BENCH (opcode 7) is a measurement-only compact opcode dispatched only by kernels built with the measure feature; normal kernels reject it as malformed.
CAP_OP_MINT_SERVICE_FACET (opcode 10) mints a client-only service-object facet locally from an endpoint owner cap; it is detailed below.
CAP_OP_FINISH is ABI-reserved and currently returns CAP_ERR_UNSUPPORTED_OPCODE.

CAP_OP_MINT_SERVICE_FACET names an endpoint owner cap in cap_id and stamps an owner-selected interface id (addr) and receiver cookie (pipeline_dep) onto a fresh ClientEndpoint facet, which the kernel inserts into the caller’s own cap-table and returns as a single CapTransferResult at result_addr – the same result-cap shape a CALL yields. This is the runtime expression of the same authority the spawn serviceObject grant already exercises: an endpoint owner may hand out client facets of its endpoint.

Before this opcode the only way to mint one was as a side effect of ProcessSpawner.spawn, which forced services that mint per event (for example the network stack, one facet per accepted socket) to create a helper process and quiesce shared endpoint traffic across the blocking spawn round trip. The local opcode removes both: it is a bounded cap-table operation with no process creation and no cross-process round trip, so no quiesce window is needed to fence it.

The mint is owner-only and fails closed for a non-owner source (a client/result/delegated facet, including a ProcessSpawnerEndpointResult cap, whose is_endpoint_owner() is false), a stale or missing cap_id, a too-small or unwritable result buffer, or a cap-table/heap exhaustion; on any failure no facet is installed. The minted facet carries no serve authority (CAP_OP_RECV/CAP_OP_RETURN stay reserved to the owner), and its transfer scope is inherited from the owner’s hold and never widened.

Unlike the spawn serviceObject grant – which also lets a non-owner pass on a Copy facet it already holds to a spawned child – this opcode gates on ownership directly and admits no pass-on case: the caller mints into its own table, so honoring pass-on would let any Copy-facet holder re-mint an identical copy without owning the endpoint. It still computes the child hold through capos-lib’s resolve_endpoint_facet_delegation / resolve_service_object_facet (with the owner already established), so the facet’s shape cannot drift from the spawn path.
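The fail-closed, all-or-nothing shape of the mint can be illustrated with a toy cap table. Everything named here is hypothetical: `CapTable`, `Cap`, `MintError`, the `RESULT_RECORD_LEN` value, and the `post_insert_ok` flag that stands in for any allocation that may fail after the facet is inserted. The real kernel path goes through capos-lib and is not reproduced.

```rust
const RESULT_RECORD_LEN: usize = 16; // hypothetical size of one CapTransferResult

#[derive(Debug, PartialEq)]
enum MintError {
    NotOwner,
    StaleCap,
    BadResultBuffer,
    Exhausted,
}

#[derive(Clone, Debug, PartialEq)]
enum Cap {
    EndpointOwner,
    ClientFacet { interface_id: u64, cookie: u64 },
}

struct CapTable {
    slots: Vec<Option<Cap>>, // fixed length; None marks a free slot
}

impl CapTable {
    /// Mint a client facet from the owner cap in `cap_id`; returns the new slot.
    fn mint_service_facet(
        &mut self,
        cap_id: usize,
        interface_id: u64,
        cookie: u64,
        result_buf_len: usize,
        post_insert_ok: bool, // simulates a later allocation succeeding
    ) -> Result<usize, MintError> {
        // 1. The source must be a live endpoint owner cap.
        match self.slots.get(cap_id).and_then(|s| s.as_ref()) {
            Some(Cap::EndpointOwner) => {}
            Some(Cap::ClientFacet { .. }) => return Err(MintError::NotOwner),
            None => return Err(MintError::StaleCap),
        }
        // 2. Validate the result buffer before mutating any state.
        if result_buf_len < RESULT_RECORD_LEN {
            return Err(MintError::BadResultBuffer);
        }
        // 3. Insert the facet into the caller's own table.
        let slot = self
            .slots
            .iter()
            .position(|s| s.is_none())
            .ok_or(MintError::Exhausted)?;
        self.slots[slot] = Some(Cap::ClientFacet { interface_id, cookie });
        // 4. Roll the insert back if a post-insert step fails.
        if !post_insert_ok {
            self.slots[slot] = None;
            return Err(MintError::Exhausted);
        }
        Ok(slot)
    }
}
```

The ordering is the point of the sketch: ownership and buffer checks run before any mutation, and the only mutation is undone on a later failure, so a caller never observes a half-installed facet.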

CAP_OP_RELEASE is deliberately scoped to local transport cleanup. It removes one holder’s cap-table slot after the SQE is processed, or as part of process exit cleanup; it does not revoke peer-held caps, cancel delegated authority, or stand in for an application close method. Services that need security-visible invalidation must use an explicit control path such as CapabilityManager.revoke, session expiry, object epochs, or a service-specific close/revoke protocol. Reviewers should treat claims based only on handle drop, RAII, GC finalizers, or queued release flushing as local-cleanup claims, not revocation claims.

Opcode boundary: Ring opcodes are kernel ABI, not a loophole around the syscall surface. cap_enter and exit remain the CPU trap entrypoints, but every accepted authority-bearing or resource-mutating CAP_OP_* still adds distinct kernel semantics that must pass the capability method / ring opcode / syscall decision graph. No-authority diagnostics such as CAP_OP_NOP are still kernel ABI and must stay side-effect-free and review-visible, but they are not resource authority paths. CAP_OP_PARK and CAP_OP_UNPARK are justified because blocking wait mutates scheduler state, must be thread-owned on the process ring, reserves completion credit for later wake/timeout delivery, and needs compact capability-authorized hot-path framing. They are not a precedent for moving ordinary object methods into the opcode table for convenience.

CAP_OP_CALL may set CAP_SQE_THREAD_OWNED with call_id equal to the owning thread id. If another thread drains the shared process ring first, the kernel leaves that SQE at the head instead of consuming it and returns a distinct owner-head cap_enter result instead of blocking the non-owner behind it. This is limited to context-sensitive self-thread operations such as ThreadControl.exitThread; ordinary runtime submissions leave call_id = 0.
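A minimal sketch of the head check, assuming a hypothetical bit position for CAP_SQE_THREAD_OWNED, a reduced `Sqe` with only `flags` and `call_id`, and an illustrative `HeadDecision` type in place of the real cap_enter result codes:

```rust
const CAP_SQE_THREAD_OWNED: u32 = 1 << 0; // hypothetical bit position

struct Sqe {
    flags: u32,
    call_id: u32, // owning thread id when THREAD_OWNED is set, else 0
}

#[derive(Debug, PartialEq)]
enum HeadDecision {
    /// The draining thread may consume and dispatch the SQE.
    Consume,
    /// Leave the SQE at the head and report the owner-head result.
    LeaveForOwner,
}

fn head_decision(sqe: &Sqe, current_thread: u32) -> HeadDecision {
    let thread_owned = (sqe.flags & CAP_SQE_THREAD_OWNED) != 0;
    if thread_owned && sqe.call_id != current_thread {
        HeadDecision::LeaveForOwner
    } else {
        HeadDecision::Consume
    }
}
```

Ordinary submissions leave the flag clear and `call_id = 0`, so any draining thread consumes them.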

CAP_OP_PARK and CAP_OP_UNPARK are compact capability-authorized operations for process-local ParkSpace. Wait SQEs must set CAP_SQE_THREAD_OWNED with call_id equal to the owning thread id; a non-owner cap_enter leaves the SQE at the head just like a thread-owned CALL. They reject promise-pipeline fields and run only from syscall-context ring dispatch, not timer polling. A blocking wait consumes the SQE but posts no caller CQE immediately; instead it reserves one waiter CQE credit, parks the current thread, and later completes with a non-negative park status. Ordinary CQE posting treats reserved park credits as unavailable so wake and timeout delivery cannot lose waiter completions.
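The credit rule can be shown with simple counters. This is a sketch under assumptions: `CqAccounting` and its fields are hypothetical names, and the model tracks only occupancy, not the CQ slots themselves.

```rust
const CQ_ENTRIES: u32 = 32; // from the text: 32-entry completion queue

struct CqAccounting {
    posted: u32,        // CQEs posted and not yet consumed by userspace
    reserved_park: u32, // credits held for parked waiters
}

impl CqAccounting {
    /// Slots an ordinary completion may use; reserved credits do not count.
    fn ordinary_free(&self) -> u32 {
        CQ_ENTRIES - self.posted - self.reserved_park
    }

    fn post_ordinary(&mut self) -> bool {
        if self.ordinary_free() == 0 {
            return false; // back-pressure instead of stealing a waiter's slot
        }
        self.posted += 1;
        true
    }

    /// A blocking park wait reserves one credit and posts nothing yet.
    fn reserve_park(&mut self) -> bool {
        if self.ordinary_free() == 0 {
            return false;
        }
        self.reserved_park += 1;
        true
    }

    /// Wake or timeout: turn one reserved credit into a posted CQE.
    fn complete_park(&mut self) -> bool {
        if self.reserved_park == 0 {
            return false;
        }
        self.reserved_park -= 1;
        self.posted += 1;
        true
    }
}
```

Since `posted + reserved_park` never exceeds `CQ_ENTRIES`, converting a credit into a CQE always finds room, which is why wake and timeout delivery cannot be dropped.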

The kernel copies user params into preallocated per-process scratch, dispatches capability methods, copies serialized results into caller-provided result buffers, and posts CapCqe. Current-process user copies and transfer-descriptor loads hold the caller’s AddressSpace mutex across permission validation and the actual HHDM-backed copy/read. A successful method returns non-negative bytes written. Transport failures are negative CAP_ERR_* codes. Application exceptions are serialized CapException payloads with CAP_ERR_APPLICATION_EXCEPTION. Ordinary capability implementation errors and live endpoint CALL/RETURN target errors use this application-exception path once a valid target cap or accepted endpoint relationship has been identified; malformed ring metadata, bad user buffers, lookup failures, and endpoint rollback/transfer failures stay in the transport namespace.
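On the userspace side, the three result classes can be separated with one comparison chain. The numeric value of CAP_ERR_APPLICATION_EXCEPTION below is a placeholder (the real constants live in capos-config/src/ring.rs), the `i32` result width is assumed, and the separate truncated application-exception code is not modeled.

```rust
// Placeholder value for illustration only; not the real ABI constant.
const CAP_ERR_APPLICATION_EXCEPTION: i32 = -100;

#[derive(Debug, PartialEq)]
enum Completion {
    /// Successful method: the result is the number of bytes written.
    Ok { bytes_written: usize },
    /// The result buffer holds a serialized CapException.
    ApplicationException,
    /// Any other negative CAP_ERR_* code: a transport failure.
    Transport { code: i32 },
}

fn classify(result: i32) -> Completion {
    if result >= 0 {
        Completion::Ok { bytes_written: result as usize }
    } else if result == CAP_ERR_APPLICATION_EXCEPTION {
        Completion::ApplicationException
    } else {
        Completion::Transport { code: result }
    }
}
```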

Transfer-bearing CALL and RETURN SQEs pack CapTransferDescriptor records after the params/result payload. Successful result-cap transfers append CapTransferResult records after normal result bytes.

Promise-pipelined CALLs use pipeline_dep as a process-local promised-answer identifier and pipeline_field as a zero-based CapTransferResult record ordinal from that answer’s completion. It is not a Cap’n Proto schema field number or payload path. The kernel resolves dependencies only through kernel-owned sideband result-cap records; normal result bytes stay opaque to the transport.
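A sketch of ordinal-based resolution, assuming a hypothetical `Answers` map and a reduced `CapTransferResult` with a single `cap_id` field; the `u64` and `u16` widths for pipeline_dep and pipeline_field are also assumptions. The lookup touches only sideband records and never parses result bytes.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
struct CapTransferResult {
    cap_id: u32, // hypothetical reduced record
}

/// Kernel-owned sideband state: answer id -> result-cap records in order.
#[derive(Default)]
struct Answers {
    by_answer: HashMap<u64, Vec<CapTransferResult>>,
}

impl Answers {
    fn record(&mut self, answer_id: u64, caps: Vec<CapTransferResult>) {
        self.by_answer.insert(answer_id, caps);
    }

    /// `pipeline_field` is a zero-based record ordinal, not a schema field number.
    fn resolve(&self, pipeline_dep: u64, pipeline_field: u16) -> Option<u32> {
        self.by_answer
            .get(&pipeline_dep)?
            .get(pipeline_field as usize)
            .map(|r| r.cap_id)
    }
}
```

For example, if answer 42 completed with two result caps, `pipeline_field = 1` selects the second record regardless of where that cap appears in the Cap’n Proto result struct.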

Kernel-served antecedents resolve from bounded per-drain RingScratch state. Endpoint antecedents may RETURN after the caller’s original drain, so they use a separate fixed 64-entry kernel table, capped at MAX_PROMISED_ANSWERS live entries per caller. Each entry is correlated by the caller’s full generation-tagged thread reference, a kernel epoch, and the userspace answer identifier. It also records the SQ tail published with the antecedent. A dependent waits unconsumed until the endpoint RETURN publishes the antecedent CQE. The kernel then wakes the submitting thread with a private CAP_ENTER_PROMISE_RETRY result; the runtime cap_enter wrapper consumes it and re-enters, so the caller resumes dispatch only through that frozen tail in its own syscall context, with no cross-thread re-drain. Submissions published later cannot join the old promise scope or reuse its answer identifier.

Successful endpoint RETURN keeps the result-cap records in kernel-owned state until the frozen scope drains. Server death or another endpoint cancellation publishes the antecedent’s concrete failure and marks the promise failed, so the dependent receives CAP_ERR_PIPELINE_ANTECEDENT_FAILED. Caller thread or process teardown drops its records before mappings can be reused. Promised endpoint antecedents do not donate their scheduling context to the server; otherwise the endpoint return path could wake the caller before the dependent completion required by the original cap_enter(min_complete=2) exists.
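The keying, bounds, and teardown rules for cross-drain endpoint promises can be sketched with a small table. All type and method names are hypothetical, the MAX_PROMISED_ANSWERS value is a placeholder, a `HashMap` stands in for the fixed kernel table, and the frozen-tail comparison ignores index wraparound for brevity.

```rust
use std::collections::HashMap;

const TABLE_ENTRIES: usize = 64;       // from the text: fixed 64-entry table
const MAX_PROMISED_ANSWERS: usize = 8; // placeholder value

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct ThreadRef {
    index: u32,
    generation: u32, // stale references fail to match
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct PromiseKey {
    caller: ThreadRef,
    epoch: u64,
    answer_id: u32,
}

#[derive(Debug, PartialEq)]
enum PromiseState {
    Pending,
    Resolved,
    Failed, // dependents see CAP_ERR_PIPELINE_ANTECEDENT_FAILED
}

struct Entry {
    state: PromiseState,
    frozen_sq_tail: u32, // SQ tail published with the antecedent
}

#[derive(Default)]
struct PromiseTable {
    entries: HashMap<PromiseKey, Entry>,
}

impl PromiseTable {
    fn insert(&mut self, key: PromiseKey, frozen_sq_tail: u32) -> bool {
        let live = self.entries.keys().filter(|k| k.caller == key.caller).count();
        if self.entries.len() >= TABLE_ENTRIES
            || live >= MAX_PROMISED_ANSWERS
            || self.entries.contains_key(&key)
        {
            return false; // bounded; answer ids are not reusable while live
        }
        self.entries.insert(key, Entry { state: PromiseState::Pending, frozen_sq_tail });
        true
    }

    fn set_state(&mut self, key: &PromiseKey, state: PromiseState) {
        if let Some(e) = self.entries.get_mut(key) {
            e.state = state;
        }
    }

    fn state(&self, key: &PromiseKey) -> Option<&PromiseState> {
        self.entries.get(key).map(|e| &e.state)
    }

    /// Only SQEs published before the frozen tail belong to the promise scope.
    fn in_frozen_scope(&self, key: &PromiseKey, sqe_index: u32) -> bool {
        self.entries.get(key).map_or(false, |e| sqe_index < e.frozen_sq_tail)
    }

    /// Caller thread or process teardown drops every record it owns.
    fn drop_caller(&mut self, caller: ThreadRef) {
        self.entries.retain(|k, _| k.caller != caller);
    }
}
```

Because the generation is part of the key, a recycled thread index with a new generation cannot match an old entry, which is the stale-reference property the text relies on.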

The answer table, broken-promise propagation, and explicit resource bounds are grounded in the CapTP comparison in Spritely, OCapN, and CapTP. Binding the continuation to generation-tagged local identity follows the stale-reference discipline documented in Fuchsia Zircon and synthesized in the Capability Systems Survey.

Future behavior should use the reserved SQE fields for system transport features, not ad hoc per-interface extensions.

Choosing A Capability Method, Ring Opcode, Or Syscall

New kernel functionality should default to a normal typed capability method. The small syscall surface is only the trap surface; the ring opcode table is also a reviewed kernel ABI and must stay narrow. The decision tree below is a full-page reference in the PDF because the branches are easier to read at diagram scale than as compressed prose.

flowchart TD
    Start[New kernel-visible operation] --> Ambient{Must it run without any held capability?}
    Ambient -- yes --> Trap{Is it process lifecycle or kernel-entry control?}
    Trap -- yes --> Syscall[Consider a syscall]
    Trap -- no --> RejectAmbient[Reject or redesign around explicit authority]
    Ambient -- no --> CapMethod{Can it be expressed as a typed object method?}
    CapMethod -- no --> Redesign[Redesign the authority object or transport contract]
    CapMethod -- yes --> Hot{Is generic Cap'n Proto CALL materially wrong?}
    Hot -- no --> Method[Use CAP_OP_CALL to a capability method]
    Hot -- yes --> RingSpecific{Does it need ring/scheduler-specific semantics?}
    RingSpecific -- no --> Method
    RingSpecific -- yes --> Stable{Is the compact SQE/CQE ABI stable and capability-authorized?}
    Stable -- no --> MethodOrDesign[Keep a capability method or write a reviewed design first]
    Stable -- yes --> Opcode[Consider a new CAP_OP_* opcode]

Use a normal capability method when the operation is control plane, policy driven, service-specific, infrequent, or naturally represented by Cap’n Proto params/results. Process spawning, credential checks, storage naming, shell or network policy, virtual-memory control-plane calls, and most device-specific commands belong here unless measurement and design review prove otherwise.

Consider a compact ring opcode only when all of these are true:

The operation is a hot path or scheduler path where generic Cap’n Proto framing is materially wrong.
The operation has a small, stable field layout that fits the existing SQE/CQE model without per-interface ad hoc extensions.
It needs ring-specific behavior such as thread ownership, reserved completion credit, CQ ordering/backpressure, asynchronous completion delivery, or interaction with the process ring head.
It remains authorized by a held capability in cap_id, not by ambient process identity or guessed kernel object names.
It cannot be handled as a normal capability method plus a future generated fast client without losing an essential scheduler or transport invariant.

Consider a new syscall only when the operation is about entering or leaving the kernel execution context itself and cannot sensibly be authorized by a capability already available to the process. That bar is intentionally higher than the opcode bar. Ordinary resource operations should not become syscalls just because they are common.
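The decision tree in this section can also be written as a small function, which makes the branch order explicit. The `Operation` fields and `Outcome` variants are illustrative names that mirror the flowchart nodes; they are not part of any capOS API.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    ConsiderSyscall,
    RejectOrRedesignAroundAuthority,
    RedesignAuthorityOrTransport,
    CapabilityMethod,
    MethodOrReviewedDesignFirst,
    ConsiderOpcode,
}

#[derive(Clone, Copy, Default)]
struct Operation {
    must_run_without_capability: bool,
    is_lifecycle_or_entry_control: bool,
    expressible_as_typed_method: bool,
    generic_call_materially_wrong: bool,
    needs_ring_or_scheduler_semantics: bool,
    compact_abi_stable_and_authorized: bool,
}

fn decide(op: &Operation) -> Outcome {
    if op.must_run_without_capability {
        return if op.is_lifecycle_or_entry_control {
            Outcome::ConsiderSyscall
        } else {
            Outcome::RejectOrRedesignAroundAuthority
        };
    }
    if !op.expressible_as_typed_method {
        return Outcome::RedesignAuthorityOrTransport;
    }
    // The default: a normal typed capability method via CAP_OP_CALL.
    if !op.generic_call_materially_wrong || !op.needs_ring_or_scheduler_semantics {
        return Outcome::CapabilityMethod;
    }
    if op.compact_abi_stable_and_authorized {
        Outcome::ConsiderOpcode
    } else {
        Outcome::MethodOrReviewedDesignFirst
    }
}
```

Note that an operation must clear every later test to reach `ConsiderOpcode`, while a single "no" on the hot-path or ring-semantics questions falls back to a capability method.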

Full-SMP Direction

The current process-wide ring is not the target ABI for full SMP. Once sibling threads in one process can run on different CPUs, a shared process CQ would force userspace to serialize completion consumption or the kernel to invent specific-wait state on top of circular-buffer slots.

The selected future direction is per-thread ring ownership, documented in Ring v2 For Full SMP. In that model, cap_enter(min_complete, timeout_ns) keeps its current aggregate wait shape, but the aggregate is the current thread’s CQ. Completion paths post by generation-checked ThreadRef, while result-cap transfers and authority still belong to the process cap table.

The first Ring v2 implementation should use kernel-chosen child-thread ring mappings. The initial fixed RING_VADDR mapping becomes a compatibility special case backed by the same RingEndpoint lifetime and waiter rules as child-thread rings. Runtime-supplied ring address ranges are deferred until VirtualMemory can reserve a ring arena without racing ordinary mappings.

The initial Phase C multi-CPU scheduler proof may continue to use the current process-wide ring as long as userspace serializes ring consumption. Ring v2 is the target for full SMP with sibling threads from one process running and waiting independently on different CPUs.

A runtime reactor can bridge the current process-wide ring for multithreaded runtimes before Ring v2: one runtime-owned drainer consumes the process CQ, matches completions by user_data, and wakes waiting threads through ParkSpace. That bridge is not the full-SMP kernel ABI.
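The demultiplexing half of such a bridge might look like the following sketch. `Reactor`, `register`, and `deliver` are hypothetical names, the reduced `Cqe` is an assumption, and a `std::sync::mpsc` channel stands in for the ParkSpace wake so the example runs on a host.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

#[derive(Clone, Copy, Debug, PartialEq)]
struct Cqe {
    user_data: u64,
    result: i32,
}

/// Owned by the single runtime drainer; waiters are keyed by user_data.
#[derive(Default)]
struct Reactor {
    waiters: HashMap<u64, Sender<Cqe>>,
}

impl Reactor {
    /// A thread registers interest before submitting its SQE.
    fn register(&mut self, user_data: u64) -> Receiver<Cqe> {
        let (tx, rx) = channel();
        self.waiters.insert(user_data, tx);
        rx
    }

    /// Called only by the drainer with CQEs it reaped from the process CQ.
    /// Returns how many completions reached a live waiter.
    fn deliver(&mut self, cqes: &[Cqe]) -> usize {
        let mut delivered = 0;
        for cqe in cqes {
            if let Some(tx) = self.waiters.remove(&cqe.user_data) {
                // The send stands in for a ParkSpace wake of the waiter.
                if tx.send(*cqe).is_ok() {
                    delivered += 1;
                }
            }
        }
        delivered
    }
}
```

Only the drainer touches the CQ, so completion consumption stays serialized, as the current process-wide ring requires.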

Invariants

SQ and CQ sizes are powers of two and fixed by the ABI.
Unknown opcodes fail closed; FINISH is reserved, not silently accepted.
Reserved fields must be zero for currently implemented opcodes, except that CAP_SQE_THREAD_OWNED CALL and PARK SQEs may carry the owning thread id in call_id.
PARK/UNPARK SQEs must keep unsupported fields zero and must not be dispatched from timer context.
MINT_SERVICE_FACET is owner-only (is_endpoint_owner()), keeps unsupported fields zero, is not dispatched from timer context, validates the result buffer before mutating any kernel state, and rolls the inserted facet back on any post-insert failure so the mint is all-or-nothing. The minted facet’s scope is inherited from the owner’s hold and never widened, and it carries no RECV/RETURN authority.
cap_enter rejects min_complete > CQ_ENTRIES.
User-buffer validation and copy/read must hold the owning process AddressSpace mutex for CALL params/results, RECV result buffers, RETURN payloads, transfer descriptors, and deferred same-process completions.
Timer dispatch must not run capabilities that allocate, block on locks, or mutate page tables unless the cap explicitly opts in.
Per-dispatch SQE processing is bounded by SQ_ENTRIES.
Transfer descriptors must be aligned, valid, and bounded by MAX_TRANSFER_DESCRIPTORS.
Promise-pipelined dependency resolution must use sideband CapTransferResult ordinals, never general Cap’n Proto result traversal in the kernel.
Cross-drain endpoint promises must be keyed by full caller generation, kernel epoch, and answer identifier; consume dependents only through the antecedent’s frozen SQ tail; and fail closed on teardown or cancellation.
Endpoint promise resolution must publish the antecedent CQE before making the dependent runnable, and must never reconstruct result caps from caller memory.
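A few of these invariants are cheap to state as code. The sketch below assumes the 16/32 sizes given earlier on this page; the function names and the string error are illustrative only.

```rust
const SQ_ENTRIES: u32 = 16;
const CQ_ENTRIES: u32 = 32;

// Compile-time check: both sizes are powers of two.
const _: () = assert!(SQ_ENTRIES.is_power_of_two() && CQ_ENTRIES.is_power_of_two());

/// cap_enter rejects a wait that the CQ could never satisfy.
fn validate_min_complete(min_complete: u32) -> Result<(), &'static str> {
    if min_complete > CQ_ENTRIES {
        Err("min_complete exceeds CQ_ENTRIES")
    } else {
        Ok(())
    }
}

/// Per-dispatch work is bounded by SQ_ENTRIES even if the published
/// head/tail distance is larger (for example after ring corruption).
fn bounded_batch(sq_head: u32, sq_tail: u32) -> u32 {
    sq_tail.wrapping_sub(sq_head).min(SQ_ENTRIES)
}
```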

Code Map

capos-config/src/ring.rs - shared ring ABI, opcodes, errors, SQE/CQE structs, endpoint message headers, transfer records.
kernel/src/cap/ring.rs - kernel dispatcher, SQE validation, CQE posting, cap calls, endpoint CALL/RECV/RETURN, release, transfer framing.
kernel/src/arch/x86_64/syscall.rs - cap_enter syscall.
kernel/src/sched.rs - timer polling, cap-enter blocking, direct IPC wake, and bounded resumption of endpoint-promise scopes.
kernel/src/process.rs - ring page allocation and mapping.
capos-rt/src/ring.rs - runtime ring client, pending calls, transfer packing, result-cap parsing.
capos-rt/src/entry.rs - single-owner runtime ring client token and release queue flushing.
capos-config/tests/ring_loom.rs - bounded producer/consumer model.

Validation

cargo test-ring-loom validates SQ/CQ producer-consumer behavior, capacity, FIFO, CQ overflow/drop behavior, corrupted SQ recovery, endpoint promise return/wake ordering, CQ-backpressure retry, and teardown suppression.
make run-promise-pipeline covers kernel-served and endpoint-served antecedents, including endpoint RETURN success and server-death failure.
make test-service-facet-mint covers the local endpoint-owner service-facet mint: an owner mints a client facet with no process spawned, the minted facet refuses RECV and cannot itself mint, and a client round-trips through it. Its second stage forces the mint’s per-cap epoch allocation to fail after the facet object is allocated, proving that exhaustion returns a negative completion and rolls back the facet, the cap slot, and the endpoint reference instead of failing the kernel.
make run exercises Console CALLs, reserved opcode rejection, ring corruption recovery, NOP, fairness, transfers, and endpoint IPC.
make run-measure exercises measurement-only counters, dispatch segment cycle summaries, the NullCap baseline, the ParkBench compact-versus-generic comparison, and the real ParkSpace blocked/resume timing path.
cargo test-config covers shared ring layout and helper invariants.
make capos-rt-check checks userspace runtime ring code under the bare-metal target.

Open Work

Implement CAP_OP_FINISH as part of the system Cap’n Proto transport.
Define an ABI for promise chains deeper than one dependent hop.
Define LINK, DRAIN, and MULTISHOT semantics before accepting those flags.
Add runtime-level ParkSpace wrappers and completion demultiplexing on top of the compact opcodes.
Add the runtime reactor bridge for multithreaded use of the current process ring, then replace it as the kernel fast path with per-thread Ring v2 completion ownership.
Add SQPOLL after SMP gives the kernel a spare execution context.
