Capability Ring

The capability ring is the userspace-to-kernel transport for capability invocation. It avoids one syscall per operation while preserving a typed Cap’n Proto method boundary and explicit completion reporting.

The current error model is documented in Error Handling. Ring CQE status values report transport failures; typed capability exceptions and ordinary schema result unions sit above that transport layer.

Current Behavior

Each non-idle process gets one 4 KiB ring page mapped at RING_VADDR. The page contains a volatile header, a 16-entry submission queue, and a 32-entry completion queue. Userspace writes CapSqe records, advances sq_tail, and uses cap_enter(min_complete, timeout_ns) to make ordinary calls progress.
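The sizes above can be checked with a little arithmetic. The sketch below uses the page size, queue depths, and the 64-byte CapSqe size stated on this page; the free-running wrapping indices and the helper names (bytes_after_sq, slot, pending, has_room) are assumptions made for illustration, not the definitions in capos-config/src/ring.rs.

```rust
// Values taken from this page.
const PAGE_SIZE: usize = 4096;
const SQ_ENTRIES: u32 = 16;
const CQ_ENTRIES: u32 = 32;
const SQE_SIZE: usize = 64;

/// Bytes left in the ring page for the header and the completion queue
/// once the submission queue has been laid out.
const fn bytes_after_sq() -> usize {
    PAGE_SIZE - SQ_ENTRIES as usize * SQE_SIZE
}

/// Map a free-running index to a slot. Assumes (hypothetically) that
/// head/tail are wrapping u32 counters masked by a power-of-two size.
fn slot(index: u32, entries: u32) -> usize {
    debug_assert!(entries.is_power_of_two());
    (index & (entries - 1)) as usize
}

/// Entries published by the producer and not yet consumed.
fn pending(head: u32, tail: u32) -> u32 {
    tail.wrapping_sub(head)
}

/// True when the producer may publish one more entry.
fn has_room(head: u32, tail: u32, entries: u32) -> bool {
    pending(head, tail) < entries
}
```

With these numbers the submission queue occupies 1024 bytes, which leaves 3072 bytes for the header and the 32 completion entries.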

sequenceDiagram
    participant U as Userspace runtime
    participant R as Ring page
    participant K as Kernel ring dispatcher
    participant C as Capability object
    U->>R: write CapSqe and advance sq_tail
    U->>K: cap_enter(min_complete, timeout_ns)
    K->>R: read sq_head..sq_tail
    K->>K: validate SQE fields and lock AddressSpace for user buffers
    K->>C: call method or endpoint operation
    C-->>K: completion, pending, or error
    K->>R: write CapCqe and advance cq_tail
    K-->>U: return available CQE count
    U->>R: read matching CapCqe

Timer polling also processes each current process’s ring before preemption, but only non-CALL operations and CALL targets that explicitly allow interrupt dispatch may run there. Ordinary CALLs wait for cap_enter.

Why ordinary CALL waits for cap_enter: Submitting a CALL SQE is only a shared-memory write. The kernel still needs a safe execution point to drain the ring and run capability code. Timer polling runs in interrupt context, so it must not execute arbitrary capability methods that may allocate, block on locks, mutate page tables, spawn processes, parse Cap’n Proto messages, or perform IPC side effects. cap_enter is the normal process-context drain point: it processes pending SQEs, posts CQEs, and then either returns the available completion count or blocks until enough completions arrive. The design keeps SQE publication syscall-free and batchable, keeps the syscall ABI limited to exit and cap_enter, and avoids turning the timer interrupt into a general capability executor. A future SQPOLL-style path can remove the explicit syscall from the hot path only by running dispatch in a worker context, not from arbitrary timer interrupt execution.
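The context rule can be summarized as a small predicate. This is a sketch of the policy as described on this page, not the kernel's dispatcher; DrainContext, SqeKind, and may_dispatch are hypothetical names, and the PARK/UNPARK case reflects the syscall-only rule stated under Design.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum DrainContext {
    /// Process context reached through cap_enter.
    Syscall,
    /// Timer polling before preemption.
    TimerInterrupt,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SqeKind {
    /// CAP_OP_CALL; the flag models a target that explicitly allows
    /// interrupt dispatch.
    Call { target_allows_interrupt: bool },
    /// CAP_OP_PARK / CAP_OP_UNPARK, which are syscall-context only.
    ParkOrUnpark,
    /// Any other non-CALL operation.
    OtherNonCall,
}

/// Whether an SQE may be executed from the given drain point.
fn may_dispatch(ctx: DrainContext, sqe: SqeKind) -> bool {
    match (ctx, sqe) {
        // cap_enter is the normal drain point and may run everything.
        (DrainContext::Syscall, _) => true,
        // Timer polling runs CALLs only when the target opted in.
        (DrainContext::TimerInterrupt, SqeKind::Call { target_allows_interrupt }) => {
            target_allows_interrupt
        }
        (DrainContext::TimerInterrupt, SqeKind::ParkOrUnpark) => false,
        (DrainContext::TimerInterrupt, SqeKind::OtherNonCall) => true,
    }
}
```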

Design

CapSqe is a fixed 64-byte ABI record. Each opcode has a narrow role:

  • CAP_OP_CALL names a local cap-table slot and method ID plus parameter/result buffers.
  • CAP_OP_RECV and CAP_OP_RETURN implement endpoint IPC. CAP_OP_RETURN normally returns successful result bytes to the original caller; with CAP_SQE_RETURN_APPLICATION_EXCEPTION, its payload is a serialized CapException and the original caller completes with CAP_ERR_APPLICATION_EXCEPTION or the truncated application-exception code.
  • CAP_OP_RELEASE removes a local cap-table slot through the transport.
  • CAP_OP_CANCEL (opcode 6) cancels a pending endpoint receive posted by the same process on the same endpoint cap; pipeline_dep carries the receive SQE’s user_data.
  • CAP_OP_NOP measures the fixed ring path.
  • CAP_OP_PARK_BENCH (opcode 7) is a measurement-only compact opcode dispatched only by kernels built with the measure feature; normal kernels reject it as malformed.
  • CAP_OP_FINISH is ABI-reserved and currently returns CAP_ERR_UNSUPPORTED_OPCODE.
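As a rough illustration of how these opcodes are treated at the front of the dispatcher, the sketch below classifies each one. The Op and Disposition types and the measure_build parameter are hypothetical stand-ins; numeric opcode values are deliberately left out because this page documents only two of them.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Op {
    Call,
    Recv,
    Return,
    Release,
    Cancel,
    Nop,
    ParkBench,
    Finish,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Disposition {
    /// The opcode proceeds to field validation and dispatch.
    Dispatch,
    /// The SQE is rejected as malformed.
    RejectMalformed,
    /// The SQE completes with CAP_ERR_UNSUPPORTED_OPCODE.
    RejectUnsupported,
}

/// First-level handling of an opcode. `measure_build` models a kernel
/// compiled with the measure feature.
fn disposition(op: Op, measure_build: bool) -> Disposition {
    match op {
        // Reserved in the ABI, never silently accepted.
        Op::Finish => Disposition::RejectUnsupported,
        // Measurement-only: normal kernels treat it as malformed.
        Op::ParkBench if !measure_build => Disposition::RejectMalformed,
        _ => Disposition::Dispatch,
    }
}
```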

CAP_OP_RELEASE is deliberately scoped to local transport cleanup. It removes one holder’s cap-table slot after the SQE is processed, or as part of process exit cleanup; it does not revoke peer-held caps, cancel delegated authority, or stand in for an application close method. Services that need security-visible invalidation must use an explicit control path such as CapabilityManager.revoke, session expiry, object epochs, or a service-specific close/revoke protocol. Reviewers should treat claims based only on handle drop, RAII, GC finalizers, or queued release flushing as local-cleanup claims, not revocation claims.
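The difference between local release and revocation is easy to see in a toy model. The CapTable type below is a hypothetical simplification (a map from slot number to object id), not the kernel's cap table; it only shows that dropping one holder's slot leaves a peer's slot for the same object untouched.

```rust
use std::collections::HashMap;

type ObjectId = u32;

/// Toy per-process cap table: slot number -> object id.
struct CapTable {
    slots: HashMap<u32, ObjectId>,
}

impl CapTable {
    fn with_slot(slot: u32, object: ObjectId) -> Self {
        Self { slots: HashMap::from([(slot, object)]) }
    }

    /// Local cleanup only: removes this holder's slot and reports
    /// whether the slot existed.
    fn release(&mut self, slot: u32) -> bool {
        self.slots.remove(&slot).is_some()
    }

    /// Whether this table still holds any slot for `object`.
    fn holds(&self, object: ObjectId) -> bool {
        self.slots.values().any(|&o| o == object)
    }
}
```

If a client and a peer both hold object 7, a release by the client changes only the client's table; the peer's authority ends only through an explicit revocation path.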

Opcode boundary: Ring opcodes are kernel ABI, not a loophole around the syscall surface. cap_enter and exit remain the CPU trap entrypoints, but every accepted authority-bearing or resource-mutating CAP_OP_* still adds distinct kernel semantics that must pass the capability method / ring opcode / syscall decision graph. No-authority diagnostics such as CAP_OP_NOP are still kernel ABI and must stay side-effect-free and review-visible, but they are not resource authority paths. CAP_OP_PARK and CAP_OP_UNPARK are justified because blocking wait mutates scheduler state, must be thread-owned on the process ring, reserves completion credit for later wake/timeout delivery, and needs compact capability-authorized hot-path framing. They are not a precedent for moving ordinary object methods into the opcode table for convenience.

CAP_OP_CALL may set CAP_SQE_THREAD_OWNED with call_id equal to the owning thread id. If another thread drains the shared process ring first, the kernel leaves that SQE at the head instead of consuming it and returns a distinct owner-head cap_enter result instead of blocking the non-owner behind it. This is limited to context-sensitive self-thread operations such as ThreadControl.exitThread; ordinary runtime submissions leave call_id = 0.
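The head-of-queue rule can be sketched as a drain loop that stops at the first SQE owned by a different thread. The Sqe and DrainOutcome types, and the u64 thread id, are assumptions for illustration; the real cap_enter return encoding is not described on this page.

```rust
#[derive(Clone, Copy)]
struct Sqe {
    /// Models the CAP_SQE_THREAD_OWNED flag.
    thread_owned: bool,
    /// Owning thread id when thread_owned is set, otherwise 0.
    call_id: u64,
}

#[derive(Debug, PartialEq, Eq)]
enum DrainOutcome {
    /// Every pending SQE was consumed.
    Drained(usize),
    /// A thread-owned SQE for another thread is now at the head;
    /// `consumed` entries ahead of it were processed.
    OwnerHead { consumed: usize },
}

/// Drain pending SQEs in order on behalf of `current_thread`.
fn drain(pending: &[Sqe], current_thread: u64) -> DrainOutcome {
    let mut consumed = 0;
    for sqe in pending {
        if sqe.thread_owned && sqe.call_id != current_thread {
            // Leave the SQE in place instead of consuming it.
            return DrainOutcome::OwnerHead { consumed };
        }
        consumed += 1;
    }
    DrainOutcome::Drained(consumed)
}
```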

CAP_OP_PARK and CAP_OP_UNPARK are compact capability-authorized operations for process-local ParkSpace. Wait SQEs must set CAP_SQE_THREAD_OWNED with call_id equal to the owning thread id; a non-owner cap_enter leaves the SQE at the head just like a thread-owned CALL. They reject promise-pipeline fields and run only from syscall-context ring dispatch, not timer polling. A blocking wait consumes the SQE but posts no caller CQE immediately; instead it reserves one waiter CQE credit, parks the current thread, and later completes with a non-negative park status. Ordinary CQE posting treats reserved park credits as unavailable so wake and timeout delivery cannot lose waiter completions.
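The credit rule amounts to simple bookkeeping: reserved park credits count against the space that ordinary completions may use. The CqCredits type below is a hypothetical model of that accounting, not the kernel's data structure.

```rust
/// Toy completion-queue accounting with reserved waiter credits.
struct CqCredits {
    capacity: u32,
    /// CQEs posted and not yet consumed by userspace.
    queued: u32,
    /// Slots promised to parked waiters for later wake/timeout delivery.
    reserved_for_park: u32,
}

impl CqCredits {
    fn new(capacity: u32) -> Self {
        Self { capacity, queued: 0, reserved_for_park: 0 }
    }

    /// Slots an ordinary completion may use; reserved credits are excluded.
    /// Never underflows because queued + reserved_for_park <= capacity.
    fn free_for_ordinary(&self) -> u32 {
        self.capacity - self.queued - self.reserved_for_park
    }

    /// Post an ordinary CQE if an unreserved slot exists.
    fn post_ordinary(&mut self) -> bool {
        if self.free_for_ordinary() == 0 {
            return false;
        }
        self.queued += 1;
        true
    }

    /// Reserve one waiter credit when a blocking wait is accepted.
    fn reserve_for_park(&mut self) -> bool {
        if self.free_for_ordinary() == 0 {
            return false;
        }
        self.reserved_for_park += 1;
        true
    }

    /// Deliver a wake or timeout completion using a reserved credit.
    fn complete_park(&mut self) -> bool {
        if self.reserved_for_park == 0 {
            return false;
        }
        self.reserved_for_park -= 1;
        self.queued += 1;
        true
    }
}
```

Because a reserved credit is converted into a queued CQE rather than competing for free space, the later wake or timeout always has a slot.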

The kernel copies user params into preallocated per-process scratch, dispatches capability methods, copies serialized results into caller-provided result buffers, and posts CapCqe. Current-process user copies and transfer-descriptor loads hold the caller’s AddressSpace mutex across permission validation and the actual HHDM-backed copy/read. A successful method returns non-negative bytes written. Transport failures are negative CAP_ERR_* codes. Application exceptions are serialized CapException payloads with CAP_ERR_APPLICATION_EXCEPTION. Ordinary capability implementation errors and live endpoint CALL/RETURN target errors use this application-exception path once a valid target cap or accepted endpoint relationship has been identified; malformed ring metadata, bad user buffers, lookup failures, and endpoint rollback/transfer failures stay in the transport namespace.
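From the caller's side, the three outcomes can be told apart from the CQE result value. In the sketch below the i32 result width and the numeric value of CAP_ERR_APPLICATION_EXCEPTION are placeholders chosen for illustration; the real constants live in capos-config/src/ring.rs.

```rust
/// Placeholder value, NOT the real ABI constant.
const CAP_ERR_APPLICATION_EXCEPTION: i32 = -100;

#[derive(Debug, PartialEq, Eq)]
enum Completion {
    /// Successful method: number of result bytes written.
    Ok { bytes_written: usize },
    /// The result buffer holds a serialized CapException.
    ApplicationException,
    /// Any other negative CAP_ERR_* transport failure.
    Transport(i32),
}

/// Classify a CQE result value.
fn classify(result: i32) -> Completion {
    if result >= 0 {
        Completion::Ok { bytes_written: result as usize }
    } else if result == CAP_ERR_APPLICATION_EXCEPTION {
        Completion::ApplicationException
    } else {
        Completion::Transport(result)
    }
}
```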

Transfer-bearing CALL and RETURN SQEs pack CapTransferDescriptor records after the params/result payload. Successful result-cap transfers append CapTransferResult records after normal result bytes.

Promise-pipelined CALLs remain rejected by current kernels. The reserved fields already have a defined meaning for when pipelining is accepted: pipeline_dep names a process-local promised-answer identifier, and pipeline_field selects a zero-based CapTransferResult record from that answer’s completion. pipeline_field is not a Cap’n Proto schema field number or payload path. The kernel resolves dependencies only through the sideband result-cap records it already owns; normal result bytes stay opaque to the transport.
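A sketch of the intended lookup makes the ordinal rule concrete. Everything here is hypothetical, since current kernels reject pipelined CALLs: the field widths, the cap_slot field, and the AnswerCompletion type are illustrative only. The point is that resolution indexes the sideband records and never inspects result_bytes.

```rust
use std::collections::HashMap;

/// Hypothetical sideband record for one transferred result cap.
struct CapTransferResult {
    cap_slot: u32,
}

/// Hypothetical completed answer: opaque bytes plus sideband records.
struct AnswerCompletion {
    #[allow(dead_code)]
    result_bytes: Vec<u8>, // opaque to the transport, never parsed here
    transfer_results: Vec<CapTransferResult>,
}

/// Resolve a pipelined dependency to a cap slot, or None if the answer
/// or the ordinal does not exist.
fn resolve(
    answers: &HashMap<u64, AnswerCompletion>,
    pipeline_dep: u64,
    pipeline_field: u32,
) -> Option<u32> {
    let answer = answers.get(&pipeline_dep)?;
    // pipeline_field is a zero-based ordinal into the sideband records.
    let record = answer.transfer_results.get(pipeline_field as usize)?;
    Some(record.cap_slot)
}
```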

Future behavior should use the reserved SQE fields for system transport features, not ad hoc per-interface extensions.

Choosing A Capability Method, Ring Opcode, Or Syscall

New kernel functionality should default to a normal typed capability method. The small syscall surface is only the trap surface; the ring opcode table is also a reviewed kernel ABI and must stay narrow. The decision tree below is a full-page reference in the PDF because the branches are easier to read at diagram scale than as compressed prose.

flowchart TD
    Start[New kernel-visible operation] --> Ambient{Must it run without any held capability?}
    Ambient -- yes --> Trap{Is it process lifecycle or kernel-entry control?}
    Trap -- yes --> Syscall[Consider a syscall]
    Trap -- no --> RejectAmbient[Reject or redesign around explicit authority]
    Ambient -- no --> CapMethod{Can it be expressed as a typed object method?}
    CapMethod -- no --> Redesign[Redesign the authority object or transport contract]
    CapMethod -- yes --> Hot{Is generic Cap'n Proto CALL materially wrong?}
    Hot -- no --> Method[Use CAP_OP_CALL to a capability method]
    Hot -- yes --> RingSpecific{Does it need ring/scheduler-specific semantics?}
    RingSpecific -- no --> Method
    RingSpecific -- yes --> Stable{Is the compact SQE/CQE ABI stable and capability-authorized?}
    Stable -- no --> MethodOrDesign[Keep a capability method or write a reviewed design first]
    Stable -- yes --> Opcode[Consider a new CAP_OP_* opcode]

Use a normal capability method when the operation is control plane, policy driven, service-specific, infrequent, or naturally represented by Cap’n Proto params/results. Process spawning, credential checks, storage naming, shell or network policy, virtual-memory control-plane calls, and most device-specific commands belong here unless measurement and design review prove otherwise.

Consider a compact ring opcode only when all of these are true:

  • The operation is a hot path or scheduler path where generic Cap’n Proto framing is materially wrong.
  • The operation has a small, stable field layout that fits the existing SQE/CQE model without per-interface ad hoc extensions.
  • It needs ring-specific behavior such as thread ownership, reserved completion credit, CQ ordering/backpressure, asynchronous completion delivery, or interaction with the process ring head.
  • It remains authorized by a held capability in cap_id, not by ambient process identity or guessed kernel object names.
  • It cannot be handled as a normal capability method plus a future generated fast client without losing an essential scheduler or transport invariant.

Consider a new syscall only when the operation is about entering or leaving the kernel execution context itself and cannot sensibly be authorized by a capability already available to the process. That bar is intentionally higher than the opcode bar. Ordinary resource operations should not become syscalls just because they are common.
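The decision graph in this section can also be read as a function. The sketch below mirrors the flowchart branch by branch; the Operation and Recommendation types are hypothetical and exist only to make the branches checkable.

```rust
/// Answers to the flowchart's questions for one proposed operation.
#[derive(Default)]
struct Operation {
    must_run_without_capability: bool,
    is_lifecycle_or_kernel_entry: bool,
    expressible_as_typed_method: bool,
    generic_call_materially_wrong: bool,
    needs_ring_or_scheduler_semantics: bool,
    compact_abi_stable_and_cap_authorized: bool,
}

#[derive(Debug, PartialEq, Eq)]
enum Recommendation {
    ConsiderSyscall,
    RejectOrRedesignAroundAuthority,
    RedesignObjectOrTransport,
    CapabilityMethod,
    MethodOrReviewedDesignFirst,
    ConsiderOpcode,
}

fn recommend(op: &Operation) -> Recommendation {
    if op.must_run_without_capability {
        return if op.is_lifecycle_or_kernel_entry {
            Recommendation::ConsiderSyscall
        } else {
            Recommendation::RejectOrRedesignAroundAuthority
        };
    }
    if !op.expressible_as_typed_method {
        return Recommendation::RedesignObjectOrTransport;
    }
    // Default path: an ordinary CAP_OP_CALL to a typed method.
    if !op.generic_call_materially_wrong || !op.needs_ring_or_scheduler_semantics {
        return Recommendation::CapabilityMethod;
    }
    if op.compact_abi_stable_and_cap_authorized {
        Recommendation::ConsiderOpcode
    } else {
        Recommendation::MethodOrReviewedDesignFirst
    }
}
```

An operation shaped like CAP_OP_PARK answers yes to the last four questions and reaches ConsiderOpcode; almost everything else stops at CapabilityMethod.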

Full-SMP Direction

The current process-wide ring is not the target ABI for full SMP. Once sibling threads in one process can run on different CPUs, a shared process CQ would force userspace to serialize completion consumption or the kernel to invent specific-wait state on top of circular-buffer slots.

The selected future direction is per-thread ring ownership, documented in Ring v2 For Full SMP. In that model, cap_enter(min_complete, timeout_ns) keeps its current aggregate wait shape, but the aggregate is the current thread’s CQ. Completion paths post by generation-checked ThreadRef, while result-cap transfers and authority still belong to the process cap table.

The first Ring v2 implementation should use kernel-chosen child-thread ring mappings. The initial fixed RING_VADDR mapping becomes a compatibility special case backed by the same RingEndpoint lifetime and waiter rules as child-thread rings. Runtime-supplied ring address ranges are deferred until VirtualMemory can reserve a ring arena without racing ordinary mappings.

The initial Phase C multi-CPU scheduler proof may continue to use the current process-wide ring as long as userspace serializes ring consumption. Ring v2 is the target for full SMP with sibling threads from one process running and waiting independently on different CPUs.

A runtime reactor can bridge the current process-wide ring for multithreaded runtimes before Ring v2: one runtime-owned drainer consumes the process CQ, matches completions by user_data, and wakes waiting threads through ParkSpace. That bridge is not the full-SMP kernel ABI.
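The shape of such a bridge can be sketched in hosted Rust. This is a model under loud assumptions: std channels stand in for ParkSpace wakeups, and the Cqe and Reactor types are hypothetical; the real runtime is bare-metal and would not use std::sync::mpsc.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Simplified completion entry.
struct Cqe {
    user_data: u64,
    result: i32,
}

/// Single drainer that owns completion consumption for the process ring.
struct Reactor {
    waiters: HashMap<u64, Sender<i32>>,
}

impl Reactor {
    fn new() -> Self {
        Self { waiters: HashMap::new() }
    }

    /// A submitting thread registers its user_data before publishing
    /// the SQE and then blocks on the returned receiver.
    fn register(&mut self, user_data: u64) -> Receiver<i32> {
        let (tx, rx) = channel();
        self.waiters.insert(user_data, tx);
        rx
    }

    /// Consume completions, wake the matching waiter for each, and
    /// return how many had no registered waiter.
    fn drain(&mut self, completions: impl IntoIterator<Item = Cqe>) -> usize {
        let mut unmatched = 0;
        for cqe in completions {
            match self.waiters.remove(&cqe.user_data) {
                // A waiter that already went away is not an error here.
                Some(tx) => {
                    let _ = tx.send(cqe.result);
                }
                None => unmatched += 1,
            }
        }
        unmatched
    }
}
```

Completions may arrive in any order; matching by user_data is what lets one drainer serve many waiting threads.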

Invariants

  • SQ and CQ sizes are powers of two and fixed by the ABI.
  • Unknown opcodes fail closed; FINISH is reserved, not silently accepted.
  • Reserved fields must be zero for currently implemented opcodes, except that CAP_SQE_THREAD_OWNED CALL and PARK SQEs may carry the owning thread id in call_id.
  • PARK/UNPARK SQEs must keep unsupported fields zero and must not be dispatched from timer context.
  • cap_enter rejects min_complete > CQ_ENTRIES.
  • User-buffer validation and copy/read must hold the owning process AddressSpace mutex for CALL params/results, RECV result buffers, RETURN payloads, transfer descriptors, and deferred same-process completions.
  • Timer dispatch must not run capabilities that allocate, block on locks, or mutate page tables unless the cap explicitly opts in.
  • Per-dispatch SQE processing is bounded by SQ_ENTRIES.
  • Transfer descriptors must be aligned, valid, and bounded by MAX_TRANSFER_DESCRIPTORS.
  • Promise-pipelined dependency resolution must use sideband CapTransferResult ordinals, never general Cap’n Proto result traversal in the kernel.
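A few of these invariants are cheap to encode directly. The sketch below uses the queue depths given on this page; the function names, the error string, and the u32 argument widths are assumptions for illustration rather than the kernel's actual checks.

```rust
const SQ_ENTRIES: u32 = 16;
const CQ_ENTRIES: u32 = 32;

// Compile-time check: both queue sizes are powers of two.
const _: () = assert!(SQ_ENTRIES.is_power_of_two() && CQ_ENTRIES.is_power_of_two());

/// cap_enter argument check: waiting for more completions than the CQ
/// can hold could never be satisfied.
fn validate_min_complete(min_complete: u32) -> Result<u32, &'static str> {
    if min_complete > CQ_ENTRIES {
        Err("min_complete exceeds CQ_ENTRIES")
    } else {
        Ok(min_complete)
    }
}

/// Upper bound on SQEs processed in one dispatch pass, however large
/// the (possibly corrupted) pending count looks.
fn dispatch_budget(pending: u32) -> u32 {
    pending.min(SQ_ENTRIES)
}
```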

Code Map

  • capos-config/src/ring.rs - shared ring ABI, opcodes, errors, SQE/CQE structs, endpoint message headers, transfer records.
  • kernel/src/cap/ring.rs - kernel dispatcher, SQE validation, CQE posting, cap calls, endpoint CALL/RECV/RETURN, release, transfer framing.
  • kernel/src/arch/x86_64/syscall.rs - cap_enter syscall.
  • kernel/src/sched.rs - timer polling, cap-enter blocking, direct IPC wake.
  • kernel/src/process.rs - ring page allocation and mapping.
  • capos-rt/src/ring.rs - runtime ring client, pending calls, transfer packing, result-cap parsing.
  • capos-rt/src/entry.rs - single-owner runtime ring client token and release queue flushing.
  • capos-config/tests/ring_loom.rs - bounded producer/consumer model.

Validation

  • cargo test-ring-loom validates SQ/CQ producer-consumer behavior, capacity, FIFO, CQ overflow/drop behavior, and corrupted SQ recovery.
  • make run exercises Console CALLs, reserved opcode rejection, ring corruption recovery, NOP, fairness, transfers, and endpoint IPC.
  • make run-measure exercises measurement-only counters, dispatch segment cycle summaries, the NullCap baseline, the ParkBench compact-versus-generic comparison, and the real ParkSpace blocked/resume timing path.
  • cargo test-config covers shared ring layout and helper invariants.
  • make capos-rt-check checks userspace runtime ring code under the bare-metal target.

Open Work

  • Implement CAP_OP_FINISH as part of the system Cap’n Proto transport.
  • Implement promise pipelining using the reserved pipeline_dep answer ID and pipeline_field result-cap ordinal mapping.
  • Define LINK, DRAIN, and MULTISHOT semantics before accepting those flags.
  • Add runtime-level ParkSpace wrappers and completion demultiplexing on top of the compact opcodes.
  • Add the runtime reactor bridge for multithreaded use of the current process ring, then replace it with per-thread Ring v2 completion ownership as the kernel fast path.
  • Add SQPOLL after SMP gives the kernel a spare execution context.