# Capability Ring

The capability ring is the userspace-to-kernel transport for capability
invocation. It avoids one syscall per operation while preserving a typed
Cap'n Proto method boundary and explicit completion reporting.

The current error model is documented in
[Error Handling](error-handling.md). Ring CQE status values report transport
failures; typed capability exceptions and ordinary schema result unions sit
above that transport layer.


## Current Behavior

Each non-idle process gets one 4 KiB ring page mapped at `RING_VADDR`. The page
contains a volatile header, a 16-entry submission queue, and a 32-entry
completion queue. Userspace writes `CapSqe` records, advances `sq_tail`, and
uses `cap_enter(min_complete, timeout_ns)` to make ordinary calls progress.
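
The publication step can be sketched as a small single-producer model. This is an illustration, not the ABI in `capos-config/src/ring.rs`: `ModelSqe` and `ModelSq` are hypothetical stand-ins, and the sketch assumes free-running `u32` indices that are masked on access, which the text above does not confirm.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Submission queue size taken from the text above (a power of two).
pub const SQ_ENTRIES: u32 = 16;

/// Hypothetical stand-in for the real 64-byte `CapSqe` record.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct ModelSqe {
    pub user_data: u64,
}

/// Illustrative model of the submission side of the ring page.
pub struct ModelSq {
    pub sq_head: AtomicU32, // advanced by the consumer (the kernel)
    pub sq_tail: AtomicU32, // advanced by the producer (userspace)
    pub entries: [ModelSqe; SQ_ENTRIES as usize],
}

impl ModelSq {
    pub fn new() -> Self {
        Self {
            sq_head: AtomicU32::new(0),
            sq_tail: AtomicU32::new(0),
            entries: [ModelSqe::default(); SQ_ENTRIES as usize],
        }
    }

    /// Producer: write the record first, then publish it by advancing
    /// `sq_tail`. Returns false when the queue is full.
    pub fn submit(&mut self, sqe: ModelSqe) -> bool {
        let head = self.sq_head.load(Ordering::Acquire);
        let tail = self.sq_tail.load(Ordering::Relaxed);
        if tail.wrapping_sub(head) >= SQ_ENTRIES {
            return false;
        }
        self.entries[(tail & (SQ_ENTRIES - 1)) as usize] = sqe;
        // Release ordering makes the record visible before the new tail.
        self.sq_tail.store(tail.wrapping_add(1), Ordering::Release);
        true
    }

    /// Consumer: take one record from `sq_head..sq_tail`, if any.
    pub fn pop(&mut self) -> Option<ModelSqe> {
        let head = self.sq_head.load(Ordering::Relaxed);
        let tail = self.sq_tail.load(Ordering::Acquire);
        if head == tail {
            return None;
        }
        let sqe = self.entries[(head & (SQ_ENTRIES - 1)) as usize];
        self.sq_head.store(head.wrapping_add(1), Ordering::Release);
        Some(sqe)
    }
}
```

In this model, submitting is only a memory write plus an index store; nothing runs until a consumer reads `sq_head..sq_tail`, which is the role `cap_enter` plays for ordinary calls.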

```mermaid
sequenceDiagram
    participant U as Userspace runtime
    participant R as Ring page
    participant K as Kernel ring dispatcher
    participant C as Capability object
    U->>R: write CapSqe and advance sq_tail
    U->>K: cap_enter(min_complete, timeout_ns)
    K->>R: read sq_head..sq_tail
    K->>K: validate SQE fields and lock AddressSpace for user buffers
    K->>C: call method or endpoint operation
    C-->>K: completion, pending, or error
    K->>R: write CapCqe and advance cq_tail
    K-->>U: return available CQE count
    U->>R: read matching CapCqe
```

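Timer polling also processes the ring of each currently running process before
preemption, but only non-CALL operations and CALL targets that explicitly allow
interrupt dispatch may run there. Among non-CALL operations, `CAP_OP_PARK` and
`CAP_OP_UNPARK` are excluded as well: they run only from syscall-context
dispatch. Ordinary CALLs wait for `cap_enter`.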
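
The dispatch rule can be written as a small predicate. This is a sketch of the policy as described in this document, not the kernel's code: `DispatchContext`, `OpKind`, and `may_dispatch` are hypothetical names, and the PARK/UNPARK exclusion is taken from the Design section.

```rust
/// Where the kernel is draining the ring from (hypothetical model type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DispatchContext {
    /// Timer polling, which runs in interrupt context.
    Timer,
    /// The `cap_enter` syscall, which runs in process context.
    CapEnter,
}

/// Coarse classification of an SQE for the purpose of this rule.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OpKind {
    /// A CALL; the flag models a target that explicitly allows
    /// interrupt dispatch.
    Call { target_allows_interrupt_dispatch: bool },
    /// PARK or UNPARK, which run only from syscall-context dispatch.
    ParkOrUnpark,
    /// Any other non-CALL operation.
    OtherNonCall,
}

/// Returns true when the operation may run in the given context.
pub fn may_dispatch(ctx: DispatchContext, op: OpKind) -> bool {
    match (ctx, op) {
        (DispatchContext::CapEnter, _) => true,
        (DispatchContext::Timer, OpKind::OtherNonCall) => true,
        (DispatchContext::Timer, OpKind::ParkOrUnpark) => false,
        (
            DispatchContext::Timer,
            OpKind::Call { target_allows_interrupt_dispatch },
        ) => target_allows_interrupt_dispatch,
    }
}
```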

> **Why ordinary CALL waits for `cap_enter`:** Submitting a `CALL` SQE is only a
> shared-memory write. The kernel still needs a safe execution point to drain
> the ring and run capability code. Timer polling runs in interrupt context, so
> it must not execute arbitrary capability methods that may allocate, block on
> locks, mutate page tables, spawn processes, parse Cap'n Proto messages, or
> perform IPC side effects. `cap_enter` is the normal process-context drain
> point: it processes pending SQEs, posts CQEs, and then either returns the
> available completion count or blocks until enough completions arrive. The
> design keeps SQE publication syscall-free and batchable, keeps the syscall ABI
> limited to `exit` and `cap_enter`, and avoids turning the timer interrupt into
> a general capability executor. A future SQPOLL-style path can remove the
> explicit syscall from the hot path only by running dispatch in a worker
> context, not from arbitrary timer interrupt execution.

## Design

`CapSqe` is a fixed 64-byte ABI record whose opcode selects the operation:

- `CAP_OP_CALL` names a local cap-table slot and method ID plus
  parameter/result buffers.
- `CAP_OP_RECV` and `CAP_OP_RETURN` implement endpoint IPC. `CAP_OP_RETURN`
  normally returns successful result bytes to the original caller; with
  `CAP_SQE_RETURN_APPLICATION_EXCEPTION`, its payload is a serialized
  `CapException` and the original caller completes with
  `CAP_ERR_APPLICATION_EXCEPTION` or the truncated application-exception code.
- `CAP_OP_RELEASE` removes a local cap-table slot through the transport.
- `CAP_OP_CANCEL` (opcode 6) cancels a pending endpoint receive posted by the
  same process on the same endpoint cap; `pipeline_dep` carries the receive
  SQE's `user_data`.
- `CAP_OP_NOP` measures the fixed ring path.
- `CAP_OP_PARK_BENCH` (opcode 7) is a measurement-only compact opcode
  dispatched only by kernels built with the `measure` feature; normal kernels
  reject it as malformed.
- `CAP_OP_FINISH` is ABI-reserved and currently returns
  `CAP_ERR_UNSUPPORTED_OPCODE`.

`CAP_OP_RELEASE` is deliberately scoped to local transport cleanup. It removes
one holder's cap-table slot after the SQE is processed, or as part of process
exit cleanup; it does not revoke peer-held caps, cancel delegated authority, or
stand in for an application `close` method. Services that need security-visible
invalidation must use an explicit control path such as `CapabilityManager.revoke`,
session expiry, object epochs, or a service-specific close/revoke protocol.
Reviewers should treat claims based only on handle drop, RAII, GC finalizers, or
queued release flushing as local-cleanup claims, not revocation claims.
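
The difference between local cleanup and revocation can be shown with a toy model. `CapTable` here is a hypothetical per-holder table mapping slots to object ids; it is not the kernel's cap-table type.

```rust
use std::collections::HashMap;

/// Hypothetical per-holder cap table: slot number -> object id.
#[derive(Default)]
pub struct CapTable {
    slots: HashMap<u32, u64>,
}

impl CapTable {
    pub fn insert(&mut self, slot: u32, object: u64) {
        self.slots.insert(slot, object);
    }

    /// Models `CAP_OP_RELEASE`: removes only this holder's slot.
    /// Returns false when the slot was already empty.
    pub fn release(&mut self, slot: u32) -> bool {
        self.slots.remove(&slot).is_some()
    }

    /// True when this holder still has any slot naming `object`.
    pub fn holds(&self, object: u64) -> bool {
        self.slots.values().any(|&o| o == object)
    }
}
```

If two holders each have a slot for the same object and one calls `release`, the other holder's `holds` check still returns true in this model. Invalidating the peer's authority needs one of the explicit control paths named above.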

> **Opcode boundary:** Ring opcodes are kernel ABI, not a loophole around the
> syscall surface. `cap_enter` and `exit` remain the CPU trap entrypoints, but
> every accepted authority-bearing or resource-mutating `CAP_OP_*` still adds
> distinct kernel semantics that must pass the
> [capability method / ring opcode / syscall decision graph](#choosing-a-capability-method-ring-opcode-or-syscall).
> No-authority diagnostics such as `CAP_OP_NOP` are still kernel ABI and must
> stay side-effect-free and review-visible, but they are not resource authority
> paths.
> `CAP_OP_PARK` and `CAP_OP_UNPARK` are justified because blocking
> wait mutates scheduler state, must be thread-owned on the process ring,
> reserves completion credit for later wake/timeout delivery, and needs compact
> capability-authorized hot-path framing. They are not a precedent for moving
> ordinary object methods into the opcode table for convenience.

`CAP_OP_CALL` may set `CAP_SQE_THREAD_OWNED` with `call_id` equal to the owning
thread id. If another thread drains the shared process ring first, the kernel
leaves that SQE at the head instead of consuming it and returns a distinct
owner-head `cap_enter` result instead of blocking the non-owner behind it. This
is limited to context-sensitive self-thread operations such as
`ThreadControl.exitThread`; ordinary runtime submissions leave `call_id = 0`.
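
A minimal model of the owner-head rule follows. The types are hypothetical: the `thread_owned` field stands in for `CAP_SQE_THREAD_OWNED`, and `DrainResult::OwnerHead` stands in for the distinct `cap_enter` result, whose real encoding is not shown in this document.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the fields of `CapSqe` that matter here.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ModelSqe {
    pub thread_owned: bool, // models CAP_SQE_THREAD_OWNED
    pub call_id: u32,       // owning thread id when thread_owned, else 0
    pub user_data: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DrainResult {
    /// Every pending SQE was consumed; carries their `user_data` values.
    Drained(Vec<u64>),
    /// A thread-owned SQE for another thread is at the head and was
    /// left in place.
    OwnerHead { owner: u32, consumed: Vec<u64> },
}

/// Drain the queue on behalf of `current_thread`.
pub fn drain(sq: &mut VecDeque<ModelSqe>, current_thread: u32) -> DrainResult {
    let mut consumed = Vec::new();
    while let Some(head) = sq.front().copied() {
        if head.thread_owned && head.call_id != current_thread {
            // Do not consume: the owner must drain this SQE itself.
            return DrainResult::OwnerHead { owner: head.call_id, consumed };
        }
        sq.pop_front();
        consumed.push(head.user_data);
    }
    DrainResult::Drained(consumed)
}
```

In the model, a non-owner stops at the owned SQE and reports it instead of waiting behind it; when the owning thread drains next, it consumes that SQE and everything queued after it.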

`CAP_OP_PARK` and `CAP_OP_UNPARK` are compact capability-authorized
operations for process-local ParkSpace. Wait SQEs must set
`CAP_SQE_THREAD_OWNED` with `call_id` equal to the owning thread id; a
non-owner `cap_enter` leaves the SQE at the head just like a thread-owned CALL.
They reject promise-pipeline fields and run only from syscall-context ring
dispatch, not timer polling. A blocking wait consumes the SQE but posts no
caller CQE immediately; instead it reserves one waiter CQE credit, parks the
current thread, and later completes with a non-negative park status. Ordinary
CQE posting treats reserved park credits as unavailable so wake and timeout
delivery cannot lose waiter completions.
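
The credit accounting can be sketched as two counters over the 32-entry CQ. `ModelCq` and its method names are hypothetical; the sketch only illustrates why reserved credits must be invisible to ordinary posting.

```rust
/// Completion queue size taken from the Current Behavior section.
pub const CQ_ENTRIES: u32 = 32;

/// Hypothetical accounting model; invariant:
/// `posted + reserved_park <= CQ_ENTRIES`.
#[derive(Debug, Default)]
pub struct ModelCq {
    pub posted: u32,        // CQEs posted and not yet consumed
    pub reserved_park: u32, // credits held for parked waiters
}

impl ModelCq {
    fn free_slots(&self) -> u32 {
        CQ_ENTRIES - self.posted - self.reserved_park
    }

    /// Ordinary posting: reserved park credits count as unavailable.
    pub fn post_ordinary(&mut self) -> bool {
        if self.free_slots() == 0 {
            return false;
        }
        self.posted += 1;
        true
    }

    /// A blocking wait reserves one credit instead of posting a CQE.
    pub fn reserve_park(&mut self) -> bool {
        if self.free_slots() == 0 {
            return false;
        }
        self.reserved_park += 1;
        true
    }

    /// Wake or timeout delivery turns a reserved credit into a CQE.
    pub fn complete_park(&mut self) -> bool {
        if self.reserved_park == 0 {
            return false;
        }
        self.reserved_park -= 1;
        self.posted += 1;
        true
    }
}
```

With one waiter parked, the model accepts 31 ordinary CQEs and refuses the 32nd, so the later wake or timeout always has a slot to land in.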

The kernel copies user params into preallocated per-process scratch, dispatches
capability methods, copies serialized results into caller-provided result
buffers, and posts `CapCqe`. Current-process user copies and transfer-descriptor
loads hold the caller's `AddressSpace` mutex across permission validation and
the actual HHDM-backed copy/read. A successful method returns non-negative bytes
written. Transport failures are negative `CAP_ERR_*` codes. Application
exceptions are serialized `CapException` payloads with
`CAP_ERR_APPLICATION_EXCEPTION`. Ordinary capability implementation errors and
live endpoint CALL/RETURN target errors use this application-exception path
once a valid target cap or accepted endpoint relationship has been identified;
malformed ring metadata, bad user buffers, lookup failures, and endpoint
rollback/transfer failures stay in the transport namespace.
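
A runtime might classify a completion's result value along these lines. This is a hedged sketch: the `i32` result width and the numeric value of `CAP_ERR_APPLICATION_EXCEPTION` are placeholders chosen for illustration (the real definitions live in `capos-config/src/ring.rs`), and the truncated application-exception code mentioned earlier is not modeled.

```rust
/// Placeholder value for illustration only; not the real constant.
pub const CAP_ERR_APPLICATION_EXCEPTION: i32 = -100;

#[derive(Debug, PartialEq, Eq)]
pub enum Completion {
    /// Success: the number of result bytes written.
    Ok { bytes_written: usize },
    /// The result buffer holds a serialized `CapException`.
    ApplicationException,
    /// Any other negative `CAP_ERR_*` transport code.
    Transport { code: i32 },
}

/// Map a raw CQE result value onto the three cases described above.
pub fn classify(result: i32) -> Completion {
    if result >= 0 {
        Completion::Ok { bytes_written: result as usize }
    } else if result == CAP_ERR_APPLICATION_EXCEPTION {
        Completion::ApplicationException
    } else {
        Completion::Transport { code: result }
    }
}
```

Keeping the application-exception case separate lets callers decode the `CapException` payload only when one is actually present, and treat every other negative value as a transport failure.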

Transfer-bearing CALL and RETURN SQEs pack `CapTransferDescriptor` records
after the params/result payload. Successful result-cap transfers append
`CapTransferResult` records after normal result bytes.

Promise-pipelined CALLs remain rejected by current kernels. In the reserved
design, a CALL with the promise-pipeline flag set uses `pipeline_dep` to name a
process-local promised-answer identifier and `pipeline_field` to select a
zero-based `CapTransferResult` record from that answer's completion.
`pipeline_field` is not a Cap'n Proto schema field number or payload path. The
kernel resolves dependencies only through the sideband result-cap records it
already owns; normal result bytes stay opaque to the transport.
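
Because current kernels reject pipelined CALLs, the following is only a sketch of the intended lookup. `AnswerTable`, `ModelTransferResult`, and the field widths are hypothetical; the point is that resolution indexes the sideband records by ordinal and never parses result bytes.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a `CapTransferResult` record.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ModelTransferResult {
    pub cap_id: u32,
}

/// Hypothetical process-local table: promised-answer id -> sideband
/// result-cap records from that answer's completion.
#[derive(Default)]
pub struct AnswerTable {
    results: HashMap<u64, Vec<ModelTransferResult>>,
}

impl AnswerTable {
    /// Record the sideband records of a completed answer.
    pub fn record(&mut self, answer_id: u64, caps: Vec<ModelTransferResult>) {
        self.results.insert(answer_id, caps);
    }

    /// Resolve a dependency: `pipeline_field` is a zero-based ordinal
    /// into the answer's records, not a schema field number.
    pub fn resolve(
        &self,
        pipeline_dep: u64,
        pipeline_field: usize,
    ) -> Option<ModelTransferResult> {
        self.results.get(&pipeline_dep)?.get(pipeline_field).copied()
    }
}
```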

Future behavior should use the reserved SQE fields for system transport
features, not ad hoc per-interface extensions.

## Choosing A Capability Method, Ring Opcode, Or Syscall

New kernel functionality should default to a normal typed capability method.
The small syscall surface is only the trap surface; the ring opcode table is
also a reviewed kernel ABI and must stay narrow. The decision tree below is a
full-page reference in the PDF because the branches are easier to read at
diagram scale than as compressed prose.

```mermaid
flowchart TD
    Start[New kernel-visible operation] --> Ambient{Must it run without any held capability?}
    Ambient -- yes --> Trap{Is it process lifecycle or kernel-entry control?}
    Trap -- yes --> Syscall[Consider a syscall]
    Trap -- no --> RejectAmbient[Reject or redesign around explicit authority]
    Ambient -- no --> CapMethod{Can it be expressed as a typed object method?}
    CapMethod -- no --> Redesign[Redesign the authority object or transport contract]
    CapMethod -- yes --> Hot{Is generic Cap'n Proto CALL materially wrong?}
    Hot -- no --> Method[Use CAP_OP_CALL to a capability method]
    Hot -- yes --> RingSpecific{Does it need ring/scheduler-specific semantics?}
    RingSpecific -- no --> Method
    RingSpecific -- yes --> Stable{Is the compact SQE/CQE ABI stable and capability-authorized?}
    Stable -- no --> MethodOrDesign[Keep a capability method or write a reviewed design first]
    Stable -- yes --> Opcode[Consider a new CAP_OP_* opcode]
```

Use a normal capability method when the operation is control-plane,
policy-driven, service-specific, infrequent, or naturally represented by
Cap'n Proto params/results. Process spawning, credential checks, storage
naming, shell or network policy, virtual-memory control-plane calls, and most
device-specific commands belong here unless measurement and design review prove
otherwise.

Consider a compact ring opcode only when all of these are true:

- The operation is a hot path or scheduler path where generic Cap'n Proto
  framing is materially wrong.
- The operation has a small, stable field layout that fits the existing
  SQE/CQE model without per-interface ad hoc extensions.
- It needs ring-specific behavior such as thread ownership, reserved
  completion credit, CQ ordering/backpressure, asynchronous completion
  delivery, or interaction with the process ring head.
- It remains authorized by a held capability in `cap_id`, not by ambient
  process identity or guessed kernel object names.
- It cannot be handled as a normal capability method plus a future generated
  fast client without losing an essential scheduler or transport invariant.

Consider a new syscall only when the operation is about entering or leaving the
kernel execution context itself and cannot sensibly be authorized by a
capability already available to the process. That bar is intentionally higher
than the opcode bar. Ordinary resource operations should not become syscalls
just because they are common.
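
As a review aid, the decision graph above can be transcribed into a function. The struct fields and outcome names are hypothetical labels for the diagram's questions and leaves; the function encodes only the branch structure, not the judgment behind each answer.

```rust
/// Answers to the questions in the decision graph (hypothetical names).
#[derive(Clone, Copy, Debug, Default)]
pub struct Proposal {
    pub must_run_without_held_capability: bool,
    pub is_lifecycle_or_kernel_entry_control: bool,
    pub expressible_as_typed_method: bool,
    pub generic_call_materially_wrong: bool,
    pub needs_ring_or_scheduler_semantics: bool,
    pub compact_abi_stable_and_cap_authorized: bool,
}

/// Leaves of the decision graph.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    ConsiderSyscall,
    RejectOrRedesignAuthority,
    RedesignObjectOrTransport,
    CapabilityMethod,
    MethodOrReviewedDesignFirst,
    ConsiderOpcode,
}

/// Walk the graph from the top, one question per branch.
pub fn decide(p: Proposal) -> Outcome {
    if p.must_run_without_held_capability {
        return if p.is_lifecycle_or_kernel_entry_control {
            Outcome::ConsiderSyscall
        } else {
            Outcome::RejectOrRedesignAuthority
        };
    }
    if !p.expressible_as_typed_method {
        return Outcome::RedesignObjectOrTransport;
    }
    if !p.generic_call_materially_wrong || !p.needs_ring_or_scheduler_semantics {
        return Outcome::CapabilityMethod;
    }
    if p.compact_abi_stable_and_cap_authorized {
        Outcome::ConsiderOpcode
    } else {
        Outcome::MethodOrReviewedDesignFirst
    }
}
```

Note that only one path reaches `ConsiderOpcode`, and it requires a yes at every opcode-related question, which matches the "all of these are true" list above.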

## Full-SMP Direction

The current process-wide ring is not the target ABI for full SMP. Once sibling
threads in one process can run on different CPUs, a shared process CQ would
force userspace to serialize completion consumption or the kernel to invent
specific-wait state on top of circular-buffer slots.

The selected future direction is per-thread ring ownership, documented in
[Ring v2 For Full SMP](../proposals/ring-v2-smp-proposal.md). In that model,
`cap_enter(min_complete, timeout_ns)` keeps its current aggregate wait shape,
but the aggregate is the current thread's CQ. Completion paths post by
generation-checked `ThreadRef`, while result-cap transfers and authority still
belong to the process cap table.

The first Ring v2 implementation should use kernel-chosen child-thread ring
mappings. The initial fixed `RING_VADDR` mapping becomes a compatibility
special case backed by the same `RingEndpoint` lifetime and waiter rules as
child-thread rings. Runtime-supplied ring address ranges are deferred until
`VirtualMemory` can reserve a ring arena without racing ordinary mappings.

The initial Phase C multi-CPU scheduler proof may continue to use the current
process-wide ring as long as userspace serializes ring consumption. Ring v2 is
the target for full SMP with sibling threads from one process running and
waiting independently on different CPUs.

A runtime reactor can bridge the current process-wide ring for multithreaded
runtimes before Ring v2: one runtime-owned drainer consumes the process CQ,
matches completions by `user_data`, and wakes waiting threads through
ParkSpace. That bridge is not the full-SMP kernel ABI.
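
The demultiplexing step of such a bridge might look like the sketch below. Since the reactor is still listed as open work, everything here is hypothetical: `Reactor`, `ModelCqe`, and the assumption of at most one outstanding completion per thread. The real bridge would wake threads through ParkSpace instead of returning a list of thread ids.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the fields of `CapCqe` used here.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ModelCqe {
    pub user_data: u64,
    pub result: i32,
}

/// Single runtime-owned drainer state (illustrative only).
#[derive(Default)]
pub struct Reactor {
    waiting: HashMap<u64, u32>,    // user_data -> waiting thread id
    ready: HashMap<u32, ModelCqe>, // thread id -> delivered completion
}

impl Reactor {
    /// A thread registers interest before it parks.
    pub fn register(&mut self, user_data: u64, thread: u32) {
        self.waiting.insert(user_data, thread);
    }

    /// Drainer: match each consumed CQE by `user_data` and return the
    /// threads to wake. Unmatched CQEs are ignored in this model.
    pub fn drain(&mut self, cqes: &[ModelCqe]) -> Vec<u32> {
        let mut to_wake = Vec::new();
        for cqe in cqes {
            if let Some(thread) = self.waiting.remove(&cqe.user_data) {
                self.ready.insert(thread, *cqe);
                to_wake.push(thread);
            }
        }
        to_wake
    }

    /// A woken thread collects its completion.
    pub fn take(&mut self, thread: u32) -> Option<ModelCqe> {
        self.ready.remove(&thread)
    }
}
```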

## Invariants

- SQ and CQ sizes are powers of two and fixed by the ABI.
- Unknown opcodes fail closed; `FINISH` is reserved, not silently accepted.
- Reserved fields must be zero for currently implemented opcodes, except that
  `CAP_SQE_THREAD_OWNED` CALL and PARK SQEs may carry the owning thread id in
  `call_id`.
- PARK/UNPARK SQEs must keep unsupported fields zero and must not be
  dispatched from timer context.
- `cap_enter` rejects `min_complete > CQ_ENTRIES`.
- User-buffer validation and copy/read must hold the owning process
  `AddressSpace` mutex for CALL params/results, RECV result buffers, RETURN
  payloads, transfer descriptors, and deferred same-process completions.
- Timer dispatch must not run capabilities that allocate, block on locks, or
  mutate page tables unless the cap explicitly opts in.
- Per-dispatch SQE processing is bounded by `SQ_ENTRIES`.
- Transfer descriptors must be aligned, valid, and bounded by
  `MAX_TRANSFER_DESCRIPTORS`.
- Promise-pipelined dependency resolution must use sideband
  `CapTransferResult` ordinals, never general Cap'n Proto result traversal in
  the kernel.
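
A few of these invariants are simple enough to state as checks. The sketch below assumes free-running `u32` indices and uses a hypothetical `RingError` type; how the real kernel recovers from corrupted indices is not modeled.

```rust
/// Queue sizes from the Current Behavior section.
pub const SQ_ENTRIES: u32 = 16;
pub const CQ_ENTRIES: u32 = 32;

// Compile-time checks that both sizes are powers of two.
const _: () = assert!(SQ_ENTRIES.is_power_of_two());
const _: () = assert!(CQ_ENTRIES.is_power_of_two());

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum RingError {
    MinCompleteTooLarge,
    CorruptedIndices,
}

/// `cap_enter` argument check: reject `min_complete > CQ_ENTRIES`.
pub fn check_min_complete(min_complete: u32) -> Result<u32, RingError> {
    if min_complete > CQ_ENTRIES {
        Err(RingError::MinCompleteTooLarge)
    } else {
        Ok(min_complete)
    }
}

/// Number of SQEs one dispatch pass may process. A distance larger
/// than `SQ_ENTRIES` cannot come from a well-behaved producer, so the
/// model reports it instead of looping over it.
pub fn pending_sqes(sq_head: u32, sq_tail: u32) -> Result<u32, RingError> {
    let pending = sq_tail.wrapping_sub(sq_head);
    if pending > SQ_ENTRIES {
        Err(RingError::CorruptedIndices)
    } else {
        Ok(pending)
    }
}
```

Using `wrapping_sub` keeps the distance correct when the indices wrap past `u32::MAX`, and the bound on `pending` is what keeps per-dispatch work limited to `SQ_ENTRIES`.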

## Code Map

- `capos-config/src/ring.rs` - shared ring ABI, opcodes, errors, SQE/CQE
  structs, endpoint message headers, transfer records.
- `kernel/src/cap/ring.rs` - kernel dispatcher, SQE validation, CQE posting,
  cap calls, endpoint CALL/RECV/RETURN, release, transfer framing.
- `kernel/src/arch/x86_64/syscall.rs` - `cap_enter` syscall.
- `kernel/src/sched.rs` - timer polling, cap-enter blocking, direct IPC wake.
- `kernel/src/process.rs` - ring page allocation and mapping.
- `capos-rt/src/ring.rs` - runtime ring client, pending calls, transfer packing,
  result-cap parsing.
- `capos-rt/src/entry.rs` - single-owner runtime ring client token and release
  queue flushing.
- `capos-config/tests/ring_loom.rs` - bounded producer/consumer model.

## Validation

- `cargo test-ring-loom` validates SQ/CQ producer-consumer behavior, capacity,
  FIFO, CQ overflow/drop behavior, and corrupted SQ recovery.
- `make run` exercises Console CALLs, reserved opcode rejection, ring corruption
  recovery, NOP, fairness, transfers, and endpoint IPC.
- `make run-measure` exercises measurement-only counters, dispatch segment
  cycle summaries, the NullCap baseline, the ParkBench compact-versus-generic
  comparison, and the real ParkSpace blocked/resume timing path.
- `cargo test-config` covers shared ring layout and helper invariants.
- `make capos-rt-check` checks userspace runtime ring code under the
  bare-metal target.

## Open Work

- Implement `CAP_OP_FINISH` as part of the system Cap'n Proto transport.
- Implement promise pipelining using the reserved `pipeline_dep` answer ID and
  `pipeline_field` result-cap ordinal mapping.
- Define LINK, DRAIN, and MULTISHOT semantics before accepting those flags.
- Add runtime-level ParkSpace wrappers and completion demultiplexing on top of
  the compact opcodes.
- Add the runtime reactor bridge for multithreaded use of the current process
  ring, then supersede it with per-thread Ring v2 completion ownership as the
  kernel fast path.
- Add SQPOLL after SMP gives the kernel a spare execution context.