# Fuchsia Zircon Kernel: Research Report for capOS

Research into Zircon's design to inform capOS decisions on the capability
model, IPC, virtual memory, async I/O, and interface definitions.

## 1. Handle-Based Capability Model

### Overview

Zircon implements capabilities as **handles**. A handle is a process-local
integer (similar to a Unix file descriptor) that references a kernel object and
carries a bitmask of **rights**. The kernel maintains a per-process handle table
that maps handle values to (kernel_object_pointer, rights) pairs. Processes can
only interact with kernel objects through handles they hold.

There is no ambient authority in Zircon. A process cannot address kernel objects
by name, path, or global ID -- it must possess a handle. The initial set of
handles is passed to a process at creation time by its parent (or by the
component framework).

### Handle Representation

Internally, a handle is:
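- A **process-local 32-bit integer** (the "handle value"). The two low bits
  of a valid handle value are always set (`ZX_HANDLE_FIXED_BITS_MASK`); stale
  handles are caught because the kernel derives the remaining bits from its
  internal handle slot and a generation count, so a closed value is unlikely
  to match a later handle.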
- A reference to a **kernel object** (refcounted `Dispatcher` in Zircon's C++).
- A **rights bitmask** (`zx_rights_t`, a `uint32_t`).

The handle table is per-process, so handle value `0x1234` in process A and
`0x1234` in process B refer to completely different objects (or nothing).

### Rights

Rights are a bitmask that constrain what operations a handle can perform.
Key rights include:

| Right | Meaning |
|---|---|
| `ZX_RIGHT_DUPLICATE` | Can be duplicated via `zx_handle_duplicate()` |
| `ZX_RIGHT_TRANSFER` | Can be sent through a channel |
| `ZX_RIGHT_READ` | Can read data (channel messages, VMO bytes) |
| `ZX_RIGHT_WRITE` | Can write data |
| `ZX_RIGHT_EXECUTE` | VMO can be mapped as executable |
| `ZX_RIGHT_MAP` | VMO can be mapped into a VMAR |
| `ZX_RIGHT_GET_PROPERTY` | Can query object properties |
| `ZX_RIGHT_SET_PROPERTY` | Can modify object properties |
| `ZX_RIGHT_SIGNAL` | Can set user signals on the object |
| `ZX_RIGHT_WAIT` | Can wait on the object's signals |
| `ZX_RIGHT_MANAGE_PROCESS` | Can perform management ops on a process |
| `ZX_RIGHT_MANAGE_THREAD` | Can manage threads |

When a syscall is invoked on a handle, the kernel checks that the handle's
rights include the rights required by that syscall. For example,
`zx_channel_write()` requires `ZX_RIGHT_WRITE` on the channel handle.

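Rights can only be **reduced**, never amplified. `zx_handle_duplicate()` takes
a rights mask that must be a subset of the source handle's rights (or
`ZX_RIGHT_SAME_RIGHTS`); requesting a right the source lacks fails with
`ZX_ERR_INVALID_ARGS` rather than being silently masked off.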
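
The rights check and attenuating duplicate can be sketched in a few lines of Rust. This is a minimal model, not Zircon's implementation: the `HandleTable`, `Entry`, and `Error` types and the bit positions of the rights are all hypothetical, and the sketch assumes a duplicate request that names a right the source lacks is rejected outright.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Rights bitmask; bit positions are illustrative, not Zircon's values.
pub type Rights = u32;
pub const RIGHT_DUPLICATE: Rights = 1 << 0;
pub const RIGHT_TRANSFER: Rights = 1 << 1;
pub const RIGHT_READ: Rights = 1 << 2;
pub const RIGHT_WRITE: Rights = 1 << 3;

#[derive(Debug, PartialEq)]
pub enum Error {
    BadHandle,
    AccessDenied,
    InvalidArgs,
}

/// One table entry: a shared object reference plus the rights of this handle.
struct Entry {
    object: Arc<String>, // stand-in for a refcounted kernel object
    rights: Rights,
}

#[derive(Default)]
pub struct HandleTable {
    entries: HashMap<u32, Entry>,
    next: u32,
}

impl HandleTable {
    pub fn insert(&mut self, object: Arc<String>, rights: Rights) -> u32 {
        self.next += 1;
        self.entries.insert(self.next, Entry { object, rights });
        self.next
    }

    /// Look up a handle and verify it carries every right in `required`.
    pub fn check(&self, handle: u32, required: Rights) -> Result<Arc<String>, Error> {
        let e = self.entries.get(&handle).ok_or(Error::BadHandle)?;
        if (e.rights & required) != required {
            return Err(Error::AccessDenied);
        }
        Ok(Arc::clone(&e.object))
    }

    /// Duplicate with equal or fewer rights; asking for more is rejected.
    pub fn duplicate(&mut self, handle: u32, requested: Rights) -> Result<u32, Error> {
        let object = self.check(handle, RIGHT_DUPLICATE)?;
        let held = self.entries[&handle].rights;
        if (requested & !held) != 0 {
            return Err(Error::InvalidArgs);
        }
        Ok(self.insert(object, requested))
    }
}
```

A read-only duplicate made this way cannot be duplicated again, because the new handle no longer carries `RIGHT_DUPLICATE`; attenuation composes without any extra bookkeeping.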

### Handle Lifecycle

**Creation**: Syscalls that create kernel objects return handles. For example,
`zx_channel_create()` returns two handles (one for each endpoint).
`zx_vmo_create()` returns a VMO handle. The initial rights are defined per
object type (e.g., a new channel endpoint gets
`READ|WRITE|TRANSFER|WAIT|INSPECT|SIGNAL|SIGNAL_PEER`, deliberately without
`DUPLICATE`).

**Duplication**: `zx_handle_duplicate(handle, rights) -> new_handle`. Creates
a second handle to the same kernel object, possibly with reduced rights. The
original is untouched. Requires `ZX_RIGHT_DUPLICATE` on the source handle.

**Transfer**: Handles are transferred through channels. When a message is
written to a channel, handles listed in the message are **moved** from the
sender's handle table to a transient state inside the channel message. When the
message is read, those handles are installed into the receiver's handle table
with new handle values. The original handle values in the sender become invalid.
Transfer requires `ZX_RIGHT_TRANSFER` on each handle being sent.

**Replacement**: `zx_handle_replace(handle, rights) -> new_handle`. Atomically
invalidates the old handle and creates a new one with the specified rights
(must be a subset). This avoids a window where two handles exist simultaneously
(unlike duplicate-then-close). Useful for reducing rights before transferring.

**Closing**: `zx_handle_close(handle)`. Removes the handle from the process's
table and decrements the kernel object's refcount. When the last handle to an
object is closed, the object is destroyed (with some exceptions like the
kernel itself keeping references).
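
The replace and close steps can be modelled with a reference-counted object behind each table slot. This is a hedged sketch: `Table` and `KernelObject` are hypothetical names, rights are a bare `u32`, and a rejected `replace` simply leaves the old handle in place, which is a simplification rather than a statement about Zircon's failure behaviour.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for a refcounted kernel object.
pub struct KernelObject(pub &'static str);

#[derive(Default)]
pub struct Table {
    slots: HashMap<u32, (Arc<KernelObject>, u32)>, // value -> (object, rights)
    next: u32,
}

impl Table {
    pub fn install(&mut self, obj: Arc<KernelObject>, rights: u32) -> u32 {
        self.next += 1;
        self.slots.insert(self.next, (obj, rights));
        self.next
    }

    /// Replace: the old value dies and a new value with a subset of its
    /// rights is issued in the same step, so two handles never coexist.
    pub fn replace(&mut self, handle: u32, rights: u32) -> Option<u32> {
        let held = self.slots.get(&handle)?.1;
        if (rights & !held) != 0 {
            return None; // would amplify rights
        }
        let (obj, _) = self.slots.remove(&handle)?;
        Some(self.install(obj, rights))
    }

    /// Close: drop this table's reference; false for an unknown handle.
    pub fn close(&mut self, handle: u32) -> bool {
        self.slots.remove(&handle).is_some()
    }

    pub fn rights(&self, handle: u32) -> Option<u32> {
        self.slots.get(&handle).map(|s| s.1)
    }
}
```

Because the slot owns an `Arc`, closing the last handle drops the last strong reference and the object is freed, which mirrors the refcount behaviour described above.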

### Comparison to capOS

capOS's current `CapTable` maps `CapId` (u32) to an `Arc<dyn CapObject>`. The
shared `Arc` lets a single kernel capability (for example, a `kernel:endpoint`
owned by one service and referenced by another through `CapSource::Service`)
back multiple per-process `CapTable` slots for cross-process IPC. This is
conceptually similar to Zircon's handle table, but with key differences:

| Aspect | Zircon | capOS (current) |
|---|---|---|
| Rights | Bitmask per handle | None (all-or-nothing) |
| Object types | Fixed kernel types (Channel, VMO, etc.) | Extensible via `CapObject` trait |
| Transfer | Move semantics through channels | Copy/move descriptors through Endpoint IPC |
| Duplication | Explicit with rights reduction | Copy transfer for transferable holds |
| Revocation | Close handle; object dies with last ref | Remove from table; no propagation |
| Interface | Fixed syscall per object type | Cap'n Proto method dispatch |
| Stale-handle detection | Kernel-derived handle values (low two bits always set) | Generation tag in upper bits of `CapId` |

**Recommendations for capOS:**

1. **Keep method authority in typed interfaces for now.** Zircon's rights
   bitmask is useful for an untyped syscall surface. capOS currently uses
   narrow Cap'n Proto interfaces plus hold-edge transfer metadata; generic
   READ/WRITE flags would duplicate schema-level authority unless a concrete
   cross-interface need appears.

2. **Handle generation counters.** Implemented: capOS encodes a generation
   tag in the upper bits of `CapId`, with lower bits selecting the table slot.
   This catches stale CapId use after slot reuse.

3. **Move semantics for transfer.** Implemented for Endpoint CALL/RETURN
   sideband descriptors. Copy transfer remains explicit and requires a
   transferable source hold.

4. **`replace` operation.** An atomic replace (invalidate old, create new with
   reduced rights) is cleaner than duplicate-then-close for rights attenuation
   before transfer.
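
The generation-tag scheme from recommendation 2 can be illustrated as follows. This is a sketch under stated assumptions, not capOS's actual code: the 8/24 bit split, the `CapTable<T>` shape, and the first-free-slot reuse policy are all hypothetical.

```rust
/// Hypothetical split: 8 generation bits above 24 slot-index bits.
pub const SLOT_BITS: u32 = 24;
pub const SLOT_MASK: u32 = (1 << SLOT_BITS) - 1;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CapId(pub u32);

impl CapId {
    pub fn new(generation: u8, slot: u32) -> Self {
        CapId(((generation as u32) << SLOT_BITS) | (slot & SLOT_MASK))
    }
    pub fn slot(self) -> usize {
        (self.0 & SLOT_MASK) as usize
    }
    pub fn generation(self) -> u8 {
        (self.0 >> SLOT_BITS) as u8
    }
}

struct Slot<T> {
    generation: u8,
    value: Option<T>,
}

pub struct CapTable<T> {
    slots: Vec<Slot<T>>,
}

impl<T> CapTable<T> {
    pub fn new() -> Self {
        CapTable { slots: Vec::new() }
    }

    pub fn insert(&mut self, value: T) -> CapId {
        // Reuse the first free slot, otherwise grow the table.
        let idx = match self.slots.iter().position(|s| s.value.is_none()) {
            Some(i) => i,
            None => {
                self.slots.push(Slot { generation: 0, value: None });
                self.slots.len() - 1
            }
        };
        self.slots[idx].value = Some(value);
        CapId::new(self.slots[idx].generation, idx as u32)
    }

    pub fn get(&self, id: CapId) -> Option<&T> {
        let s = self.slots.get(id.slot())?;
        if s.generation != id.generation() {
            return None; // stale id from before the slot was reused
        }
        s.value.as_ref()
    }

    pub fn remove(&mut self, id: CapId) -> Option<T> {
        let s = self.slots.get_mut(id.slot())?;
        if s.generation != id.generation() {
            return None;
        }
        let v = s.value.take()?;
        s.generation = s.generation.wrapping_add(1); // invalidate old ids
        Some(v)
    }
}
```

With only 8 generation bits the tag wraps after 256 reuses of a slot, so the check is a strong hint rather than a guarantee; a wider tag trades slot count for a longer wrap period.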


## 2. Channels

### Overview

Zircon channels are the fundamental IPC primitive. A channel is a bidirectional,
asynchronous message-passing pipe with two endpoints. Each endpoint is a
separate kernel object referenced by a handle.

### Creation and Structure

`zx_channel_create(options, &handle0, &handle1)` creates a channel and returns
handles to both endpoints. Each endpoint can be independently transferred to
different processes. When one endpoint is closed, the other becomes
"peer-closed" (signaled with `ZX_CHANNEL_PEER_CLOSED`).

### Message Format

A channel message consists of:
- **Data**: Up to **65,536 bytes** (64 KiB) of arbitrary byte payload.
- **Handles**: Up to **64 handles** transferred with the message.

Messages are discrete and ordered (FIFO). There is no streaming or partial
reads -- you read a complete message or nothing.

### Write and Read Syscalls

**Write**: `zx_channel_write(handle, options, bytes, num_bytes, handles, num_handles)`
- Copies `bytes` into the kernel message queue.
- **Moves** each handle in the `handles` array from the caller's handle table
  into the message. If any handle is invalid or lacks `ZX_RIGHT_TRANSFER`,
  the entire write fails and nothing is delivered to the peer; on failure the
  handles in the array are discarded rather than returned to the caller.
- The write is non-blocking. If the peer has been closed, returns
  `ZX_ERR_PEER_CLOSED`.

**Read**: `zx_channel_read(handle, options, bytes, handles, num_bytes, num_handles, actual_bytes, actual_handles)`
- Dequeues the next message. Copies data into `bytes`, installs handles into
  the caller's handle table, writing new handle values into the `handles`
  array.
- If the buffer is too small, returns `ZX_ERR_BUFFER_TOO_SMALL` and fills
  `actual_bytes`/`actual_handles` so the caller can retry with a larger buffer.
- Non-blocking by default.

**`zx_channel_call`**: A synchronous call primitive. Writes a message to the
channel, then blocks until a reply with a matching transaction ID arrives or
the deadline expires. The kernel assigns the transaction ID (the first four
bytes of the message), so several threads can call on one channel and each
receives its own reply. This is the primary mechanism for client-server RPC,
and it collapses the write/wait/read sequence into a single syscall.

### Handle Transfer Mechanics

When handles are sent through a channel:
1. The kernel validates all handles (exist, have `TRANSFER` right).
2. Handles are atomically removed from the sender's table.
3. Handle objects are stored inside the kernel message structure.
4. On read, handles are inserted into the receiver's table with fresh
   handle values.
5. If the channel is destroyed with unread messages containing handles,
   those handles are closed (objects' refcounts decremented).

This is critical: handle transfer is **move**, not copy. The sender loses the
handle. To keep a copy, the sender must `duplicate` before sending.
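
The move semantics can be modelled with two per-process tables and a message queue. The sketch below is illustrative only: `Process`, `Channel`, `Handle`, and `Message` are hypothetical types, a boolean stands in for the `TRANSFER` right, and what happens to the sender's handles after a failed write is deliberately not modelled (the sketch simply leaves them in place).

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Debug, PartialEq)]
pub struct Handle {
    pub object: &'static str,
    pub transferable: bool, // stand-in for the TRANSFER right
}

pub struct Message {
    pub bytes: Vec<u8>,
    pub handles: Vec<Handle>,
}

#[derive(Default)]
pub struct Process {
    pub table: HashMap<u32, Handle>,
    next: u32,
}

impl Process {
    pub fn install(&mut self, h: Handle) -> u32 {
        self.next += 1;
        self.table.insert(self.next, h);
        self.next
    }
}

#[derive(Default)]
pub struct Channel {
    queue: VecDeque<Message>,
}

impl Channel {
    /// Validate every handle first, then move them all into the message.
    pub fn write(
        &mut self,
        from: &mut Process,
        bytes: &[u8],
        handles: &[u32],
    ) -> Result<(), &'static str> {
        for (i, h) in handles.iter().enumerate() {
            if handles[..i].contains(h) {
                return Err("duplicate handle");
            }
            match from.table.get(h) {
                None => return Err("bad handle"),
                Some(e) if !e.transferable => return Err("access denied"),
                Some(_) => {}
            }
        }
        // All checks passed: the sender's values become invalid from here on.
        let moved: Vec<Handle> = handles
            .iter()
            .map(|h| from.table.remove(h).unwrap())
            .collect();
        self.queue.push_back(Message { bytes: bytes.to_vec(), handles: moved });
        Ok(())
    }

    /// Dequeue one whole message and install its handles under fresh values.
    pub fn read(&mut self, into: &mut Process) -> Option<(Vec<u8>, Vec<u32>)> {
        let msg = self.queue.pop_front()?;
        let values: Vec<u32> = msg.handles.into_iter().map(|h| into.install(h)).collect();
        Some((msg.bytes, values))
    }
}
```

The receiver's handle values are allocated from its own table, so they bear no relation to the values the sender used.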

### Signals

Each channel endpoint has associated signals:
- `ZX_CHANNEL_READABLE` -- at least one message is queued.
- `ZX_CHANNEL_PEER_CLOSED` -- the other endpoint was closed.

Processes can wait on these signals using `zx_object_wait_one()`,
`zx_object_wait_many()`, or by binding to a **port** (see Section 4).

### FIDL Relationship

Channels carry raw bytes + handles. **FIDL** (Section 5) provides the
structured protocol layer on top: it defines how bytes are laid out (message
header with transaction ID, ordinal, flags; then the payload) and how handles
in the message correspond to protocol-level concepts (client endpoints, server
endpoints, VMOs, etc.).

Every FIDL protocol communication happens over a channel. A FIDL "client end"
is a channel endpoint handle where the client sends requests and reads
responses. A "server end" is the other endpoint where the server reads requests
and sends responses.

### Comparison to capOS

capOS currently uses shared submission/completion rings with Endpoint objects
for cross-process CALL/RECV/RETURN routing. Same-process capabilities dispatch
directly through the holder's table; cross-process Endpoint calls queue to the
server ring and can trigger a direct IPC handoff when the receiver is blocked.

| Aspect | Zircon Channels | capOS |
|---|---|---|
| Topology | Point-to-point, 2 endpoints | Endpoint-routed capability calls |
| Async | Non-blocking read/write + signal waits | Shared SQ/CQ rings |
| Handle/cap transfer | Embedded in messages | Sideband transfer descriptors |
| Message format | Raw bytes + handles | Cap'n Proto serialized |
| Size limits | 64 KiB data, 64 handles | 64 KiB params (current limit) |
| Buffering | Kernel-side message queue | Endpoint queues plus per-process rings |

**Recommendations for capOS:**

1. **Capability transfer alongside capnp messages.** Zircon embeds handles as
   out-of-band data alongside message bytes. capOS has adopted the same
   separation with ring sideband transfer descriptors and result-cap records.
   That keeps the kernel from parsing arbitrary Cap'n Proto payload graphs.

2. **Two-endpoint channels vs. Endpoint calls.** Zircon's channels are
   general-purpose pipes. capOS uses a lighter Endpoint CALL/RECV/RETURN model
   where a capability invocation is routed to the serving process rather than
   requiring a channel object per connection.

3. **Message size limits.** Zircon's 64 KiB limit has been a pain point
   (large data must go through VMOs). capOS's capnp messages naturally handle
   this because large data can be a separate VMO-like capability referenced
   in the message. Keep the per-message limit reasonable (64 KiB is a good
   default) and use capability references for bulk data.


## 3. VMARs and VMOs

### Virtual Memory Objects (VMOs)

A VMO is a kernel object representing a contiguous region of virtual memory
that can be mapped into address spaces. VMOs are the fundamental unit of
memory in Zircon.

**Types:**
- **Paged VMO**: Backed by the page fault handler. Pages are allocated on
  demand. This is the default.
- **Physical VMO**: Backed by a specific contiguous range of physical memory.
  Used for device MMIO.
- **Contiguous VMO**: Like a paged VMO but guarantees physically contiguous
  pages. Used for DMA.

**Key operations:**
- `zx_vmo_create(size, options) -> handle`: Create a paged VMO.
- `zx_vmo_read(handle, buffer, offset, length)`: Read bytes from a VMO.
- `zx_vmo_write(handle, buffer, offset, length)`: Write bytes to a VMO.
- `zx_vmo_get_size()` / `zx_vmo_set_size()`: Query/resize.
- `zx_vmo_op_range()`: Operations like commit (force-allocate pages),
  decommit (release pages back to system), cache ops.

VMOs can be read/written directly via syscalls without mapping them. This is
useful for small transfers but less efficient than mapping for large data.

### Copy-on-Write (CoW) Cloning

`zx_vmo_create_child(handle, options, offset, size) -> child_handle`

Creates a child VMO that is a **CoW clone** of a range within the parent.
Several clone types exist:

- **Snapshot** (`ZX_VMO_CHILD_SNAPSHOT`): Point-in-time snapshot. Both parent
  and child see CoW pages. Writes to either side trigger page copies. The child
  is fully independent after creation -- closing the parent does not affect
  committed pages in the child.

- **Slice** (`ZX_VMO_CHILD_SLICE`): A window into the parent. No CoW --
  writes to the slice are visible through the parent and vice versa. The child
  cannot outlive the parent.

- **Snapshot-at-least-on-write** (`ZX_VMO_CHILD_SNAPSHOT_AT_LEAST_ON_WRITE`):
  A weaker guarantee than snapshot. A page is only guaranteed to diverge once
  the child writes it; until then, later parent writes to that page may or may
  not be visible in the child. This lets the implementation pick the cheapest
  available cloning strategy.

CoW cloning is central to how Fuchsia implements `fork()`-like semantics for
memory (though Fuchsia doesn't have `fork()`) and how it shares immutable data
(e.g., shared libraries are CoW-cloned VMOs).
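
The snapshot behaviour, where pages are shared until one side writes, can be demonstrated in userspace with reference-counted pages. This is a toy model, not how Zircon tracks pages: the `Vmo` type here is hypothetical, and `Rc::make_mut` stands in for the page-fault path that copies a shared page on first write.

```rust
use std::rc::Rc;

pub const PAGE_SIZE: usize = 4096;
type Page = Rc<[u8; PAGE_SIZE]>;

/// Toy VMO: a vector of reference-counted pages.
pub struct Vmo {
    pages: Vec<Page>,
}

impl Vmo {
    pub fn new(num_pages: usize) -> Self {
        Vmo {
            pages: (0..num_pages).map(|_| Rc::new([0u8; PAGE_SIZE])).collect(),
        }
    }

    /// Snapshot child: share every page; nothing is copied yet.
    pub fn snapshot(&self) -> Vmo {
        Vmo { pages: self.pages.clone() }
    }

    pub fn read(&self, offset: usize) -> u8 {
        self.pages[offset / PAGE_SIZE][offset % PAGE_SIZE]
    }

    /// Write one byte; the page is copied only if it is still shared.
    pub fn write(&mut self, offset: usize, value: u8) {
        let page = Rc::make_mut(&mut self.pages[offset / PAGE_SIZE]);
        page[offset % PAGE_SIZE] = value;
    }

    /// True if page `index` is physically the same page in both VMOs.
    pub fn shares_page_with(&self, other: &Vmo, index: usize) -> bool {
        Rc::ptr_eq(&self.pages[index], &other.pages[index])
    }
}
```

Only the written page diverges; untouched pages stay shared, which is why snapshot clones of mostly-read data are cheap.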

### Virtual Memory Address Regions (VMARs)

A VMAR represents a contiguous range of virtual address space within a process.
VMARs form a **tree** rooted at the process's root VMAR, which covers the
entire user-accessible address space.

**Hierarchy:**
```
Root VMAR (entire user address space)
  +-- Sub-VMAR A (e.g., 0x1000..0x10000)
  |     +-- Mapping of VMO X at offset 0x1000
  |     +-- Sub-VMAR B (0x5000..0x8000)
  |           +-- Mapping of VMO Y at offset 0x5000
  +-- Sub-VMAR C (0x20000..0x30000)
        +-- Mapping of VMO Z at offset 0x20000
```

**Key operations:**
- `zx_vmar_map(vmar, options, offset, vmo, vmo_offset, len) -> addr`:
  Map a VMO (or a range of it) into the VMAR at a specific offset or let
  the kernel choose (ASLR).
- `zx_vmar_unmap(vmar, addr, len)`: Remove a mapping.
- `zx_vmar_protect(vmar, options, addr, len)`: Change permissions
  (read/write/execute) on a mapped range.
- `zx_vmar_allocate(vmar, options, offset, size) -> child_vmar, addr`:
  Create a sub-VMAR.
- `zx_vmar_destroy(vmar)`: Recursively unmap everything and destroy all
  sub-VMARs. Prevents new mappings.

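**ASLR**: Zircon implements address space layout randomization through VMARs.
When the caller does not request a specific address (`ZX_VM_SPECIFIC` is not
set), the kernel randomizes placement within the VMAR;
`ZX_VM_OFFSET_IS_UPPER_LIMIT` turns the offset argument into an upper bound
on that placement.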

**Permissions**: Mapping permissions (R/W/X) are constrained by the VMO
handle's rights. A VMO handle without `ZX_RIGHT_EXECUTE` cannot be mapped
as executable, regardless of what the `zx_vmar_map()` call requests.

### Why VMARs Matter

VMARs provide:
1. **Sandboxing within a process.** A component can be given a sub-VMAR
   handle instead of the root VMAR, limiting where it can map memory.
2. **Hierarchical cleanup.** Destroying a VMAR recursively unmaps everything
   beneath it.
3. **Controlled mapping.** The parent decides the address space layout for
   child components by allocating sub-VMARs and passing only sub-VMAR handles.

### Comparison to capOS

capOS currently has `AddressSpace` plus a `VirtualMemory` capability for
anonymous map/unmap/protect operations. `FrameAllocator` returns typed
`MemoryObject` ownership caps rather than raw physical frame grants, but
`MemoryObject` does not yet provide mapping, cloning, or zero-copy sharing.

| Aspect | Zircon | capOS (current) |
|---|---|---|
| Memory objects | VMO (paged, physical, contiguous) | Owned MemoryObject caps plus anonymous VirtualMemory mappings |
| CoW | VMO child clones (snapshot, slice) | Not implemented |
| Address space | VMAR tree | Flat AddressSpace plus VirtualMemory cap |
| Sharing | Map same VMO in multiple processes | Not implemented |
| Permissions | Per-mapping + per-handle rights | Per-page flags at mapping time |

**Recommendations for capOS:**

1. **VMO-equivalent capability.** A "MemoryObject" capability that represents
   a range of memory (backed by demand-paging or physical pages). This becomes
   the unit of sharing: pass a MemoryObject cap through IPC, and the receiver
   maps it into their address space. Define it in `schema/capos.capnp`.

2. **Sub-VMAR capabilities for sandboxing.** When spawning a process,
   instead of granting access to the full address space, grant a sub-region
   capability. This limits where the process can map memory.

3. **CoW cloning is valuable but not urgent.** The primary use case (shared
   libraries, fork) may not apply to capOS's early stages. Design the VMO
   interface to support cloning later.

4. **VMO read/write without mapping.** Zircon allows reading/writing VMO
   contents via syscall without mapping. This is useful for small IPC data
   and avoids TLB pressure. Consider supporting this in capOS's MemoryObject.


## 4. Async Model (Ports)

### Overview

Zircon's async I/O model is built around **ports** -- kernel objects that
receive event packets. A port is similar to Linux's `epoll`, but it queues
packets tagged with a caller-chosen key instead of reporting ready file
descriptors, and userspace can enqueue its own packets. It is the foundation
for all async programming in Fuchsia.

### Port Basics

A port is a kernel object with a queue of **packets** (`zx_port_packet_t`).
Packets arrive either from signal-based waits or from direct user queuing.

**Key operations:**
- `zx_port_create(options) -> handle`: Create a port.
- `zx_port_wait(port, deadline) -> packet`: Dequeue the next packet, blocking
  until one is available or the deadline expires.
- `zx_port_queue(port, packet)`: Manually enqueue a user packet.
- `zx_port_cancel(port, source, key)`: Cancel pending waits.

### Signal-Based Async (Object Wait Async)

`zx_object_wait_async(object, port, key, signals, options)`:

This is the primary mechanism. It tells the kernel: "when `object` has any of
these `signals` asserted, deliver a packet to `port` with this `key`."

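Registrations are **one-shot**: once a packet has been queued the wait is
removed, and the user must re-register after handling it. Two trigger modes:
- **Level-triggered** (default, `options = 0`): if any of the signals is
  already asserted at registration time, a packet is queued immediately.
- **Edge-triggered** (`ZX_WAIT_ASYNC_EDGE`): signals already asserted at
  registration are ignored; the packet is queued on the next transition from
  deasserted to asserted.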

### Packet Format

```c
typedef struct zx_port_packet {
    uint64_t key;              // User-defined key (set during wait_async)
    uint32_t type;             // ZX_PKT_TYPE_SIGNAL_ONE, ZX_PKT_TYPE_USER, etc.
    zx_status_t status;        // Result status
    union {
        zx_packet_signal_t signal;   // Which signals triggered
        zx_packet_user_t user;       // User-queued packet payload (32 bytes)
        zx_packet_guest_bell_t guest_bell;
        // ... other packet types
    };
} zx_port_packet_t;
```

The `signal` variant includes `trigger` (which signals were waited on),
`observed` (current signal state), and a `count` (for edge-triggered, how many
transitions).
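
The registration-then-packet flow can be modelled with two queues. This is a hedged sketch: `Port`, `Wait`, and `Packet` are hypothetical simplifications, waits are treated as one-shot, `signal` plays the role of the kernel noticing a state change, and `poll` is a non-blocking stand-in for `zx_port_wait`.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
pub struct Packet {
    pub key: u64,
    pub trigger: u32,  // signals that were waited on
    pub observed: u32, // signal state when the packet was queued
}

struct Wait {
    object: u32,
    key: u64,
    signals: u32,
}

/// Toy port: pending one-shot waits plus a FIFO of delivered packets.
#[derive(Default)]
pub struct Port {
    waits: Vec<Wait>,
    queue: VecDeque<Packet>,
}

impl Port {
    /// Register: "when `object` asserts any of `signals`, queue a packet".
    pub fn wait_async(&mut self, object: u32, key: u64, signals: u32) {
        self.waits.push(Wait { object, key, signals });
    }

    /// Called by the "kernel" when an object's signal state changes.
    pub fn signal(&mut self, object: u32, observed: u32) {
        let mut i = 0;
        while i < self.waits.len() {
            let w = &self.waits[i];
            if w.object == object && (w.signals & observed) != 0 {
                let w = self.waits.remove(i); // one-shot: the wait is consumed
                self.queue.push_back(Packet {
                    key: w.key,
                    trigger: w.signals,
                    observed,
                });
            } else {
                i += 1;
            }
        }
    }

    /// Pop the next packet, if any.
    pub fn poll(&mut self) -> Option<Packet> {
        self.queue.pop_front()
    }
}
```

The `key` is the only thing tying a packet back to its registration, so an event loop typically uses it to index a handler table.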

### Async Dispatching (libasync)

Fuchsia's userspace `async` library (`libasync`, `async-loop`) provides a
higher-level event loop:

1. **`async::Loop`**: An event loop that owns a port and dispatches events
   to registered handlers.
2. **`async::Wait`**: Wraps `zx_object_wait_async()` with a callback. When
   the signal fires, the loop calls the handler.
3. **`async::Task`**: Runs a closure on the loop's dispatcher.
4. **FIDL bindings**: The async FIDL bindings register channel-readable waits
   on the loop's port. When a message arrives, the FIDL dispatcher decodes
   it and calls the appropriate protocol method handler.

The typical pattern:

```
loop = async::Loop()
loop.port = zx_port_create(options=0)

# Register interest in channel readability
zx_object_wait_async(channel, loop.port, key, ZX_CHANNEL_READABLE, options=0)

# Event loop
while True:
    packet = zx_port_wait(loop.port, deadline=ZX_TIME_INFINITE)
    handler = lookup(packet.key)
    handler(packet)
    # Re-register one-shot waits here
```

### Comparison to Linux io_uring

| Aspect | Zircon Ports | Linux io_uring |
|---|---|---|
| Model | Event notification (signals) | Operation submission/completion |
| Submission | No SQ; operations are separate syscalls | SQ ring: batch operations |
| Completion | Port packet queue | CQ ring in shared memory |
| Kernel transitions | One per wait_async + one per port_wait | One per io_uring_enter (batched) |
| Memory sharing | No shared ring buffers | SQ/CQ are mmap'd shared memory |
| Zero-copy | Not for port packets | Registered buffers, fixed files |
| Batching | No inherent batching | Core design: submit N ops, one syscall |
| Chaining | Not supported | SQE linking (`IOSQE_IO_LINK`, sequential chains) |
| Scope | Signal notification only | Full I/O operations (read, write, send, recv, fsync, ...) |

Key differences:

1. **Ports are notification-based; io_uring is operation-based.** A port tells
   you "something happened" (a signal was asserted), then you do separate
   syscalls to act on it (read the channel, accept the socket, etc.).
   io_uring lets you submit the actual I/O operation and the kernel does it
   asynchronously, returning the result in the completion ring.

2. **io_uring avoids syscalls for submission.** The submission queue is shared
   memory -- userspace writes SQEs and the kernel reads them without a syscall
   (in polling mode) or with a single `io_uring_enter()` for a batch of
   operations. Ports require a syscall per `wait_async` registration.

3. **io_uring supports chaining.** SQE linking allows dependent operations
   (e.g., "read from file, then write to socket") without returning to
   userspace between steps.

4. **Ports are simpler.** The signal model is straightforward and composes
   well with Zircon's object model. io_uring's complexity (dozens of opcodes,
   registered buffers, fixed files, kernel-side polling) is much higher.

### Performance Tradeoffs

Ports:
- **Pro**: Simple, well-integrated with kernel object model, easy to reason about.
- **Con**: Extra syscalls per operation (wait_async to register, port_wait to
  receive, then the actual operation syscall). At least 3 syscalls per async
  operation.

io_uring:
- **Pro**: Can batch many operations in a single syscall. Shared-memory rings
  avoid copies. Kernel-side polling can eliminate syscalls entirely.
- **Con**: Complex API surface, security attack surface (many kernel bugs have
  been in io_uring), complex state management.

### Comparison to capOS's Planned Async Rings

capOS plans io_uring-inspired capability rings: an SQ where userspace submits
capnp-serialized capability invocations and a CQ where the kernel posts
completions.

| Aspect | Zircon Ports | capOS Planned Rings |
|---|---|---|
| Submission | Separate syscalls | SQ in shared memory |
| Completion | Port packet queue (kernel-owned) | CQ in shared memory |
| Operation scope | Signal notification only | Full capability invocations |
| Batching | None | Natural (fill SQ, single syscall) |
| Wire format | Fixed packet struct | Cap'n Proto messages |

**Recommendations for capOS:**

1. **The io_uring model is better than ports for capOS's use case.** Since
   every operation in capOS is a capability invocation (not just a signal
   notification), putting the full operation in the submission ring eliminates
   the extra round-trip that ports require. This is the right choice.

2. **Keep a signal/notification mechanism too.** Even with async rings, capOS
   needs a way to wait for events (e.g., "data available on this channel",
   "process exited"). Consider a simple signal/wait mechanism alongside the
   capability rings -- perhaps signal delivery goes through the CQ as a special
   completion type.

3. **Study io_uring's SQE linking.** Chaining dependent capability calls
   (e.g., "read from FileStore, then write to Console") without returning to
   userspace is powerful. This maps naturally to Cap'n Proto promise
   pipelining: "call method A on cap X, then call method B on the result's
   capability" -- the kernel can chain these internally.

4. **Registered/fixed capabilities.** io_uring has "fixed files" (registered
   fd set for faster lookup). capOS could have a "hot set" of capabilities
   pinned in the SQ context for faster dispatch (avoid per-call table lookup).

5. **Completion ordering.** io_uring completions can arrive out of order.
   capOS's CQ should also support out-of-order completion (each SQE has a
   user_data tag echoed in the CQE) to enable true async pipelining.
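
The `user_data` echo from recommendation 5 can be shown with a toy ring pair. Everything here is hypothetical: the `Sqe` and `Cqe` layouts, the `cost` field used to force out-of-order completion, and the use of `VecDeque` in place of real shared-memory rings with head and tail indices.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Sqe {
    pub user_data: u64, // opaque tag chosen by userspace
    pub cap_id: u32,    // capability to invoke
    pub cost: u32,      // pretend duration, only used to reorder completions
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Cqe {
    pub user_data: u64, // echoed from the matching SQE
    pub result: i32,
}

/// Bounded queues standing in for shared-memory submission/completion rings.
pub struct Rings {
    sq: VecDeque<Sqe>,
    cq: VecDeque<Cqe>,
    capacity: usize,
}

impl Rings {
    pub fn new(capacity: usize) -> Self {
        Rings { sq: VecDeque::new(), cq: VecDeque::new(), capacity }
    }

    /// Userspace side: queue an entry without entering the kernel.
    pub fn submit(&mut self, sqe: Sqe) -> Result<(), Sqe> {
        if self.sq.len() == self.capacity {
            return Err(sqe); // ring full, caller must enter or back off
        }
        self.sq.push_back(sqe);
        Ok(())
    }

    /// "Kernel" side: drain the whole SQ in one call and post completions.
    /// Cheaper operations finish first, so CQ order differs from SQ order.
    pub fn enter(&mut self) -> usize {
        let mut batch: Vec<Sqe> = self.sq.drain(..).collect();
        batch.sort_by_key(|s| s.cost);
        let n = batch.len();
        for s in batch {
            self.cq.push_back(Cqe { user_data: s.user_data, result: 0 });
        }
        n
    }

    /// Userspace side: reap one completion and match it by `user_data`.
    pub fn reap(&mut self) -> Option<Cqe> {
        self.cq.pop_front()
    }
}
```

Because completions carry the tag rather than a position, userspace never depends on CQ order, which leaves the kernel free to complete work as it becomes ready.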


## 5. FIDL (Fuchsia Interface Definition Language)

### Overview

FIDL is Fuchsia's IDL for defining protocols that communicate over channels.
It serves a similar role to Cap'n Proto schemas in capOS: defining the contract
between client and server.

### FIDL vs. Cap'n Proto: Schema Language

**FIDL example:**
```fidl
library fuchsia.example;

type Color = strict enum : uint32 {
    RED = 1;
    GREEN = 2;
    BLUE = 3;
};

protocol Painter {
    SetColor(struct { color Color; }) -> ();
    DrawLine(struct { x0 float32; y0 float32; x1 float32; y1 float32; }) -> ();
    -> OnPaintComplete(struct { num_pixels uint64; });
};
```

**Equivalent Cap'n Proto:**
```capnp
enum Color { red @0; green @1; blue @2; }

interface Painter {
    setColor @0 (color :Color) -> ();
    drawLine @1 (x0 :Float32, y0 :Float32, x1 :Float32, y1 :Float32) -> ();
}
```

Key differences in the schema language:

| Feature | FIDL | Cap'n Proto |
|---|---|---|
| Unions | `flexible union`, `strict union` | Anonymous unions in structs |
| Enums | `strict enum`, `flexible enum` | `enum` (always strict) |
| Optionality | `box<T>`, nullable types | Default values, `union` with Void |
| Evolution | `flexible` keyword for forward compat | Field numbering, `@N` ordinals |
| Tables | `table` (like protobuf, sparse) | `struct` with default values |
| Events | `-> EventName(...)` server-sent | No built-in events |
| Error syntax | `-> () error uint32` | Must encode in return struct |
| Capability types | `client_end:P`, `server_end:P` | `interface P` as field type |

FIDL's `table` type is analogous to Cap'n Proto structs in terms of
evolvability (can add fields without breaking), but Cap'n Proto structs are
more compact on the wire (fixed-size inline section + pointers) while FIDL
tables use an envelope-based encoding.

### Wire Format Comparison

**FIDL wire format:**
- Little-endian, 8-byte aligned.
- Messages have a 16-byte header: `txid` (4 bytes), flags (3 bytes),
  magic byte (`0x01`), ordinal (8 bytes).
- Structs are laid out inline with natural alignment and explicit padding.
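- Out-of-line data is located by presence markers rather than stored offsets:
  strings and vectors carry a count plus a present/absent marker, while tables
  and unions wrap each member in an "envelope" (inline 8-byte entry: 4 bytes
  num_bytes, 2 bytes num_handles, 2 bytes flags).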
- **Handles are out-of-band.** The wire format contains `FIDL_HANDLE_PRESENT`
  (0xFFFFFFFF) or `FIDL_HANDLE_ABSENT` (0x00000000) markers where handles
  appear. The actual handles are in the channel message's handle array,
  consumed in order of appearance in the linearized message.
- Encoding is done into a contiguous byte buffer + a separate handle array,
  matching the channel write API.
- **No pointer arithmetic.** FIDL v2 uses a "depth-first traversal order"
  encoding where out-of-line objects are laid out sequentially. Offsets are
  not stored; the decoder walks the type schema to find boundaries.
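
The 16-byte header layout listed above is simple enough to encode by hand. The sketch below follows only the field sizes and order given in this section; the `Header` struct is hypothetical, the three flag bytes are treated as opaque, and no real FIDL binding is used.

```rust
/// 16-byte message header: txid, 3 flag bytes, magic byte, ordinal.
#[derive(Debug, PartialEq)]
pub struct Header {
    pub txid: u32,
    pub flags: [u8; 3],
    pub magic: u8,
    pub ordinal: u64,
}

pub const MAGIC: u8 = 0x01;
pub const HEADER_LEN: usize = 16;

impl Header {
    /// Serialize in little-endian order.
    pub fn encode(&self) -> [u8; HEADER_LEN] {
        let mut out = [0u8; HEADER_LEN];
        out[0..4].copy_from_slice(&self.txid.to_le_bytes());
        out[4..7].copy_from_slice(&self.flags);
        out[7] = self.magic;
        out[8..16].copy_from_slice(&self.ordinal.to_le_bytes());
        out
    }

    /// Parse a header, rejecting short buffers and unknown magic values.
    pub fn decode(buf: &[u8]) -> Option<Header> {
        if buf.len() < HEADER_LEN || buf[7] != MAGIC {
            return None;
        }
        let mut ordinal = [0u8; 8];
        ordinal.copy_from_slice(&buf[8..16]);
        Some(Header {
            txid: u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]),
            flags: [buf[4], buf[5], buf[6]],
            magic: buf[7],
            ordinal: u64::from_le_bytes(ordinal),
        })
    }
}
```

Because the header has a fixed size and position, a dispatcher can route on `ordinal` and match replies on `txid` before decoding any of the payload.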

**Cap'n Proto wire format:**
- Little-endian, 8-byte aligned (word-based).
- Messages have a segment table header listing segment sizes.
- Structs have a fixed data section + pointer section. Pointers are relative
  offsets (self-relative, in words).
- Uses pointer-based random access: can read any field without parsing the
  entire message.
- **Capabilities are indexed.** Cap'n Proto's RPC protocol assigns capability
  table indices to interface references in messages. The actual capability
  (file descriptor, handle, etc.) is transferred out-of-band.
- Supports multi-segment messages (FIDL is always single-segment).
- Zero-copy read: can read directly from the wire buffer without
  deserialization.
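
The self-relative struct pointer is what makes random access possible, and decoding one takes only a few shifts. The sketch assumes the struct-pointer bit layout from the Cap'n Proto encoding specification (2-bit kind, 30-bit signed word offset, 16-bit data-section size, 16-bit pointer-section size); `StructPointer`, `decode_struct_pointer`, and `target_word` are hypothetical helper names, not part of any Cap'n Proto library.

```rust
#[derive(Debug, PartialEq)]
pub struct StructPointer {
    pub offset_words: i32,  // signed, relative to the word after the pointer
    pub data_words: u16,    // size of the data section, in 8-byte words
    pub pointer_words: u16, // size of the pointer section, in words
}

/// Decode one 8-byte pointer word; None for null or non-struct pointers.
pub fn decode_struct_pointer(word: [u8; 8]) -> Option<StructPointer> {
    let raw = u64::from_le_bytes(word);
    if raw == 0 {
        return None; // all-zero word is the null pointer
    }
    if (raw & 0b11) != 0 {
        return None; // low two bits select the kind; 0 means struct
    }
    let lower = raw as u32 as i32;
    Some(StructPointer {
        offset_words: lower >> 2, // arithmetic shift keeps the sign
        data_words: (raw >> 32) as u16,
        pointer_words: (raw >> 48) as u16,
    })
}

/// Word index of the struct's first data word, given where the pointer sits.
pub fn target_word(pointer_index: usize, p: &StructPointer) -> Option<usize> {
    let t = pointer_index as i64 + 1 + p.offset_words as i64;
    if t < 0 {
        None
    } else {
        Some(t as usize)
    }
}
```

A reader can jump straight from a pointer to the target struct and then to any field at a fixed offset, without touching the rest of the message. A real reader must also bounds-check the target against the segment length before dereferencing it.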

**Key wire format differences:**

| Property | FIDL | Cap'n Proto |
|---|---|---|
| Random access | No (sequential decode) | Yes (pointer-based) |
| Zero-copy read | Partial (decode-on-access for some types) | Full (read from buffer) |
| Segments | Single contiguous buffer | Multi-segment |
| Pointers | Implicit (traversal order) | Explicit (relative offsets) |
| Size overhead | Smaller (no pointer words) | Larger (pointer section) |
| Decode cost | Must validate sequentially | Can validate lazily |
| Handle/cap encoding | Presence markers + out-of-band array | Cap table indices + out-of-band |

### FIDL Capability Transfer

FIDL has first-class syntax for capability transfer in protocols:

```fidl
protocol FileSystem {
    Open(resource struct {
        path string:256;
        flags uint32;
        object server_end:File;
    }) -> ();
};

protocol File {
    Read(struct { count uint64; }) -> (struct { data vector<uint8>:MAX; });
    GetBuffer(struct { flags uint32; }) -> (resource struct { buffer zx.Handle:VMO; });
};
```

- `server_end:File` -- a channel endpoint where the server will serve the
  `File` protocol. The client creates a channel, keeps the client end, and
  sends the server end through this call.
- `client_end:File` -- a channel endpoint for a client of the `File` protocol.
- `zx.Handle:VMO` -- a handle to a specific kernel object type (VMO).
- The `resource` keyword marks types that contain handles (and thus cannot be
  copied, only moved).

The FIDL compiler tracks handle ownership: types containing handles are
"resource types" with move semantics. This is enforced at the language binding
level (e.g., in C++, resource types are move-only; in Rust, they implement
`Drop` but not `Clone`).

### Comparison to capOS's Cap'n Proto Usage

Cap'n Proto natively supports capability transfer through its `interface`
types:

```capnp
interface FileSystem {
    open @0 (path :Text, flags :UInt32) -> (file :File);
}

interface File {
    read @0 (count :UInt64) -> (data :Data);
    getBuffer @1 (flags :UInt32) -> (buffer :MemoryObject);
}
```

In standard Cap'n Proto RPC, `file :File` in the return type means "a
capability to a File interface." The RPC system assigns a capability table
index, transfers it out-of-band, and the receiver gets a live reference to
invoke further methods.

**Recommendations for capOS:**

1. **Use out-of-band capability transfer beside Cap'n Proto payloads.** Cap'n
   Proto RPC has capability descriptors indexed into a capability table, but
   capOS currently keeps kernel transfer semantics in ring sideband records so
   the kernel can treat Cap'n Proto payload bytes as opaque. Promise pipelining
   should build on that sideband result-cap namespace rather than requiring
   general payload traversal in the kernel.

2. **No need to switch to FIDL.** Cap'n Proto's wire format is superior for
   capOS's use case:
   - **Random access** means runtimes and services can inspect specific fields
     without full deserialization. The kernel should keep using bounded
     sideband metadata for transport decisions.
   - **Zero-copy read** means less allocation in userspace protocol handling.
   - **Multi-segment** messages allow avoiding large contiguous allocations.
   - **Promise pipelining** is native to Cap'n Proto RPC, aligning with
     capOS's planned async ring chaining.

3. **FIDL's `resource` keyword is worth imitating.** Mark capnp types that
   contain capabilities differently from pure-data types. This could be done
   at the schema level (Cap'n Proto already distinguishes `interface` fields)
   or as a convention. Messages known to carry no capabilities can then take
   a fast path that skips capability-transfer processing entirely.

4. **FIDL's `table` type for evolution.** Cap'n Proto structs already support
   adding fields, but capOS should be aware that FIDL tables are more
   explicitly designed for cross-version compatibility. Cap'n Proto ordinals
   must be assigned consecutively, so gaps cannot be reserved for later use.
   For system interfaces that will evolve, append new fields with the next
   ordinal and, where an existing field needs to grow, replace it with a
   group or union that contains the original field.
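
Recommendations 1 and 3 can be sketched together as a transfer routine that only looks at sideband records. This is a minimal illustration under stated assumptions: `CapTransfer`, `TransferMode`, `Message`, and the toy `CapTable` are hypothetical names, not capOS's actual ring or table layout.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical transfer modes for a sideband record.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum TransferMode {
    Copy,
    Move,
}

/// One sideband record: which sender slot to transfer, and how.
#[derive(Clone, Copy, Debug)]
pub struct CapTransfer {
    pub sender_slot: u32,
    pub mode: TransferMode,
}

/// A message as the transport sees it: opaque payload plus sideband records.
pub struct Message {
    pub payload: Vec<u8>,
    pub caps: Vec<CapTransfer>,
}

/// Toy per-process capability table: slot -> object id.
#[derive(Default)]
pub struct CapTable {
    slots: HashMap<u32, u64>,
    next: u32,
}

impl CapTable {
    pub fn insert(&mut self, object: u64) -> u32 {
        let slot = self.next;
        self.next += 1;
        self.slots.insert(slot, object);
        slot
    }

    pub fn get(&self, slot: u32) -> Option<u64> {
        self.slots.get(&slot).copied()
    }
}

/// Transfers the caps named in the sideband and never reads `msg.payload`.
/// Returns receiver-side slots in sideband order, or `None` if a record is
/// invalid, in which case neither table is modified.
pub fn transfer(msg: &Message, from: &mut CapTable, to: &mut CapTable) -> Option<Vec<u32>> {
    // Fast path: pure-data messages need no capability processing at all.
    if msg.caps.is_empty() {
        return Some(Vec::new());
    }
    // Validate everything first: each slot must exist and must not be
    // referenced again after an earlier record has moved it away.
    let mut moved = HashSet::new();
    for c in &msg.caps {
        if from.get(c.sender_slot).is_none() || moved.contains(&c.sender_slot) {
            return None;
        }
        if c.mode == TransferMode::Move {
            moved.insert(c.sender_slot);
        }
    }
    // Apply; this cannot fail after validation.
    let mut out = Vec::with_capacity(msg.caps.len());
    for c in &msg.caps {
        let object = match c.mode {
            TransferMode::Copy => from.get(c.sender_slot)?,
            TransferMode::Move => from.slots.remove(&c.sender_slot)?,
        };
        out.push(to.insert(object));
    }
    Some(out)
}
```

The payload stays a byte vector the whole way through, so the same routine works whether or not the payload is a valid Cap'n Proto message.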


## 6. Synthesis: Relevance to capOS

### Handle Model vs. Typed Capability Dispatch

Zircon's handle model is **untyped at the handle level** -- a handle is just
(object_ref, rights). The type comes from the object. All operations go through
fixed syscalls (`zx_channel_write`, `zx_vmo_read`, etc.).

capOS's model is **typed at the capability level** -- each capability
implements a Cap'n Proto interface with method dispatch. Operations go through
ring SQEs such as `CAP_OP_CALL`, with Cap'n Proto params and results carried
in userspace buffers.

Both are valid. Zircon's approach is lower overhead (no serialization for simple
operations like `vmo_read`), while capOS's approach gives uniformity (every
operation has the same wire format, enabling persistence and network
transparency).

**Hybrid recommendation:** For performance-critical operations (memory mapping,
signal waiting), consider adding "fast-path" syscalls that bypass capnp
serialization, similar to how Zircon has dedicated syscalls per object type.
The capnp path remains the general mechanism and the "canonical" interface.
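
A minimal sketch of the hybrid idea follows: one object exposes a typed fast path and a general ordinal-dispatched path that delegates to it. The method ordinals, the `CallError` type, and the two-`u64` parameter encoding are assumptions made for illustration; a real general path would decode a Cap'n Proto struct instead.

```rust
use std::convert::{TryFrom, TryInto};

/// Hypothetical method ordinals for a MemoryObject-like interface.
pub const METHOD_READ: u16 = 0;
pub const METHOD_GET_SIZE: u16 = 2;

#[derive(Debug, PartialEq)]
pub enum CallError {
    UnknownMethod,
    BadParams,
    OutOfRange,
}

pub struct MemoryObject {
    bytes: Vec<u8>,
}

impl MemoryObject {
    pub fn new(bytes: Vec<u8>) -> Self {
        MemoryObject { bytes }
    }

    /// Fast path: typed arguments, no parameter encoding or copying.
    pub fn read_fast(&self, offset: u64, count: u64) -> Result<&[u8], CallError> {
        let start = usize::try_from(offset).map_err(|_| CallError::OutOfRange)?;
        let len = usize::try_from(count).map_err(|_| CallError::OutOfRange)?;
        let end = start.checked_add(len).ok_or(CallError::OutOfRange)?;
        self.bytes.get(start..end).ok_or(CallError::OutOfRange)
    }

    /// General path: method ordinal plus encoded params and results.
    /// Params for READ are two little-endian u64 values (offset, count),
    /// a stand-in for a serialized capnp params struct.
    pub fn call(&self, method: u16, params: &[u8]) -> Result<Vec<u8>, CallError> {
        match method {
            METHOD_READ => {
                if params.len() != 16 {
                    return Err(CallError::BadParams);
                }
                let offset = u64::from_le_bytes(params[0..8].try_into().unwrap());
                let count = u64::from_le_bytes(params[8..16].try_into().unwrap());
                // Same semantics as the fast path, plus one result copy.
                Ok(self.read_fast(offset, count)?.to_vec())
            }
            METHOD_GET_SIZE => Ok((self.bytes.len() as u64).to_le_bytes().to_vec()),
            _ => Err(CallError::UnknownMethod),
        }
    }
}
```

Routing the general path through the fast path keeps one implementation of the bounds checks, so the two entry points cannot drift apart in behavior.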

### Async Rings vs. Ports: The Right Call

capOS's io_uring-inspired async rings are a better fit than Zircon's port model
for a capability OS:

1. Ports require separate syscalls for registration, waiting, and the actual
   operation. Async rings batch everything.
2. Cap'n Proto's promise pipelining maps naturally to SQE chaining.
3. The shared-memory ring design avoids kernel-side queuing overhead.

However, learn from ports:
- The signal model (each object has a signal set, watchers are notified) is
  clean and composable. Consider making "wait for signal" a CQ event type.
- `zx_port_queue()` (user-initiated packets) is useful for waking up event
  loops from user code. Support user-initiated CQ entries.
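
The two lessons from ports can be sketched as extra completion-queue event kinds. This is a toy model, not capOS's ring layout: `CqEvent`, `CompletionQueue`, `SignalWait`, and `on_signals_changed` are hypothetical names, and a bounded `VecDeque` stands in for the shared-memory ring.

```rust
use std::collections::VecDeque;

/// Hypothetical CQ entry kinds.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum CqEvent {
    /// Result of a submitted operation.
    OpComplete { user_data: u64, result: i32 },
    /// A registered signal wait matched; `observed` holds the matching bits.
    Signal { user_data: u64, observed: u32 },
    /// Posted by userspace itself, analogous to `zx_port_queue()`.
    User { user_data: u64 },
}

/// Bounded queue standing in for the completion ring.
pub struct CompletionQueue {
    entries: VecDeque<CqEvent>,
    capacity: usize,
}

impl CompletionQueue {
    pub fn new(capacity: usize) -> Self {
        CompletionQueue { entries: VecDeque::new(), capacity }
    }

    /// Returns false (and drops nothing already queued) when the queue is full.
    pub fn post(&mut self, ev: CqEvent) -> bool {
        if self.entries.len() >= self.capacity {
            return false;
        }
        self.entries.push_back(ev);
        true
    }

    pub fn pop(&mut self) -> Option<CqEvent> {
        self.entries.pop_front()
    }
}

/// A registered wait: which signal bits the waiter cares about.
pub struct SignalWait {
    pub mask: u32,
    pub user_data: u64,
}

/// Called when an object's signal set changes. Posts a `Signal` entry if any
/// waited-for bit is set; returns whether an entry was queued.
pub fn on_signals_changed(cq: &mut CompletionQueue, wait: &SignalWait, signals: u32) -> bool {
    let observed = signals & wait.mask;
    observed != 0 && cq.post(CqEvent::Signal { user_data: wait.user_data, observed })
}
```

With this shape an event loop drains a single queue and matches on the event kind, whether the wake-up came from an operation, a signal, or its own code.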

### VMO/VMAR vs. capOS Memory Model

capOS should implement VMO-equivalent capabilities after the current Endpoint
and transfer baseline:
- IPC already has shared rings, but bulk data still needs explicit shared
  memory objects.
- Capability transfer of memory regions (passing a MemoryObject cap through
  IPC) is the standard pattern for bulk data transfer.
- CoW cloning enables efficient process creation.

Proposed capability interfaces:

```capnp
interface MemoryObject {
    read @0 (offset :UInt64, count :UInt64) -> (data :Data);
    write @1 (offset :UInt64, data :Data) -> ();
    getSize @2 () -> (size :UInt64);
    setSize @3 (size :UInt64) -> ();
    createChild @4 (offset :UInt64, size :UInt64, options :UInt32) -> (child :MemoryObject);
}

interface AddressRegion {
    map @0 (offset :UInt64, vmo :MemoryObject, vmoOffset :UInt64, len :UInt64, flags :UInt32) -> (addr :UInt64);
    unmap @1 (addr :UInt64, len :UInt64) -> ();
    protect @2 (addr :UInt64, len :UInt64, flags :UInt32) -> ();
    allocateSubRegion @3 (offset :UInt64, size :UInt64) -> (region :AddressRegion, addr :UInt64);
}
```
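
To make the `AddressRegion` contract more concrete, here is a toy Rust model of its bookkeeping: mappings must fall inside the region and must not overlap. It is a sketch under simplifying assumptions: the backing `MemoryObject`, protection flags, and sub-regions are omitted, `unmap` only accepts an exact prior mapping, and `MapError` is a hypothetical error type.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
pub enum MapError {
    OutOfRegion,
    Overlap,
    NotMapped,
}

/// Toy address region covering [base, base + size).
pub struct AddressRegion {
    base: u64,
    size: u64,
    mappings: BTreeMap<u64, u64>, // start address -> length
}

impl AddressRegion {
    /// Returns `None` if the region would wrap around the address space.
    pub fn new(base: u64, size: u64) -> Option<Self> {
        base.checked_add(size)?;
        Some(AddressRegion { base, size, mappings: BTreeMap::new() })
    }

    /// Maps `len` bytes at `offset` from the region base; returns the address.
    pub fn map(&mut self, offset: u64, len: u64) -> Result<u64, MapError> {
        let end_offset = offset.checked_add(len).ok_or(MapError::OutOfRegion)?;
        if len == 0 || end_offset > self.size {
            return Err(MapError::OutOfRegion);
        }
        // Cannot overflow: end_offset <= size and base + size was checked.
        let start = self.base + offset;
        let end = start + len;
        // Two half-open ranges overlap iff each starts before the other ends.
        if self.mappings.iter().any(|(&s, &l)| s < end && s + l > start) {
            return Err(MapError::Overlap);
        }
        self.mappings.insert(start, len);
        Ok(start)
    }

    /// Removes a mapping; simplified to require the exact (addr, len) pair.
    pub fn unmap(&mut self, addr: u64, len: u64) -> Result<(), MapError> {
        if self.mappings.get(&addr).copied() != Some(len) {
            return Err(MapError::NotMapped);
        }
        self.mappings.remove(&addr);
        Ok(())
    }
}
```

Because every address is derived from an offset checked against `size`, a holder of this object cannot create mappings outside its region, which is the sandboxing property sub-regions rely on.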

### FIDL vs. Cap'n Proto: Stay with Cap'n Proto

Cap'n Proto is the right choice for capOS. The advantages over FIDL:

1. **Platform-independent ecosystem.** FIDL is tied to Fuchsia. Cap'n Proto
   has implementations in C++, Rust, Go, Python, Java, etc.
2. **Zero-copy random access.** Runtimes and services can inspect message
   fields without full deserialization, while the kernel keeps treating
   payload bytes as opaque.
3. **Promise pipelining.** Native to capnp-rpc, enabling the async ring
   chaining that capOS plans.
4. **Persistence.** Cap'n Proto uses the same encoding in memory, on the wire,
   and on disk, so messages can be stored and later read back with their
   schema without a conversion step -- important for capOS's planned
   capability persistence.

The one thing FIDL does better: tight integration of handle/capability metadata
in the type system (the `resource` keyword, `client_end`/`server_end` syntax,
handle type constraints). capOS should ensure its capnp schemas clearly
distinguish capability-carrying types and that the kernel enforces capability
transfer semantics.

### Concrete Action Items for capOS

Ordered by priority and dependency:

1. **Keep the typed-interface authority model**. Do not add a Zircon-style
   generic rights bitmask until a concrete method-attenuation need beats
   narrow wrapper capabilities and transfer-mode metadata.

2. **Handle generation counters**. Done: upper bits of `CapId` detect stale
   references.

3. **Design MemoryObject/SharedBuffer capability**. Define and implement the
   shared-memory object that replaces raw-frame transfer for bulk IPC.

4. **Design AddressRegion capability** (Stage 5). Sub-VMAR-like sandboxing.
   The root AddressRegion capability (the analogue of Zircon's root VMAR
   handle) is part of the initial capability set.

5. **Capability transfer sideband**. Baseline CALL/RETURN copy and move
   transfer is implemented; promise-pipelined result-cap mapping still needs a
   precise rule before pipeline dispatch lands.

6. **Async rings with signal delivery**. SQ/CQ capability rings are
   implemented for transport; notification objects and promise pipelining
   remain future work.

7. **User-queued CQ entries** (with async rings). Allow userspace to post
   wake-up events to its own CQ, enabling pure-userspace event loop
   integration.
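
For item 2, the stale-reference check can be sketched as follows. The 8-bit generation / 24-bit index split and the `CapTable` shape are assumptions chosen for the example; the source only states that the upper bits of `CapId` carry the generation.

```rust
/// Assumed layout: upper 8 bits = generation, lower 24 bits = slot index.
const INDEX_BITS: u32 = 24;
const INDEX_MASK: u32 = (1 << INDEX_BITS) - 1;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CapId(u32);

impl CapId {
    fn new(index: u32, generation: u8) -> Self {
        CapId(((generation as u32) << INDEX_BITS) | (index & INDEX_MASK))
    }

    pub fn index(self) -> usize {
        (self.0 & INDEX_MASK) as usize
    }

    pub fn generation(self) -> u8 {
        (self.0 >> INDEX_BITS) as u8
    }
}

struct Slot {
    generation: u8,
    object: Option<u64>, // toy object id; `None` means the slot is free
}

#[derive(Default)]
pub struct CapTable {
    slots: Vec<Slot>,
}

impl CapTable {
    pub fn insert(&mut self, object: u64) -> CapId {
        // Reuse a free slot if there is one, keeping its bumped generation.
        if let Some(i) = self.slots.iter().position(|s| s.object.is_none()) {
            self.slots[i].object = Some(object);
            return CapId::new(i as u32, self.slots[i].generation);
        }
        self.slots.push(Slot { generation: 0, object: Some(object) });
        CapId::new((self.slots.len() - 1) as u32, 0)
    }

    /// Returns the object only if the id's generation matches the slot's.
    pub fn lookup(&self, id: CapId) -> Option<u64> {
        let slot = self.slots.get(id.index())?;
        if slot.generation == id.generation() {
            slot.object
        } else {
            None
        }
    }

    pub fn remove(&mut self, id: CapId) -> Option<u64> {
        let slot = self.slots.get_mut(id.index())?;
        if slot.generation != id.generation() {
            return None;
        }
        let object = slot.object.take()?;
        // Bump the generation so ids issued for the old occupant go stale.
        // An 8-bit counter wraps after 256 reuses of the same slot.
        slot.generation = slot.generation.wrapping_add(1);
        Some(object)
    }
}
```

A wider generation field lowers the chance that a wrapped counter makes an old id valid again, at the cost of fewer index bits.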


## Appendix: Key Zircon Syscall Reference

For reference, the most architecturally significant Zircon syscalls:

| Syscall | Purpose |
|---|---|
| `zx_handle_close` | Close a handle |
| `zx_handle_duplicate` | Duplicate with rights reduction |
| `zx_handle_replace` | Atomic replace with new rights |
| `zx_channel_create` | Create channel pair |
| `zx_channel_read` | Read message + handles from channel |
| `zx_channel_write` | Write message + handles to channel |
| `zx_channel_call` | Synchronous write-then-read (RPC) |
| `zx_port_create` | Create async port |
| `zx_port_wait` | Wait for next packet |
| `zx_port_queue` | Enqueue user packet |
| `zx_object_wait_async` | Register signal wait on port |
| `zx_object_wait_one` | Synchronous wait on one object |
| `zx_vmo_create` | Create virtual memory object |
| `zx_vmo_read` / `zx_vmo_write` | Direct VMO access |
| `zx_vmo_create_child` | Create child VMO (CoW snapshot or slice) |
| `zx_vmar_map` | Map VMO into address region |
| `zx_vmar_unmap` | Unmap |
| `zx_vmar_allocate` | Create sub-VMAR |
| `zx_process_create` | Create process (with root VMAR) |
| `zx_process_start` | Start process execution |

## Used By

- [Capability Model](../capability-model.md) for the comparison with
  generation-tagged flat cap tables and rights-bit alternatives.
- [Memory Management](../architecture/memory.md) for VMO/VMAR-style separation
  between object backing and virtual address-space mappings.
- [Go VirtualMemory Contract](../backlog/go-virtual-memory-contract.md) for
  commit/decommit and reservation precedent.
