# Proposal: Storage, Naming, and Persistence

What replaces the filesystem in a capability OS where Cap'n Proto is the
universal wire format.


## The Problem with Filesystems

In Unix, the filesystem is the universal namespace. Everything is a path:
`/dev/sda`, `/etc/config`, `/proc/self/fd/3`, `/run/dbus/system_bus_socket`.
Paths are ambient authority — any process can open `/etc/passwd` if the
permission bits allow. The filesystem conflates naming, access control,
persistence, and device abstraction into one mechanism.

capOS has capabilities instead of paths. Access control is structural (you
can only use what you were granted), not advisory (permission bits checked at
open time). This means:

- No global namespace needed — each process sees only its granted caps
- No path-based access control — the cap IS the access
- No distinction between "file", "device", "socket" — everything is a typed
  capability interface

A traditional VFS would reintroduce ambient authority through the back door.
Instead, capOS needs a storage and naming model native to capabilities and
Cap'n Proto.

## Core Insight: Cap'n Proto Everywhere

Cap'n Proto is already used in capOS for:

- **Interface definitions** — `.capnp` schemas define capability contracts
- **IPC messages** — capability invocations are capnp messages
- **Serialization** — capnp wire format crosses process boundaries

If we extend this to storage, then:

- **Stored objects** are capnp messages
- **Configuration** is capnp structs
- **Binary images** are capnp-wrapped blobs
- **The boot manifest** is a capnp message describing the initial capability
  graph

No format conversion anywhere. The same tools (schema compiler, serializer,
validator) work for IPC, storage, config, and network transfer.

## Architecture

### Three Layers

Target architecture after the manifest executor and process-spawner work:

```
Boot Image (read-only, baked into ISO)
  │
  │  capnp-encoded manifest + binaries
  │
  v
Kernel (creates initial caps from manifest)
  │
  │  grants caps to init
  │
  v
Init (builds live capability graph)
  │
  ├──> Filesystem services (FAT, ext4 — wrap BlockDevice as Directory/File)
  │
  ├──> Store service (capability-native content-addressed storage)
  │      backed by: virtio-blk, RAM, or network
  │
  └──> All other services (receive Directory, Store, or Namespace caps)
```

### Layer 1: Boot Image

The boot image (ISO/disk) contains a capnp-encoded **system manifest** loaded
as a Limine module alongside the kernel. The manifest describes:

```capnp
struct SystemManifest {
    # Manifest schema version, validated before other fields
    schemaVersion @0 :UInt32;
    # Binaries available at boot, keyed by name
    binaries @1 :List(NamedBlob);
    # Init's config blob: first-process metadata plus service graph
    initConfig @2 :CueValue;
    # Kernel boot parameters
    kernelParams @3 :SystemConfig;
}

struct NamedBlob {
    name @0 :Text;
    data @1 :Data;
}

struct CueValue {
    union {
        null @0 :Void;
        boolean @1 :Bool;
        intValue @2 :Int64;
        uintValue @3 :UInt64;
        text @4 :Text;
        bytes @5 :Data;
        list @6 :List(CueValue);
        fields @7 :List(CueField);
    }
}

struct CueField {
    name @0 :Text;
    value @1 :CueValue;
}
```

Capability source identity is already structured in the bootstrap manifest,
so source selection does not depend on parsing authority strings:

```cue
{
    name:                "client"
    expectedInterfaceId: 0xacf0c15a7b2e0041
    source: service: {
        service: "endpoint-server"
        export:  "client"
    }
}
```

Kernel and service source objects inside `initConfig` select the authority to grant. The
`expectedInterfaceId` field carries the generated Cap'n Proto interface
`TYPE_ID` and only checks that the granted object speaks the expected schema.
It cannot replace source identity: many different objects may expose the same
interface while representing different authority.
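The distinction between source identity and interface identity can be sketched as a two-step check. This is a minimal illustration, not the capOS resolver: `Export`, `GrantError`, and `resolve_grant` are hypothetical names, and the export table is modeled as a plain slice.

```rust
/// Hypothetical model of one object a service exports (not the capOS types).
#[derive(Debug, Clone, PartialEq)]
pub struct Export {
    pub service: &'static str,
    pub export: &'static str,
    pub interface_id: u64,
}

#[derive(Debug, PartialEq)]
pub enum GrantError {
    UnknownSource,
    InterfaceMismatch { expected: u64, actual: u64 },
}

/// Step 1: the structured source (service + export name) selects the object.
/// Step 2: the expected interface id only validates the selected object.
pub fn resolve_grant<'a>(
    exports: &'a [Export],
    service: &str,
    export: &str,
    expected_interface_id: u64,
) -> Result<&'a Export, GrantError> {
    let found = exports
        .iter()
        .find(|e| e.service == service && e.export == export)
        .ok_or(GrantError::UnknownSource)?;
    if found.interface_id != expected_interface_id {
        return Err(GrantError::InterfaceMismatch {
            expected: expected_interface_id,
            actual: found.interface_id,
        });
    }
    Ok(found)
}
```

Two exports that share an interface id stay distinguishable because the lookup never keys on the id; the id can only turn a successful selection into an error.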

The build system (`Makefile`) generates this manifest from a human-authored
description and packs it into the ISO as `manifest.bin`. Current code embeds
every `SystemManifest.binaries` entry into that manifest as `NamedBlob` data,
including the release-built init and smoke-demo ELFs. The kernel now boots only
`initConfig.init`; focused init-executor manifests expose the manifest to the
separate `init` binary as a read-only `BootPackage` capability, while default
shell-led manifests boot `capos-shell` directly without a BootPackage executor.
Remaining cleanup is to narrow the long-term boot package shape after the
single-init split.

Using a `CueValue` tree instead of `AnyPointer` keeps the manifest directly
decodable in `no_std` userspace without depending on Cap'n Proto reflection.
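As a rough illustration of why a closed value tree is easy to consume, the schema union maps onto an ordinary Rust enum. The sketch below is an assumption about shape only: the enum, its `field`/`as_str` helpers, and `sample_config` are hypothetical, and it uses `std` collections where a `no_std` build would use `alloc`.

```rust
/// Hypothetical Rust mirror of the `CueValue` schema union.
#[derive(Debug, Clone, PartialEq)]
pub enum CueValue {
    Null,
    Boolean(bool),
    Int(i64),
    Uint(u64),
    Text(String),
    Bytes(Vec<u8>),
    List(Vec<CueValue>),
    Fields(Vec<(String, CueValue)>),
}

impl CueValue {
    /// Look up a named field; only the `Fields` variant carries fields.
    pub fn field(&self, name: &str) -> Option<&CueValue> {
        match self {
            CueValue::Fields(fields) => fields
                .iter()
                .find(|(n, _)| n.as_str() == name)
                .map(|(_, v)| v),
            _ => None,
        }
    }

    /// Borrow the text payload, if this value is `Text`.
    pub fn as_str(&self) -> Option<&str> {
        match self {
            CueValue::Text(s) => Some(s.as_str()),
            _ => None,
        }
    }
}

/// Illustrative config: `{ init: { binary: "init" } }`.
pub fn sample_config() -> CueValue {
    CueValue::Fields(vec![(
        "init".to_string(),
        CueValue::Fields(vec![(
            "binary".to_string(),
            CueValue::Text("init".to_string()),
        )]),
    )])
}
```

Every consumer can walk the tree with plain pattern matching; no schema reflection is involved.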

#### Transitional Schema Note

`ServiceEntry`, `CapSource::Service`, and `ServiceEntry.exports` are no longer
kernel schema fields. `ProcessSpawner`, copy/move cap transfer, focused
init-owned generic manifest execution, the default standalone-init service
graph, focused shell-led login smokes, and the 15.4 `initConfig` schema split
are implemented. The current boot manifest shape is:

```capnp
struct SystemManifest {
    # Manifest schema version, validated before other fields
    schemaVersion @0 :UInt32;
    # Binaries available at boot, keyed by name
    binaries @1 :List(NamedBlob);
    # Init's config blob (replaces the service graph)
    initConfig @2 :CueValue;
    # Kernel boot parameters (serial policy, shell MOTD, feature flags)
    kernelParams @3 :SystemConfig;
}
```

`ServiceEntry` / `CapRef` disappeared from the schema and became plain CUE
fields inside `initConfig.services`. Init reads them at runtime and calls
`ProcessSpawner` directly. `validate_manifest_graph`,
`validate_bootstrap_cap_sources`, and the remaining transitional service-graph
schema are no longer kernel bootstrap checks. They remain in `capos-config` for
mkmanifest and the focused init executor while that executor still accepts the
transitional service graph. Kernel bootstrap already uses a first-service
cap-table builder rather than the old multi-service resolver. See
`docs/proposals/service-architecture-proposal.md` — "Legacy Manifest Fields
After Stage 6" for the deprecation plan.

During the current transition, `initConfig.init` is still per-manifest launch
metadata: it selects the single boot process binary and the kernel-sourced caps
for that process. `initConfig.services`, cross-service cap sources, exports,
and restart policy are init-owned configuration for focused executor manifests.
Focused harnesses that boot a demo as init keep using that first-process cap
bundle until those smokes are migrated behind a fixed generic init.

### Layer 2: Kernel Bootstrap

Target design for the kernel's boot role:

1. Parse the system manifest (read-only capnp message from Limine module).
2. Hash the embedded binaries for optional measured-boot attestation.
3. Create kernel-provided capabilities: `Console`, `Timer`, `DeviceManager`,
   `ProcessSpawner`, `FrameAllocator`, `VirtualMemory` (per-process), and a
   read-only `BootPackage` cap exposing `SystemManifest.binaries` and
   `initConfig`.
4. Spawn init — exactly one userspace process — with that cap bundle.

Current boot has reached the single-init split and the `initConfig` schema
split. `system.cue` puts the standalone `init` binary in `initConfig.init` for
the default service-graph process; init reads `BootPackage` and starts the
shell, remote-session CapSet gateway, and resident services from
`initConfig.services`.
Focused shell-led manifests such as `system-smoke.cue` still put
`capos-shell` in `initConfig.init` for narrow login proofs. Focused
init-executor manifests such as `system-spawn.cue` also put the separate
`init` binary in `initConfig.init`; that binary reads `BootPackage` and spawns
the focused demo graph from `initConfig.services` through `ProcessSpawner`.
The unused kernel resolver has been retired. The remaining cleanup is replacing
per-manifest init bundles with a fixed generic-init bootstrap ABI.

### Layer 3: Init and the Live Capability Graph

Target init reads `initConfig` from the `BootPackage` cap and executes it:

```rust
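// `Error` is a placeholder for the runtime's error type; `BTreeMap` comes
// from `alloc::collections` in `no_std` userspace.
fn main(caps: CapSet) -> Result<(), Error> {
    let spawner = caps.get::<ProcessSpawner>("spawner");
    let boot = caps.get::<BootPackage>("boot");
    let config = boot.init_config()?;  // CueValue

    // Handles of already-started services, keyed by name, so later entries
    // can resolve caps exported by earlier ones.
    let mut running_services = BTreeMap::new();

    // Walk service entries from the config and spawn in dependency order
    for entry in config.field("services")?.iter()? {
        let binary = boot.binary(entry.field("binary")?.as_str()?)?;
        let granted = resolve_caps(entry.field("caps")?, &running_services, &caps);
        let handle = spawner.spawn(binary, granted, entry.field("restart")?.into())?;
        running_services.insert(entry.field("name")?.as_str()?.into(), handle);
    }

    supervisor_loop(&running_services);
    Ok(())
}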
```

In this target model, init is a generic manifest executor rather than a
hardcoded service graph. The system topology is defined in the boot
package's `initConfig`, not in init's source code. Changing what services
run means rebuilding the boot image with a different config blob, not
recompiling init. Manifest graph resolution stops being a kernel concern.

The current transition uses `initConfig.services` as the service graph; init
reads the BootPackage manifest, validates a metadata-only
`ManifestBootstrapPlan`, resolves kernel and service cap sources, records
exported caps, spawns children in manifest order, and waits for their
`ProcessHandle`s.

## Two Storage Models

capOS supports two complementary storage models, both exposed as typed
capabilities:

### Filesystem Capabilities (Directory, File)

For accessing traditional block-based filesystems (FAT, ext4, ISO9660) and
for POSIX compatibility. A filesystem service wraps a `BlockDevice` and
exports `Directory`/`File` capabilities.

```
BlockDevice (raw sectors)
    │
    └──> Filesystem service (FAT, ext4, ...)
              │
              ├──> Directory caps (namespace over files)
              └──> File caps (read/write byte streams)
```

This model maps naturally to USB flash drives, NVMe partitions, and
network-mounted filesystems. The `open()` and `sub()` operations return new
capabilities via IPC cap transfer (see "IPC and Capability Transfer" below).

### Capability-Native Store (Store, Namespace)

For capOS-native data: configuration, service state, content-addressed object
storage. A store service wraps a `BlockDevice` and exports `Store`/`Namespace`
capabilities.

```
BlockDevice (raw sectors)
    │
    └──> Store service
              │
              ├──> Store cap (content-addressed put/get/list inventory)
              └──> Namespace caps (mutable name→hash mappings)
```

Content-addressing provides automatic deduplication, verifiable integrity,
and immutable references. `Store.list` returns the live inventory of content
hashes in that Store, so holders that need crash/reboot recovery can rediscover
stored content without a separate mutable root pointer. Namespaces add mutable
bindings on top when callers need stable names rather than inventory scans.

### Bridging the Two Models

The models are composable. An adapter service can bridge between them:

- **FsStore adapter**: exposes a Directory tree as a content-addressed Store
  (hash each file's contents, directory listings become capnp-encoded objects)
- **StoreFS adapter**: exposes Store/Namespace as a Directory tree (each name
  maps to a File whose contents are the stored object)
- **Import/export**: a utility service reads files from a Directory and stores
  them in a Store, or materializes Store objects as files in a Directory

In every case the adapter is a userspace service holding caps to both
subsystems. No kernel mechanism is needed, only capability composition.

## File I/O Interfaces

Directory, File, Store, and Namespace caps may be scoped to a user session,
guest profile, anonymous request, or service identity, but the cap remains the
authority. POSIX ownership metadata is compatibility data inside these
services, not a system-wide authorization channel. See
[user-identity-and-policy-proposal.md](user-identity-and-policy-proposal.md).

### BlockDevice

Raw sector access, served by device drivers (virtio-blk, NVMe, USB mass
storage). The driver receives hardware capabilities (MMIO, IRQ,
FrameAllocator for DMA) and exports a `BlockDevice` cap.

```capnp
interface BlockDevice {
    readBlocks  @0 (startLba :UInt64, count :UInt32) -> (data :Data);
    writeBlocks @1 (startLba :UInt64, count :UInt32, data :Data) -> ();
    info        @2 () -> (blockSize :UInt32, blockCount :UInt64, readOnly :Bool);
    flush       @3 () -> ();
}
```

For bulk transfers, variants of `readBlocks`/`writeBlocks` (not shown in the
schema above) accept a `SharedBuffer` capability instead of inline `Data` (see
"Shared Memory for Bulk Data" below). The inline-Data variants work for
metadata reads and small operations; the SharedBuffer variants avoid copies
for large I/O.

### File

Byte-stream access to a single file. Served by filesystem services. Created
dynamically when a client calls `Directory.open()` — the filesystem service
creates a `File` CapObject for the opened file and transfers it to the
caller via IPC cap transfer.

```capnp
interface File {
    read     @0 (offset :UInt64, length :UInt32) -> (data :Data);
    write    @1 (offset :UInt64, data :Data) -> (written :UInt32);
    stat     @2 () -> (size :UInt64, created :UInt64, modified :UInt64);
    truncate @3 (length :UInt64) -> ();
    sync     @4 () -> ();
    close    @5 () -> ();
}
```

`close` releases the server-side state for this file (open cluster chain
cache, dirty buffers). The kernel-side CapTable entry is removed by the system
transport via `CAP_OP_RELEASE` when the local holder releases it; handles owned
by `capos-rt` queue local releases on final drop and expose explicit release
flushing for ordinary userspace. `CapabilityManager` is management-only
(`list()`, later `grant()`); it does not expose a `drop()` method because
ordinary handle lifetime belongs to the transport, not to an application call
on the same table that dispatches it.

Attenuation: a read-only File wraps the original and rejects `write`,
`truncate`, `sync` calls. An append-only File rejects `write` at offsets
other than the current size.
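The attenuation wrappers described above can be sketched as types that hold the original object and forward only the permitted calls. This is a minimal in-process model: the `File` trait is cut down to three methods, and `MemFile`, `ReadOnly`, `AppendOnly`, and `FileError` are hypothetical names rather than capOS types.

```rust
#[derive(Debug, PartialEq)]
pub enum FileError {
    Denied,
}

/// Reduced, hypothetical stand-in for the `File` interface.
pub trait File {
    fn read(&self, offset: u64, length: u32) -> Vec<u8>;
    fn write(&mut self, offset: u64, data: &[u8]) -> Result<u32, FileError>;
    fn size(&self) -> u64;
}

/// Unattenuated file backed by a byte vector.
pub struct MemFile {
    pub bytes: Vec<u8>,
}

impl File for MemFile {
    fn read(&self, offset: u64, length: u32) -> Vec<u8> {
        let start = (offset as usize).min(self.bytes.len());
        let end = start.saturating_add(length as usize).min(self.bytes.len());
        self.bytes[start..end].to_vec()
    }
    fn write(&mut self, offset: u64, data: &[u8]) -> Result<u32, FileError> {
        let start = offset as usize;
        let end = start + data.len();
        if self.bytes.len() < end {
            self.bytes.resize(end, 0); // zero-fill any gap
        }
        self.bytes[start..end].copy_from_slice(data);
        Ok(data.len() as u32)
    }
    fn size(&self) -> u64 {
        self.bytes.len() as u64
    }
}

/// Read-only attenuation: reads pass through, writes are rejected.
pub struct ReadOnly<F: File>(pub F);

impl<F: File> File for ReadOnly<F> {
    fn read(&self, offset: u64, length: u32) -> Vec<u8> {
        self.0.read(offset, length)
    }
    fn write(&mut self, _offset: u64, _data: &[u8]) -> Result<u32, FileError> {
        Err(FileError::Denied)
    }
    fn size(&self) -> u64 {
        self.0.size()
    }
}

/// Append-only attenuation: writes are accepted only at the current size.
pub struct AppendOnly<F: File>(pub F);

impl<F: File> File for AppendOnly<F> {
    fn read(&self, offset: u64, length: u32) -> Vec<u8> {
        self.0.read(offset, length)
    }
    fn write(&mut self, offset: u64, data: &[u8]) -> Result<u32, FileError> {
        if offset != self.0.size() {
            return Err(FileError::Denied);
        }
        self.0.write(offset, data)
    }
    fn size(&self) -> u64 {
        self.0.size()
    }
}
```

Because the wrapper owns the inner object, a holder of `ReadOnly<MemFile>` has no path back to the unattenuated `write`.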

### Directory

Namespace over files on a filesystem. Served by filesystem services.
`open()` and `sub()` return new capabilities via IPC cap transfer.

```capnp
interface Directory {
    open    @0 (name :Text, flags :UInt32) -> (file :File);
    list    @1 () -> (entries :List(DirEntry));
    mkdir   @2 (name :Text) -> (dir :Directory);
    remove  @3 (name :Text) -> ();
    sub     @4 (name :Text) -> (dir :Directory);
    create  @5 (name :Text) -> ();
    rename  @6 (from :Text, to :Text) -> ();
}

struct DirEntry {
    name  @0 :Text;
    size  @1 :UInt64;
    isDir @2 :Bool;
}
```

`sub()` returns a Directory scoped to a subdirectory — the analog of chroot.
The caller cannot traverse upward or see the parent directory. `open()` with
create flags creates a new file if it doesn't exist.

The `flags` field in `open()` is a bitmask: `CREATE = 1`, `TRUNCATE = 2`,
`APPEND = 4`. No `READ`/`WRITE` flags — those are determined by the
Directory cap's attenuation (a read-only Directory returns read-only Files).
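A small sketch of decoding that bitmask follows. The constant values come from the text above; the `OpenFlags` struct, the `decode_flags` helper, and the choice to reject unknown bits are assumptions made for illustration.

```rust
pub const CREATE: u32 = 1;
pub const TRUNCATE: u32 = 2;
pub const APPEND: u32 = 4;
const KNOWN_FLAGS: u32 = CREATE | TRUNCATE | APPEND;

/// Hypothetical decoded form of the `flags` argument to `Directory.open()`.
#[derive(Debug, PartialEq)]
pub struct OpenFlags {
    pub create: bool,
    pub truncate: bool,
    pub append: bool,
}

/// Decode the bitmask. Rejecting unknown bits is an assumption of this
/// sketch, chosen to match the document's fail-closed style.
pub fn decode_flags(flags: u32) -> Option<OpenFlags> {
    if flags & !KNOWN_FLAGS != 0 {
        return None;
    }
    Some(OpenFlags {
        create: flags & CREATE != 0,
        truncate: flags & TRUNCATE != 0,
        append: flags & APPEND != 0,
    })
}
```

Note that there is no read or write bit to decode: whether the returned `File` is writable depends on the `Directory` cap that served the call.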

#### Writable Directory Mutations and the Single-Writer Policy

`create @5` makes a new empty file and `rename @6` renames an entry within the
same parent. Both have additive ordinals so the read-only `Directory`
implementations stay wire-compatible — they simply reject the mutating methods
(`mkdir`/`remove`/`create`/`rename`) fail-closed, the way a read-only
`File` rejects `write`. Unlike `open` with `CREATE`, `create` fails closed if the
name already exists; `rename` fails closed if the source is absent or the
destination already exists, and does not support cross-directory moves.

The first writable filesystem service adopts a **fail-closed single-writer
policy**: a writable filesystem tree admits one writer at a time. The first
granted cap to perform a mutation claims the writer slot; a mutation through any
other concurrently granted cap fails closed with a typed `Failed` exception
(`"writable filesystem rejects a second concurrent writer (single-writer
policy)"`) rather than racing. There is no lease/release lifecycle — the first
writer keeps the slot — and `list`/`sub` reads are allowed for any holder. This
deliberately closes the milestone's concurrent-writer-policy decision without
expanding scope to advisory locks, lock leases, or multi-writer coordination
(see Open Question 6).

The implementation (`kernel/src/cap/writable_fs.rs`, proof
`make run-storage-writable`) is now disk-backed: it mounts a `CAPOSWF1`
sub-volume (a flat node-record array with parent pointers plus a bump-allocated
data region) over the kernel-owned virtio-blk driver, keeps the RAM tree as the
working copy, and write-through-commits every directory/file mutation in the
order data sector → node-record sector → superblock (the ordering commit point),
mirroring the disk-backed `Store`. The persistent `Store` `CAPOSST1` sub-volume
co-locates on the same disk image (at LBA 0; the filesystem superblock sits at a
fixed higher LBA), so filesystem mutations and store object writes/deletes
survive a reboot together — `make run-storage-writable` boots QEMU twice against
one combined image and phase 2 verifies every surviving name, size, content,
directory entry, and store object plus the deleted object's absence.
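The single-writer policy described above reduces to a slot that the first mutating cap claims and never gives back. The sketch below is a simplified model, not the code in `kernel/src/cap/writable_fs.rs`: `WriterSlot`, `WriterError`, and identifying holders by a numeric `CapId` are assumptions.

```rust
pub type CapId = u32;

#[derive(Debug, PartialEq)]
pub enum WriterError {
    /// A different cap already owns the writer slot.
    SecondWriter,
}

/// Filesystem-wide writer slot. There is no release method, mirroring the
/// "no lease/release lifecycle" rule.
#[derive(Default)]
pub struct WriterSlot {
    owner: Option<CapId>,
}

impl WriterSlot {
    /// Called before every mutation. The first caller claims the slot;
    /// the same cap may keep mutating; any other cap fails closed.
    pub fn claim(&mut self, cap: CapId) -> Result<(), WriterError> {
        match self.owner {
            None => {
                self.owner = Some(cap);
                Ok(())
            }
            Some(owner) if owner == cap => Ok(()),
            Some(_) => Err(WriterError::SecondWriter),
        }
    }
}
```

Read paths such as `list` and `sub` would simply never call `claim`, which is why they stay available to every holder.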

Unclean-shutdown recovery is proven by `make run-storage-writable-recovery`. A
slot becomes live on the next mount only once the superblock's bumped
`node_count` is observed, so a forced poweroff in the window between a node
record's durable write and that commit leaves an orphan slot the next mount
ignores: the interrupted allocation is atomically absent, never a torn or
half-live entry. The proof builds the kernel with the proof-only
`storage_writable_recovery` feature, which arms an induced forced poweroff in
exactly that window (`recovery_crash_after_record`); pass 1 commits durable
mutations and a `Store` survivor and then triggers the window (the harness
`kill -9`s QEMU after the kernel marker), and pass 2 re-mounts and verifies
recovery to a consistent tree with the committed state intact, the interrupted
allocation absent, no torn record, and a usable post-recovery write. The proof
is bounded to that single record-vs-commit window under host-page-cache
durability (the virtio driver negotiates no `VIRTIO_BLK_F_FLUSH`, and a
`kill -9` preserves the host page cache); it proves the superblock-commit
ordering invariant, not a general media crash-consistency guarantee against
host power loss or a lost write-back cache. The co-located `CAPOSST1` `Store`
now has bounded tombstone reclamation through `make run-storage-persist`; this
does not add a new media power-loss guarantee or reclaim writable-file extents.
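The commit rule behind that recovery proof can be modeled in a few lines: a node record is live only when the superblock's `node_count` covers its slot. This is a toy in-memory model under stated assumptions. `Disk`, `NodeRecord`, and the method names are hypothetical, and real sector writes are replaced by vector operations.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct NodeRecord {
    pub name: String,
}

/// Toy model of the on-disk node array plus the superblock counter.
pub struct Disk {
    node_count: usize,        // value stored in the superblock
    records: Vec<NodeRecord>, // node-record slots, possibly with an orphan
}

impl Disk {
    pub fn new() -> Self {
        Disk { node_count: 0, records: Vec::new() }
    }

    /// Step 1: write the node record into the next free slot. Any orphan
    /// left by an interrupted allocation is overwritten.
    pub fn write_record(&mut self, rec: NodeRecord) {
        self.records.truncate(self.node_count);
        self.records.push(rec);
    }

    /// Step 2: the commit point, publishing the bumped `node_count`.
    pub fn commit(&mut self) {
        self.node_count = self.records.len();
    }

    /// Mount view: only slots covered by the committed count are live.
    pub fn live(&self) -> &[NodeRecord] {
        let n = self.node_count.min(self.records.len());
        &self.records[..n]
    }
}
```

A crash between `write_record` and `commit` leaves the record on disk but outside `live()`, so the interrupted allocation is simply absent after remount, and the next allocation reuses its slot.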

Writable `File` content paths layer onto the same tree. `open` with the
`CREATE`/`TRUNCATE`/`APPEND` flags (or a write through the returned `File`)
claims the same filesystem-wide writer slot, so file writes obey the single
writer policy alongside directory mutations; a plain (`flags == 0`) open and the
`read`/`stat` methods are reads allowed for any holder. `write @1` overwrites or
extends at the supplied offset, zero-filling any gap; a handle opened `APPEND`
lands every write at end-of-file regardless of the offset argument. `truncate @3`
shrinks (discards the tail) or extends (zero-fills) the file, and `close @5`
releases only that handle — the file survives in the directory until
`Directory.remove`, which marks the file node so any outstanding `File` cap fails
closed. File content is bounded by `MAX_FILE_BYTES` (64 KiB) and persists to a
bump-allocated disk extent on each mutation; a rewrite that outgrows the current
extent allocates a fresh one and leaks the old (file-extent compaction deferred).

Because each `write`/`truncate` already wrote through the block device (the virtio
driver negotiates no `VIRTIO_BLK_F_FLUSH`, so there is no separate media barrier
to issue), `sync @4` succeeds as an honest write-side no-op (a read-only `File`
still rejects it). Crash consistency rests on the superblock-commit ordering
rather than a media barrier: an interrupted allocation is atomically absent on
remount (proven by `make run-storage-writable-recovery`, above). A post-write
media-durability flush against a write-back cache (for host power loss, not the
guest-side forced poweroff that proof exercises) remains future hardening, not
claimed here.
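The `write` and `truncate` rules above (zero-filled gaps, `APPEND` ignoring the offset, the 64 KiB bound) can be sketched on an in-memory buffer. `FileContent` and `WriteError` are hypothetical names, and the sketch leaves out persistence and the writer-slot check.

```rust
pub const MAX_FILE_BYTES: usize = 64 * 1024;

#[derive(Debug, PartialEq)]
pub enum WriteError {
    TooLarge,
}

/// Hypothetical in-memory file body; `append` models a handle opened APPEND.
pub struct FileContent {
    pub bytes: Vec<u8>,
    pub append: bool,
}

impl FileContent {
    /// Overwrite or extend at `offset`, zero-filling any gap. An APPEND
    /// handle ignores `offset` and always writes at end-of-file.
    pub fn write(&mut self, offset: u64, data: &[u8]) -> Result<u32, WriteError> {
        if offset > MAX_FILE_BYTES as u64 {
            return Err(WriteError::TooLarge);
        }
        let start = if self.append { self.bytes.len() } else { offset as usize };
        let end = start.checked_add(data.len()).ok_or(WriteError::TooLarge)?;
        if end > MAX_FILE_BYTES {
            return Err(WriteError::TooLarge);
        }
        if end > self.bytes.len() {
            self.bytes.resize(end, 0);
        }
        self.bytes[start..end].copy_from_slice(data);
        Ok(data.len() as u32)
    }

    /// Shrink (discard the tail) or extend (zero-fill) to `length`.
    pub fn truncate(&mut self, length: u64) -> Result<(), WriteError> {
        if length > MAX_FILE_BYTES as u64 {
            return Err(WriteError::TooLarge);
        }
        self.bytes.resize(length as usize, 0);
        Ok(())
    }
}
```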

### Syscall Trace: Reading a File from a FAT USB Drive

Four userspace processes: App, FAT service, USB mass storage, xHCI driver.

**With promise pipelining (one submission):**

Cap'n Proto promise pipelining lets the App chain dependent calls without
waiting for intermediate results. The App submits a single pipelined
request: "open this file, then read from the result":

```
# Single pipelined submission (SQEs with PIPELINE flag):
#   call 0: dir.open("report.pdf")         → answer_id=200, user_data=100
#   call 1: answer 200 result_cap[0].read(offset=0, len=4096)

cap_submit([
    {cap=2, method=OPEN, answer=200, user_data=100, params={"report.pdf", flags=0}},
    {cap=PIPELINE(answer=200, result_cap=0), method=READ, user_data=101, params={offset:0, length:4096}},
])
  → kernel routes call 0 to FAT service via Endpoint
  → FAT service reads directory entry from BlockDevice
  → FAT service creates FileCapObject, replies with File cap as result cap 0
  → kernel sees pipelined call 1 targeting the File cap from call 0
  → kernel dispatches call 1 to the same FAT service (or direct-invokes
    the new File CapObject if it's a local endpoint)
  → FAT service maps offset → cluster chain → LBA
  → FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
      → USB mass storage → xHCI → hardware → back up
  ← completion: {data: [4096 bytes]}, File cap installed as cap_id=5
```

One app-to-kernel transition. The kernel resolves the pipeline dependency
internally through the sideband `CapTransferResult` record at index 0; it does
not inspect the Cap'n Proto result payload. The App never needs a userspace
round trip for the intermediate File cap, though the cap is installed and usable
afterward.

This is a core Cap'n Proto feature: by expressing "call method on the
not-yet-resolved result of another call," the client avoids a round-trip
for each link in the chain. For deeper chains (e.g., `dir.sub("a").sub("b")
.open("file").read(0, 4096)`), the savings compound — one submission instead
of four sequential syscalls.

The capability-ring version should follow the Cap'n Proto/CapTP prior-art
shape captured in
[Cloudflare, Cap'n Proto, Workers RPC, and Cap'n Web](../research/cloudflare-capnproto-workers.md)
and [Spritely, OCapN, and CapTP](../research/spritely-captp-ocapn.md):
pipelined targets live in answer/result-cap namespaces, not in caller-selected
global ids; result-cap metadata stays outside the Cap'n Proto payload; broken
answers propagate failure to dependent calls; and answer slots, queued
dependent calls, queued bytes, and remote references are charged to bounded
resource ledgers. This is design grounding, not an OCapN or Cap'n Web
wire-compatibility target.

**Without pipelining (two sequential ring submissions):**

Without promise pipelining, the App submits two separate CALL SQEs via the
ring, blocking on each completion before submitting the next:

```
# 1. Open file (App holds Directory cap, cap_id=2)
# App writes CALL SQE: {cap=2, method=OPEN, params={"report.pdf", flags=0}}
cap_enter(min_complete=1, timeout=MAX)
  → kernel routes CALL to FAT service via Endpoint
  → FAT service reads directory entry from BlockDevice
  → FAT service creates FileCapObject for this file
  → FAT service posts RETURN SQE with [FileCapObject] in xfer_caps
  → kernel installs File cap in App's table → cap_id=5
  ← App reads CQE: result={file: cap_index=0}, new_caps=[5]

# 2. Read 4096 bytes from offset 0
# App writes CALL SQE: {cap=5, method=READ, params={offset:0, length:4096}}
cap_enter(min_complete=1, timeout=MAX)
  → kernel routes CALL to FAT service
  → FAT service maps offset → cluster chain → LBA
  → FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
      → kernel routes to USB mass storage
      → mass storage submits CALL SQE: {cap=usb_cap, method=BULK_TRANSFER, params={scsi_cmd}}
          → kernel routes to xHCI driver
          → xHCI programs TRBs, waits for interrupt
          ← returns raw sector data
      ← returns sector data
  ← FAT service extracts file bytes, posts RETURN SQE with {data: [4096 bytes]}
```

This works but costs two round-trips where pipelining needs one. The
synchronous path is useful for simple cases and bootstrapping; pipelining
is the intended steady-state model.

In both cases, the intermediate IPC hops (FAT → USB mass storage → xHCI)
are invisible to the App.

## Capability-Native Store

### The Store Capability

Once the system is running, persistent storage is provided by a userspace
service — the **store**. It's backed by a block device (virtio-blk), and
exposes a content-addressed object store where objects are capnp messages.

```capnp
interface Store {
    # Store a capnp message, returns its content hash
    put @0 (data :Data) -> (hash :Data);
    # Retrieve by hash
    get @1 (hash :Data) -> (data :Data);
    # Check existence
    has @2 (hash :Data) -> (exists :Bool);
    # Delete (if caller has authority — see note below)
    delete @3 (hash :Data) -> ();
}
```

**Note on `delete`:** In a content-addressed store, deleting a hash can break
references from other namespaces pointing to the same object. `delete` on the
base `Store` interface is dangerously broad — a `StoreAdmin` interface
(separate from `Store`) may be more appropriate, with `delete` restricted to a
GC service that can verify no live references exist. Open Question #3 (GC)
should be resolved before implementing `delete`. The attenuation table below
lists `Store (full)` as "Read, write, delete any object" — in practice, most
callers should receive a `Store` attenuated to put/get/has only.

Content-addressed means:
- Deduplication is automatic (same content = same hash)
- Integrity is verifiable (hash the data, compare)
- References between objects are just hashes embedded in capnp messages
- No mutable paths — "updating a file" means storing a new version and
  updating the reference
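The properties listed above can be seen in a toy in-memory store. This is a sketch under loud assumptions: `MemStore` is hypothetical, and `toy_hash` is 64-bit FNV-1a, used only because it fits in a few lines. It is not collision-resistant, and the proposal does not name a hash function here; a real store would need a cryptographic hash.

```rust
use std::collections::BTreeMap;

/// 64-bit FNV-1a. A stand-in only: NOT a cryptographic hash.
pub fn toy_hash(data: &[u8]) -> [u8; 8] {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h.to_be_bytes()
}

/// Hypothetical in-memory model of the `Store` interface.
#[derive(Default)]
pub struct MemStore {
    objects: BTreeMap<[u8; 8], Vec<u8>>,
}

impl MemStore {
    /// Store bytes under their content hash. Identical content maps to the
    /// same key, so a repeated put adds nothing.
    pub fn put(&mut self, data: &[u8]) -> [u8; 8] {
        let hash = toy_hash(data);
        self.objects.entry(hash).or_insert_with(|| data.to_vec());
        hash
    }

    pub fn get(&self, hash: &[u8; 8]) -> Option<&[u8]> {
        self.objects.get(hash).map(|v| v.as_slice())
    }

    pub fn has(&self, hash: &[u8; 8]) -> bool {
        self.objects.contains_key(hash)
    }

    /// Number of distinct objects held.
    pub fn len(&self) -> usize {
        self.objects.len()
    }
}
```

A reader verifies integrity by rehashing what `get` returned and comparing it with the hash it asked for.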

### Mutable References: Namespaces

A `Namespace` capability provides mutable name-to-hash mappings on top of
the immutable store:

```capnp
interface Namespace {
    # Resolve a name to a store hash
    resolve @0 (name :Text) -> (hash :Data);
    # Bind a name to a hash (if caller has write authority)
    bind @1 (name :Text, hash :Data) -> ();
    # List names (if caller has list authority)
    list @2 () -> (names :List(Text));
    # Get a sub-namespace (attenuated — restricted to a prefix)
    sub @3 (prefix :Text) -> (ns :Namespace);
}
```

A `Namespace` cap scoped to `"config/"` can only see and modify names under
that prefix. This is the analog of a chroot — but structural, not a kernel
hack. The `sub()` method returns a new Namespace cap via IPC cap transfer.
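Prefix scoping can be sketched as a handle that carries a fixed prefix and prepends it to every name before touching the shared bindings. The model below is an in-process illustration: `Namespace` here is a hypothetical struct sharing a map through `Rc<RefCell<...>>`, whereas the real design hands out separate caps over IPC.

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;
use std::rc::Rc;

type Bindings = Rc<RefCell<BTreeMap<String, Vec<u8>>>>;

/// Hypothetical in-process model of a prefix-scoped `Namespace` cap.
#[derive(Clone)]
pub struct Namespace {
    bindings: Bindings,
    prefix: String,
}

impl Namespace {
    pub fn root() -> Self {
        Namespace {
            bindings: Rc::new(RefCell::new(BTreeMap::new())),
            prefix: String::new(),
        }
    }

    /// Attenuate: the returned handle can only ever extend the prefix.
    pub fn sub(&self, prefix: &str) -> Namespace {
        Namespace {
            bindings: Rc::clone(&self.bindings),
            prefix: format!("{}{}", self.prefix, prefix),
        }
    }

    pub fn bind(&self, name: &str, hash: &[u8]) {
        let full = format!("{}{}", self.prefix, name);
        self.bindings.borrow_mut().insert(full, hash.to_vec());
    }

    pub fn resolve(&self, name: &str) -> Option<Vec<u8>> {
        let full = format!("{}{}", self.prefix, name);
        let map = self.bindings.borrow();
        map.get(&full).cloned()
    }

    /// List names under this handle's prefix, with the prefix stripped.
    pub fn list(&self) -> Vec<String> {
        let map = self.bindings.borrow();
        map.keys()
            .filter(|k| k.starts_with(self.prefix.as_str()))
            .map(|k| k[self.prefix.len()..].to_string())
            .collect()
    }
}
```

Names are opaque strings in this sketch, so there is no `..` to interpret: a scoped handle has no operation that shortens its prefix.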

**Future: union composition.** The [research survey](../research/capability-systems-survey.md) recommends
extending `Namespace` with Plan 9-inspired union semantics — a `union(other,
mode)` method that merges two namespaces with before/after/replace ordering.
This adds composability without a global mount table. See
[research survey](../research/capability-systems-survey.md) §6.

## IPC and Capability Transfer

Several storage operations return new capabilities: `Directory.open()`
returns a File, `Directory.sub()` returns a Directory, `Namespace.sub()`
returns a Namespace. This requires dynamic capability management — the kernel
must install new capabilities in a process's CapTable at runtime as part of
IPC.

### The Capability Ring

All kernel-userspace interaction goes through a **shared-memory ring pair**
(submission queue + completion queue), inspired by io_uring. SQE opcodes map
to capnp-rpc Level 1 message types. The ring is allocated per-process at
spawn time and mapped into the process's address space.

**Syscall surface: 2 syscalls.** New capabilities, operations, and transfer
mechanisms are expressed as new SQE opcodes instead of expanding the syscall
ABI.

| # | Syscall | Purpose |
|---|---|---|
| 1 | `exit(code)` | Terminate current thread; process exits after its last live thread |
| 2 | `cap_enter(min_complete, timeout_ns)` | Process pending SQEs, then wait until enough CQEs exist or the timeout expires |

Writing SQEs is **syscall-free**, but ordinary capability CALLs make progress
through `cap_enter`. Timer polling handles non-CALL ring work, plus CALLs to
targets that explicitly opt into interrupt-context dispatch. `cap_enter`
flushes pending SQEs and can block the process until `min_complete`
completions are available or a finite timeout expires. An indefinite wait uses
`timeout_ns = u64::MAX`; `timeout_ns = 0` keeps the call non-blocking. A future
SQPOLL-style worker can reintroduce a zero-syscall CALL-completion hot path
without running arbitrary capability methods from timer interrupt context.
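The waiting rules for `cap_enter` can be written down as a small decision function. The sketch is a model of the semantics stated above, not kernel code: `Wait`, `classify_timeout`, and `should_return` are hypothetical names.

```rust
/// Hypothetical classification of the `timeout_ns` argument.
#[derive(Debug, PartialEq)]
pub enum Wait {
    NonBlocking,
    Until(u64),
    Indefinite,
}

pub fn classify_timeout(timeout_ns: u64) -> Wait {
    match timeout_ns {
        0 => Wait::NonBlocking,
        u64::MAX => Wait::Indefinite,
        ns => Wait::Until(ns),
    }
}

/// Decide whether `cap_enter` may return, given how many CQEs are ready
/// and how long the caller has already waited.
pub fn should_return(completed: u32, min_complete: u32, wait: &Wait, elapsed_ns: u64) -> bool {
    if completed >= min_complete {
        return true;
    }
    match wait {
        Wait::NonBlocking => true,
        Wait::Until(ns) => elapsed_ns >= *ns,
        Wait::Indefinite => false,
    }
}
```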

The ring structs and synchronous CALL dispatch are implemented and working.
See `capos-config/src/ring.rs` for the shared ring structs and
`kernel/src/cap/ring.rs` for kernel-side processing.

### Ring Layout

One 4 KiB page per process, mapped into both kernel (HHDM) and user space:

```
┌─────────────────────────┐  offset 0
│ Ring Header              │  SQ/CQ head, tail, mask, flags
├─────────────────────────┤  offset 128
│ SQE Array (16 × 64B)    │  submission queue entries
├─────────────────────────┤  offset 1152
│ CQE Array (32 × 32B)    │  completion queue entries
└─────────────────────────┘

SQ: userspace owns tail (producer), kernel owns head (consumer)
CQ: kernel owns tail (producer), userspace owns head (consumer)
```
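The offsets in the diagram follow from the entry counts and sizes, which the constants below make explicit. The constant and function names are hypothetical (the real definitions live in `capos-config/src/ring.rs`), and indexing slots with a free-running counter and a power-of-two mask is an assumption based on the `mask` field in the header.

```rust
pub const PAGE_SIZE: usize = 4096;
pub const HEADER_SIZE: usize = 128;
pub const SQ_ENTRIES: usize = 16;
pub const SQE_SIZE: usize = 64;
pub const CQ_ENTRIES: usize = 32;
pub const CQE_SIZE: usize = 32;

/// SQE array starts right after the header.
pub const SQ_OFFSET: usize = HEADER_SIZE;
/// CQE array starts right after the SQE array: 128 + 16 * 64.
pub const CQ_OFFSET: usize = SQ_OFFSET + SQ_ENTRIES * SQE_SIZE;
/// First byte past the CQE array; must fit in the single ring page.
pub const RING_END: usize = CQ_OFFSET + CQ_ENTRIES * CQE_SIZE;

/// Byte offset of the SQE slot for a free-running index.
pub fn sqe_offset(index: u32) -> usize {
    SQ_OFFSET + ((index as usize) & (SQ_ENTRIES - 1)) * SQE_SIZE
}

/// Byte offset of the CQE slot for a free-running index.
pub fn cqe_offset(index: u32) -> usize {
    CQ_OFFSET + ((index as usize) & (CQ_ENTRIES - 1)) * CQE_SIZE
}
```

With these sizes the ring occupies 2176 bytes, leaving the rest of the 4 KiB page unused.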

### SQE Opcodes

Six opcodes cover client calls, server dispatch, capability transfer,
pipelining, lifecycle, and timeouts:

| Opcode | capnp-rpc analog | Purpose |
|---|---|---|
| `CALL` | Call | Invoke method on a capability |
| `RETURN` | Return | Respond to incoming call (server side) |
| `RECV` | (implicit) | Wait for incoming calls on Endpoint |
| `RELEASE` | Release | Drop a capability reference |
| `FINISH` | Finish | Release pipeline answer state |
| `TIMEOUT` | — | Post a CQE after N nanoseconds (io_uring-inspired) |

`TIMEOUT` is an alternative to the `timeout_ns` argument on `cap_enter`:
it works with zero-syscall polling (kernel fires the CQE on a timer tick)
and composes with LINK/DRAIN for deadline-based chains.

SQE flags: `PIPELINE` (cap_id is a promise reference), `LINK` (chain to
next SQE), `MULTISHOT` (keep generating CQEs), `DRAIN` (barrier).

### Promise Pipelining

A CALL SQE can target either a concrete CapId or a **PromisedAnswer**
reference (via the `PIPELINE` flag + `pipeline_dep`/`pipeline_field` fields).
`pipeline_dep` names the earlier answer and `pipeline_field` is a zero-based
`CapTransferResult` record index in that answer's sideband result-cap list, not
a Cap'n Proto schema field. The kernel resolves the dependency chain internally:

```
SQE[0]: CALL dir.open("report.pdf")        → answer_id=200, user_data=100
SQE[1]: CALL [PIPELINE: dep=200, result_cap=0].read(0, 4096)  → user_data=101
```

One `cap_enter` call. The kernel dispatches SQE[0], resolves result cap record
0 from the completion sideband, and dispatches SQE[1] against it without
returning to userspace between steps or parsing the result payload.
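The resolution step can be sketched as a lookup from an answer id to that answer's sideband result-cap list. This is a simplified model: `Target`, `Answers`, and `ResolveError` are hypothetical names, and the sketch omits the case where the answer has not completed yet.

```rust
use std::collections::HashMap;

pub type CapId = u32;
pub type AnswerId = u32;

/// What a CALL SQE targets: a concrete cap, or a result cap of an
/// earlier answer (the PIPELINE case).
pub enum Target {
    Cap(CapId),
    Pipeline { dep: AnswerId, result_cap: usize },
}

#[derive(Debug, PartialEq)]
pub enum ResolveError {
    UnknownAnswer,
    NoSuchResultCap,
}

/// Completed answers and their sideband result-cap lists. The Cap'n Proto
/// result payload is deliberately not part of this table.
pub struct Answers {
    result_caps: HashMap<AnswerId, Vec<CapId>>,
}

impl Answers {
    pub fn new() -> Self {
        Answers { result_caps: HashMap::new() }
    }

    /// Record the caps installed when an answer completed.
    pub fn record(&mut self, answer: AnswerId, caps: Vec<CapId>) {
        self.result_caps.insert(answer, caps);
    }

    /// Resolve a target to a concrete cap id using only sideband records.
    pub fn resolve(&self, target: &Target) -> Result<CapId, ResolveError> {
        match *target {
            Target::Cap(id) => Ok(id),
            Target::Pipeline { dep, result_cap } => {
                let caps = self
                    .result_caps
                    .get(&dep)
                    .ok_or(ResolveError::UnknownAnswer)?;
                caps.get(result_cap)
                    .copied()
                    .ok_or(ResolveError::NoSuchResultCap)
            }
        }
    }
}
```

In the `report.pdf` example, answer 200 would record one result cap, and the pipelined `read` resolves to it through index 0 without the payload being parsed.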

### The Endpoint Kernel Object

For cross-process IPC, an **Endpoint** connects client-side proxy caps to a
server's receive loop:

```
Client's CapTable                                   Server's CapTable
┌─────────────────┐                                 ┌──────────────────┐
│ cap 2: Proxy     │                                 │ cap 0: Endpoint   │
│   → endpoint ────────── Endpoint ◄──── RECV SQE ──│                  │
│   badge: 42      │     (kernel obj)                │                  │
└─────────────────┘                                 └──────────────────┘
```

The server posts a `RECV` SQE (with `MULTISHOT` flag). Incoming calls appear
as CQEs with badge, interface_id, method_id, and a kernel-assigned call_id.
The server responds by posting a `RETURN` SQE referencing the call_id.

`interface_id` is the transported schema ID for the interface being invoked.
It should equal the generated `TYPE_ID` for that capnp interface. `cap_id` is
the authority-bearing table handle; `interface_id` is only the protocol tag.
The target capability entry owns one public interface; `method_id` selects a
method inside that interface, while `cap_id` identifies the object being
invoked. If the same backing state needs another interface, the transport
should mint a separate capability entry for that interface rather than letting
one handle accept multiple unrelated `interface_id` values.
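
A dispatch-time check consistent with this rule might look like the sketch below. `CapEntry`, `check_call`, and the error names are hypothetical, and the order of the checks (handle first, then protocol tag, then method) is an assumption.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DispatchError {
    InvalidCap,
    InterfaceMismatch,
    UnknownMethod,
}

/// One cap-table entry: it owns exactly one public interface.
pub struct CapEntry {
    pub interface_id: u64,
    pub method_count: u16,
}

/// Validate an incoming call against the target entry.
/// `cap_id` carries the authority; `interface_id` is only a protocol tag
/// that must match the single interface the entry exposes.
pub fn check_call(
    table: &[Option<CapEntry>],
    cap_id: usize,
    interface_id: u64,
    method_id: u16,
) -> Result<(), DispatchError> {
    let entry = table
        .get(cap_id)
        .and_then(|slot| slot.as_ref())
        .ok_or(DispatchError::InvalidCap)?;
    if entry.interface_id != interface_id {
        return Err(DispatchError::InterfaceMismatch);
    }
    if method_id >= entry.method_count {
        return Err(DispatchError::UnknownMethod);
    }
    Ok(())
}
```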

### Direct-Switch IPC

When a client's CALL targets a cap served by a blocked server (waiting on
RECV), the kernel marks that server as the direct IPC handoff target so the
next context-switch path runs the callee before unrelated round-robin work.
The current implementation still uses the ordinary saved-context restore path;
small-message register transfer remains a future fastpath after measurement.
See [research survey](../research/capability-systems-survey.md) §2.

### Capability Transfer via Ring

Capabilities travel as sideband arrays (`CapTransferDescriptor`) alongside capnp
message bytes:

- **CALL params**: params buffer contains the capnp message bytes followed by
  `xfer_cap_count` transfer descriptors packed at `addr + len`, which must be
  aligned to `CAP_TRANSFER_DESCRIPTOR_ALIGNMENT`.
- **RETURN results**: server result buffers carry the capnp reply bytes and may
  carry return transfer descriptors at `addr + len`; the kernel inserts
  destination capability records in the caller's result buffer after the normal
  result bytes. The count is reported in CQE `cap_count`, and those records are
  written as `CapTransferResult { cap_id, interface_id }` values at
  `result_addr + result`. The requested result buffer (`result_len`) must be
  large enough for both normal reply bytes and all appended `cap_count`
  records.

`xfer_cap_count > 0` with malformed descriptor metadata (bad mode bits, reserved
bits, `_reserved0`, or misalignment) fails closed as
`CAP_ERR_INVALID_TRANSFER_DESCRIPTOR`. Kernels that have not yet enabled transfer
handling should return `CAP_ERR_TRANSFER_NOT_SUPPORTED` for transfer-bearing SQEs.

The capnp wire format's `WirePointerKind::Other` encodes capability indices
in messages. The sideband arrays map these indices to actual CapIds. The
kernel does not parse capnp messages — it transfers a list of caps alongside
the opaque message bytes.
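
The buffer arithmetic implied by the two bullets above can be sketched as follows. The constant values (8-byte alignment, 16-byte result records) are assumptions for illustration; the real values belong to the capOS ABI.

```rust
/// Assumed values; the real constants are defined by the capOS ABI.
pub const CAP_TRANSFER_DESCRIPTOR_ALIGNMENT: u64 = 8;
pub const CAP_TRANSFER_RESULT_SIZE: u64 = 16;

/// CALL side: descriptors are packed at `addr + len`, which must be aligned.
/// Returns the descriptor start address, or None on overflow or misalignment.
pub fn descriptor_start(addr: u64, len: u64) -> Option<u64> {
    let start = addr.checked_add(len)?;
    if start % CAP_TRANSFER_DESCRIPTOR_ALIGNMENT == 0 {
        Some(start)
    } else {
        None
    }
}

/// RETURN side: the caller's `result_len` must cover the reply bytes plus
/// `cap_count` appended `CapTransferResult` records.
pub fn required_result_len(reply_len: u64, cap_count: u64) -> Option<u64> {
    cap_count
        .checked_mul(CAP_TRANSFER_RESULT_SIZE)?
        .checked_add(reply_len)
}
```

Using checked arithmetic matters here because `addr`, `len`, and `cap_count` all come from untrusted userspace SQEs.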

### Dynamic Capability Management

Every `open()`, `sub()`, or `resolve()` creates and transfers a new
capability at runtime. The kernel's CapTable `insert()` and `remove()` are
the primitives. Capabilities flow through RETURN SQE sideband arrays (and
through the manifest at boot). No separate `cap_grant` mechanism needed —
authority flow follows the ring's IPC graph.

The CapTable generation counter handles stale references: when a File cap is
closed (slot freed, generation bumps), any cached CapId returns
`StaleGeneration` instead of accidentally hitting a new occupant.
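
A minimal sketch of the generation check, assuming a `CapId` made of a slot index plus a generation. The layout and the slot-reuse policy are illustrative, not the kernel's actual `CapTable`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CapId {
    pub index: u32,
    pub generation: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CapError {
    InvalidIndex,
    StaleGeneration,
}

struct Slot<T> {
    generation: u32,
    value: Option<T>,
}

pub struct CapTable<T> {
    slots: Vec<Slot<T>>,
}

impl<T> CapTable<T> {
    pub fn new() -> Self {
        CapTable { slots: Vec::new() }
    }

    /// Insert into the first free slot, or grow the table.
    pub fn insert(&mut self, value: T) -> CapId {
        if let Some(index) = self.slots.iter().position(|s| s.value.is_none()) {
            let slot = &mut self.slots[index];
            slot.value = Some(value);
            return CapId { index: index as u32, generation: slot.generation };
        }
        self.slots.push(Slot { generation: 0, value: Some(value) });
        CapId { index: (self.slots.len() - 1) as u32, generation: 0 }
    }

    /// Free the slot and bump its generation so cached ids go stale.
    pub fn remove(&mut self, id: CapId) -> Result<T, CapError> {
        let slot = self
            .slots
            .get_mut(id.index as usize)
            .ok_or(CapError::InvalidIndex)?;
        if slot.generation != id.generation {
            return Err(CapError::StaleGeneration);
        }
        let value = slot.value.take().ok_or(CapError::StaleGeneration)?;
        slot.generation = slot.generation.wrapping_add(1);
        Ok(value)
    }

    /// Look up a cap; an id from before the last free fails closed.
    pub fn get(&self, id: CapId) -> Result<&T, CapError> {
        let slot = self
            .slots
            .get(id.index as usize)
            .ok_or(CapError::InvalidIndex)?;
        if slot.generation != id.generation {
            return Err(CapError::StaleGeneration);
        }
        slot.value.as_ref().ok_or(CapError::StaleGeneration)
    }
}
```

In this sketch a closed File's slot can be reused immediately by another cap, and the old id still fails with `StaleGeneration` because the generation no longer matches.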

## Shared Memory for Bulk Data

Copying file data through capnp `Data` fields works for metadata and small
reads, but is impractical for anything above a few KB. A 1 MB read through
a capability CALL copies data four times: device → driver heap → capnp
message → kernel buffer → client buffer.

### SharedBuffer Capability

`SharedBuffer` is the service-facing name this proposal uses for bulk-transfer
buffers. The implemented kernel/user substrate is `MemoryObject`: a capability
backed by physical pages that can be mapped into multiple address spaces
simultaneously. Zero copies between processes.

```capnp
interface MemoryObject {
    # Size and page count of the backing object.
    info @0 () -> (pageCount :UInt32, sizeBytes :UInt64);
    # Map a page-aligned object range into the caller's address space.
    map @1 (hint :UInt64, offset :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
    # Unmap a caller-local borrowed mapping backed by this object.
    unmap @2 (addr :UInt64, size :UInt64) -> ();
    # Update caller-local page permissions for a borrowed mapping.
    protect @3 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}
```

The kernel creates MemoryObjects through the existing `FrameAllocator`
capability. Held MemoryObject caps charge the holder's frame-grant quota; mapped
address-space pages are tracked as borrowed pages and keep the same backing
alive until unmapped or process teardown. A later `SharedBuffer` alias or
allocator may wrap this ABI for storage/network interfaces, but current code
should use `MemoryObject` directly.

### File I/O with SharedBuffer

File and BlockDevice interfaces support both inline-Data and SharedBuffer
modes:

```
# Small read (< ~4 KB): inline in capnp message
file.read(offset=0, length=256) → {data: [256 bytes]}

# Large read: caller provides SharedBuffer, server fills it
let buf = frame_alloc.allocContiguous(256);  # 1 MB MemoryObject / SharedBuffer
file.readBuf(offset=0, buf, length=1048576) → {bytesRead: 1048576}
# Data is now in buf's mapped pages — no copy through kernel
```

Extended File interface with SharedBuffer support:

```capnp
interface File {
    read      @0 (offset :UInt64, length :UInt32) -> (data :Data);
    write     @1 (offset :UInt64, data :Data) -> (written :UInt32);
    readBuf   @2 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (bytesRead :UInt32);
    writeBuf  @3 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (written :UInt32);
    stat      @4 () -> (size :UInt64, created :UInt64, modified :UInt64);
    truncate  @5 (length :UInt64) -> ();
    sync      @6 () -> ();
    close     @7 () -> ();
}
```

The `readBuf`/`writeBuf` methods accept a SharedBuffer cap, currently a
MemoryObject cap transferred via IPC. The server maps the buffer, performs DMA
or memory copies into it, then returns. The caller reads directly from the
mapped pages.

For BlockDevice, the same pattern applies — the driver maps the SharedBuffer,
programs DMA descriptors pointing to its physical pages, and the device
writes directly into the shared memory.

### When to Use Each Mode

| Scenario | Mechanism | Why |
|---|---|---|
| Reading a 64-byte config value | `File.read()` inline Data | Copy overhead negligible |
| Reading a 10 MB binary | `File.readBuf()` SharedBuffer | Avoids 4× copy overhead |
| FAT directory entry (32 bytes) | `BlockDevice.readBlocks()` inline | Small metadata read |
| Streaming video frames | `File.readBuf()` + ring of SharedBuffers | Continuous zero-copy |
| Network packet buffers | SharedBuffer ring between NIC driver and net stack | DMA-capable pages |
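
A client-side helper choosing between the two modes could look like this sketch. The 4 KiB threshold mirrors the "~4 KB" figure above, and the 4 KiB page size is an assumption; `ReadMode` and `choose_read_mode` are hypothetical names.

```rust
/// Assumed page size and inline cutoff.
pub const PAGE_SIZE: u64 = 4096;
pub const INLINE_THRESHOLD: u64 = 4096;

#[derive(Debug, PartialEq, Eq)]
pub enum ReadMode {
    /// Use `File.read()` with inline capnp `Data`.
    Inline,
    /// Use `File.readBuf()` with a buffer of this many pages.
    SharedBuffer { pages: u64 },
}

/// Pick the transfer mode for a read of `length` bytes.
pub fn choose_read_mode(length: u64) -> ReadMode {
    if length < INLINE_THRESHOLD {
        ReadMode::Inline
    } else {
        // Round up to whole pages without risking overflow.
        let pages = length / PAGE_SIZE + u64::from(length % PAGE_SIZE != 0);
        ReadMode::SharedBuffer { pages }
    }
}
```

With these assumptions a 1 MiB read needs 256 pages, matching the `allocContiguous(256)` example earlier in this section.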

## Attenuation

Storage services mint restricted capabilities using wrapper CapObjects:

| Capability | Authority |
|---|---|
| `Directory` (full) | Open, list, mkdir, remove, sub |
| `Directory` (read-only) | Open (returns read-only Files), list, sub only |
| `File` (full) | Read, write, stat, truncate, sync |
| `File` (read-only) | Read and stat only |
| `File` (append-only) | Read, stat, write at end only |
| `Store` (full) | Read, write, delete any object |
| `Store` (read-only) | Get and has only |
| `Namespace` (full) | Resolve, bind, list under prefix |
| `Namespace` (read-only) | Resolve and list only |
| `Blob` (single object) | Read one specific hash |
| `SharedBuffer` (read-only) | Map as read-only (page table: R, no W) |

An application that only needs to read its config gets a read-only
`Directory` scoped to its config path. It can't write, can't see other
apps' directories, can't access the raw BlockDevice.
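
The wrapper pattern can be sketched with a trait and a newtype. The `File` trait here is a simplified stand-in for the capnp interface, and `RamFile` and `ReadOnly` are hypothetical names, not the kernel's `CapObject` types.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum FileError {
    ReadOnly,
}

/// Simplified stand-in for the File interface (read and write only).
pub trait File {
    fn read(&self, offset: usize, length: usize) -> Vec<u8>;
    fn write(&mut self, offset: usize, data: &[u8]) -> Result<usize, FileError>;
}

/// Full-authority in-memory file.
pub struct RamFile {
    pub bytes: Vec<u8>,
}

impl File for RamFile {
    fn read(&self, offset: usize, length: usize) -> Vec<u8> {
        let start = offset.min(self.bytes.len());
        let end = start.saturating_add(length).min(self.bytes.len());
        self.bytes[start..end].to_vec()
    }

    fn write(&mut self, offset: usize, data: &[u8]) -> Result<usize, FileError> {
        let end = offset + data.len();
        if self.bytes.len() < end {
            self.bytes.resize(end, 0);
        }
        self.bytes[offset..end].copy_from_slice(data);
        Ok(data.len())
    }
}

/// Attenuating wrapper: forwards reads, rejects every mutation.
pub struct ReadOnly<F: File>(pub F);

impl<F: File> File for ReadOnly<F> {
    fn read(&self, offset: usize, length: usize) -> Vec<u8> {
        self.0.read(offset, length)
    }

    fn write(&mut self, _offset: usize, _data: &[u8]) -> Result<usize, FileError> {
        Err(FileError::ReadOnly)
    }
}
```

The holder of a `ReadOnly` wrapper has no way to reach the inner object's `write`, so the restriction is structural rather than a flag checked at call time.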

## Naming Without Paths

Traditional OS: process opens `/var/lib/myapp/data.db` — a global path.

capOS: process receives a `Directory` or `Namespace` cap at spawn time,
opens `"data.db"` within it. The process has no idea where on disk this
lives. It can't traverse upward. There is no global root.

```
# Traditional: global path namespace
/
├── etc/
│   └── myapp/
│       └── config.toml
├── var/
│   └── lib/
│       └── myapp/
│           └── data.db
└── sbin/
    └── myapp

# capOS: per-process capability set (no global namespace)
Process "myapp" sees:
  "config" → Directory(read-only, scoped to myapp's config files)
  "data"   → Directory(read-write, scoped to myapp's data files)
  "state"  → Namespace(read-write, scoped to myapp's store objects)
  "log"    → Console cap
  "api"    → HttpEndpoint cap
```

The process doesn't know or care about the backing storage layout. It just
uses the capabilities it was granted.
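
One way a scoped `Directory` can rule out upward traversal is to accept only single-component names. The sketch below assumes that policy; the `Directory` struct and `OpenError` names are illustrative and return raw bytes instead of a `File` cap to stay short.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq, Eq)]
pub enum OpenError {
    InvalidName,
    NotFound,
}

/// A directory scoped to one subtree. It holds no reference to a parent,
/// so there is nothing for a client to traverse up to.
#[derive(Default)]
pub struct Directory {
    files: BTreeMap<String, Vec<u8>>,
}

impl Directory {
    pub fn insert(&mut self, name: &str, bytes: &[u8]) {
        self.files.insert(name.to_string(), bytes.to_vec());
    }

    /// Open an entry by name. Names are single components: no separators
    /// and no `.` or `..`, so lookups cannot leave this directory.
    pub fn open(&self, name: &str) -> Result<&[u8], OpenError> {
        let single_component =
            !name.is_empty() && name != "." && name != ".." && !name.contains('/');
        if !single_component {
            return Err(OpenError::InvalidName);
        }
        self.files
            .get(name)
            .map(|bytes| bytes.as_slice())
            .ok_or(OpenError::NotFound)
    }
}
```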

## Configuration

### Build-Time Config (Boot Manifest)

The system manifest is authored at build time. The human-writable source
could be any format — TOML, CUE, or even a Makefile target that generates
the capnp binary. What matters is that it compiles to a `SystemManifest`
capnp message baked into the ISO.

Example source (TOML, compiled to capnp by a build tool):

```toml
[services.virtio-net]
binary = "virtio-net"
restart = "always"
caps = [
    { name = "device_mmio", source = { kernel = "device_mmio" } },
    { name = "interrupt", source = { kernel = "interrupt" } },
    { name = "log", source = { kernel = "console" } },
]
exports = ["nic"]

[services.net-stack]
binary = "net-stack"
restart = "always"
caps = [
    { name = "nic", source = { service = { service = "virtio-net", export = "nic" } } },
    { name = "timer", source = { kernel = "timer" } },
    { name = "log", source = { kernel = "console" } },
]
exports = ["net"]

[services.fat-fs]
binary = "fat-fs"
restart = "always"
caps = [
    { name = "blk", source = { service = { service = "usb-storage", export = "block-device" } } },
    { name = "log", source = { kernel = "console" } },
]
exports = ["root-dir"]

[services.my-app]
binary = "my-app"
restart = "on-failure"
caps = [
    { name = "api", source = { service = { service = "http-service", export = "api" } } },
    { name = "docs", source = { service = { service = "fat-fs", export = "root-dir" } } },
    { name = "data", source = { service = { service = "store", export = "namespace" } } },
    { name = "log", source = { kernel = "console" } },
]
```

A build tool validates this against the capnp schemas (does `virtio-net`
actually export `"nic"`? does `http-service` support `endpoint()` minting?)
and produces the binary manifest.
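
The export check such a build tool performs might be sketched like this. The `ServiceDecl` and `Source` types are simplified, hypothetical mirrors of the TOML above, not the real `SystemManifest` schema or the `mkmanifest` implementation.

```rust
use std::collections::BTreeMap;

/// Where a granted cap comes from (simplified mirror of the TOML `source`).
pub enum Source {
    Kernel(&'static str),
    Service { service: &'static str, export: &'static str },
}

pub struct ServiceDecl {
    pub caps: Vec<(&'static str, Source)>,
    pub exports: Vec<&'static str>,
}

/// Check that every service-sourced cap names a declared service and one
/// of that service's exports. Returns one message per problem found.
pub fn validate(manifest: &BTreeMap<&'static str, ServiceDecl>) -> Vec<String> {
    let mut errors = Vec::new();
    for (name, decl) in manifest {
        for (cap_name, source) in &decl.caps {
            if let Source::Service { service, export } = source {
                match manifest.get(service) {
                    None => errors.push(format!(
                        "{name}.{cap_name}: unknown service {service}"
                    )),
                    Some(provider) if !provider.exports.contains(export) => errors.push(
                        format!("{name}.{cap_name}: {service} does not export {export}"),
                    ),
                    Some(_) => {}
                }
            }
        }
    }
    errors
}
```

Run against only the four services shown in the TOML excerpt, a check like this would flag `usb-storage`, `http-service`, and `store`, which are referenced there but not declared in the excerpt.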

### Runtime Config (via Store)

Once the store service is running, configuration can be stored there and
updated without rebuilding the ISO. The store is just another capability —
a config-management service could watch for changes and signal services to
reload.

## Connection to Network Transparency

If capabilities are the only abstraction, and capnp is the only wire format,
then the transport is irrelevant:

- **Local IPC**: capnp message copied between address spaces by kernel
- **Local store**: capnp message written to block device
- **Remote IPC**: capnp message sent over TCP to another machine
- **Remote store**: capnp message fetched from a remote store service

A capability reference doesn't encode where the backing service lives. The
kernel (or a proxy) handles routing. This means:

- A `Directory` cap could be backed by local FAT or a remote 9P server
- A `Namespace` cap could be backed by local storage or a remote store
- A `Fetch` cap could route through a local HTTP service or a remote proxy
- A `ProcessSpawner` cap could spawn locally or on a remote machine

The system manifest could describe services that run on different machines,
so the capability graph would span the network. This is the "network
transparency" item in the roadmap — it falls out naturally from the model.

## Persistence of the Capability Graph

The live capability graph (which process holds which caps) is ephemeral —
it exists in kernel memory and is lost on reboot. The system manifest
describes the *intended* graph, and init rebuilds it on each boot.

For true persistence (resume after reboot without re-initializing):

1. Each service serializes its state to the store before shutdown
2. On next boot, the manifest includes "restore from store hash X" hints
3. Services read their saved state from the store and resume

This is application-level persistence, not kernel-level. The kernel doesn't
snapshot the capability graph — services are responsible for their own state.
This avoids the complexity of EROS-style transparent persistence while still
allowing stateful services.

## Managed Cloud Backing

The local `Store`/`Namespace` interfaces define capOS persistence semantics. A
cloud backend must be an adapter behind those interfaces, not a new ambient
authority path. Services such as the adventure profile, expedition, and ledger
services should serialize bounded Cap'n Proto records to a store capability; the
caller should not know whether that store is backed by RAM, local disk, or a
managed cloud service.

For cloud-first application data, use a narrow bridge service:

```text
capOS service -> Store/Namespace or app-specific SaveStore cap -> Cloud bridge
              -> provider APIs
```

The bridge owns provider credentials and exposes only typed save/load/append
operations. Ordinary clients never receive provider credentials, bucket names,
database document paths, or broad write authority.

Recommended GCP mapping for game/profile style state:

- Firestore Native mode for small mutable indexes and profile summaries that
  need transactional compare-and-set behavior.
- Cloud Storage for larger immutable snapshots, evidence blobs, exports, and
  content-addressed objects. Object versioning and lifecycle policy should bound
  accidental overwrite recovery and storage growth.
- Cloud Run for a small HTTPS or capnp-over-HTTP bridge endpoint when capOS
  cannot yet link provider SDKs directly.
- Secret Manager for bridge-side service credentials and rotation; secrets do
  not enter ordinary capOS game clients.

Provider-specific records must still carry capOS-level schema version, content
hash or release id, profile/tenant id, monotonic version, size limit, and
migration policy. Writes that race on the same mutable profile or checkpoint
must use an explicit version precondition and fail closed when stale. Append-only
ledgers should append new records with previous-record hashes rather than
rewriting history. Local QEMU tests should use a fake cloud bridge that enforces
the same stale-write, append-only, wrong-profile, and size-bound rules before
any real provider integration is accepted.
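
A fake bridge enforcing the stale-write, wrong-profile, and size-bound rules could start from a sketch like the one below. `FakeBridge`, its method names, and the 4096-byte limit are assumptions; the append-only ledger rule is omitted to keep the example short.

```rust
use std::collections::HashMap;

/// Assumed per-record size bound; the proposal only requires some bound.
pub const MAX_RECORD_BYTES: usize = 4096;

#[derive(Debug, PartialEq, Eq)]
pub enum SaveError {
    WrongProfile,
    TooLarge,
    StaleVersion { current: u64 },
}

/// Hypothetical in-memory stand-in for the cloud bridge, scoped to one profile.
pub struct FakeBridge {
    profile_id: u64,
    records: HashMap<String, (u64, Vec<u8>)>, // key -> (version, bytes)
}

impl FakeBridge {
    pub fn new(profile_id: u64) -> Self {
        FakeBridge { profile_id, records: HashMap::new() }
    }

    /// Save with an explicit version precondition. A record that does not
    /// exist yet has version 0. Stale writers fail closed and learn the
    /// current version so they can reload and retry.
    pub fn save(
        &mut self,
        profile_id: u64,
        key: &str,
        expected_version: u64,
        bytes: &[u8],
    ) -> Result<u64, SaveError> {
        if profile_id != self.profile_id {
            return Err(SaveError::WrongProfile);
        }
        if bytes.len() > MAX_RECORD_BYTES {
            return Err(SaveError::TooLarge);
        }
        let current = self.records.get(key).map_or(0, |(version, _)| *version);
        if expected_version != current {
            return Err(SaveError::StaleVersion { current });
        }
        let next = current + 1;
        self.records.insert(key.to_string(), (next, bytes.to_vec()));
        Ok(next)
    }

    /// Load the current version and bytes for a key, if present.
    pub fn load(&self, key: &str) -> Option<(u64, &[u8])> {
        self.records
            .get(key)
            .map(|(version, bytes)| (*version, bytes.as_slice()))
    }
}
```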

### User-Owned Browser Transport

Some user data should be portable without giving the capOS service operator a
database role over it. For private player backup/sync, a browser can act as the
transport to user-owned storage:

```text
capOS save service -> encrypted save capsule -> browser
browser OAuth/Firebase session -> Google Drive appDataFolder or Firebase user doc
```

This is not the same as the managed cloud bridge above. In the browser-transport
model, the user grants Drive/Firebase access to the web app, the browser writes
opaque encrypted capsules, and capOS never receives the provider tokens. The
encryption key follows the storage domain: local capOS storage uses local
capOS-host key material, while GCP-backed game-world state uses Cloud KMS
envelope encryption, in which a per-world or per-shard KMS KEK wraps
service-owned DEKs.
Google Drive's `appDataFolder` is a good fit for app-private backup files
because it is hidden from ordinary Drive views and can use the narrow
`drive.appdata` scope. Firebase/Firestore can also carry per-user encrypted
capsule documents and provide offline cache/sync behavior, but the backend
cannot validate encrypted game semantics beyond metadata and access rules.

Treat user-owned blobs as backup material, not authority:

- The service validates signatures, profile id, content hash, schema version,
  monotonic version, previous hash, and size bounds before import.
- Append-only ledgers, reward witness records, market receipts, and multiplayer
  outcomes remain service-owned or cloud-bridge-owned authoritative records.
- A user may delete, duplicate, or roll back private blobs; restore code must
  handle that as an expected input, not as trusted history.
- Game-world key capabilities, DEKs, and KMS decrypt/unwrap grants should not
  be exposed to the browser. For GCP-backed worlds, DEK unwrap and plaintext
  use are KMS/IAM-backed authority granted to the relevant game-world service.
  For local capOS storage, local key backup/recovery is a separate local-host
  policy.

For GCP-backed game-world state, provision one Cloud KMS key ring and symmetric
CryptoKey KEK per world instance or shard. This follows the `CloudKmsKeySource`
envelope model from the cryptography/key-management and volume-encryption
proposals: Cloud KMS wraps or unwraps DEKs, and the game-world service uses the
unwrapped DEK internally as service authority, modeled as a `SymmetricKey`
capability. Grant Cloud KMS roles at the CryptoKey level where possible:
`roles/cloudkms.cryptoKeyEncrypter` for encrypt-only writers that wrap new DEKs,
`roles/cloudkms.cryptoKeyDecrypter` for restore or migration paths that unwrap
existing DEKs, and `roles/cloudkms.cryptoKeyEncrypterDecrypter` only for the
narrow game-world service that genuinely needs both operations. Do not model
browser OAuth identities, Drive/Firebase handles, or capOS clients as holders of
DEKs or KMS decrypt/unwrap grants, and do not rely on per-key-version IAM for
this design.

Key rotation and world retirement are service operations, not browser-vault
features. Rotation creates new Cloud KMS KEK versions for future DEK wrapping
but does not re-encrypt existing capsules, rewrite wrapped DEK blobs, or
disable/destroy old versions. Managed re-encryption or rewrapping must unwrap
the old DEK while its KEK version remains usable, decrypt and validate the
capsule inside the game-world service, then write a new capsule with a new DEK
or a DEK rewrapped by the current primary KEK version. Old KEK versions should
only be disabled or destroyed after inventory proves no accepted wrapped DEK
depends on them. Retiring a world removes IAM decrypt authority first; disabling
key versions can make protected capsules inaccessible, while destruction is
delayed by the scheduled destruction period and irreversible once complete, so
audit retention and recovery must be settled before destruction.

## Phases

### Phase 1: Boot Manifest (parallel with Stage 4)

- [x] Define `SystemManifest` schema in `schema/`
- [x] Build tool (`tools/mkmanifest`) that compiles `system.cue` into a
  capnp-encoded manifest and packs it into the ISO as a Limine module
- [x] Kernel parses the manifest and now creates only the `initConfig.init`
  process
- [x] Focused init-executor manifests pass the manifest to the separate `init`
  binary as bytes through the read-only BootPackage capability
- [x] The separate `init` binary is a generic manifest executor for the default
  `system.cue` path and focused init-executor smokes; focused shell-led smokes
  still use `capos-shell` as `initConfig.init`
- No persistent storage yet — boot image is the only data source

### Phase 2: File I/O Interfaces in Schema (parallel with Stage 6)

**Depends on: IPC (Stage 6) for cross-process cap transfer.** Endpoint, RECV,
RETURN, capability transfer in CALL params, and capability transfer in RETURN
results are already implemented. The `BlockDevice` / `File` / `Directory` /
`DirEntry` / `Store` / `Namespace` schema has now landed in full. The
`File` / `Directory` / `Store` / `Namespace` interfaces also have RAM-backed
kernel `CapObject` implementations (Phase 3 slices 1-3); `BlockDevice` remains
schema-only. Userspace services that export `Directory` / `File` / `Store` /
`Namespace` caps over a real backing store have since landed (Phase 3 below),
and the kernel RAM-backed caps are now qemu-only proof/fixture surface rather
than a production persistence service -- see
[Kernel Storage Cap Backers Are Fixtures](#kernel-storage-cap-backers-are-fixtures).
That history shaped two named downstream adapters:

- POSIX adapter Phase P1.4 (vendored `dash` port) **does not require the
  userspace service** for its v0 smoke: the bootstrap-granted RAM-backed
  `Directory` + `Namespace` kernel caps from Phase 3 slices 1-3 are an
  adequate read-only in-rodata pseudo-fs backing, so P1.4 is now ready to
  start on the userspace `libcapos-posix` file/dir/stdio/env/printf surface
  and on dash vendoring; see [POSIX Adapter](posix-adapter-proposal.md)
  Phase P1.4 and `docs/backlog/posix-adapter-dash-port.md`. P1.3 (pipe +
  recording ProcessSpawner-driven fork-for-exec) landed without storage
  caps, so P1.4 is the next surface that consumes this proposal.
- WASI host adapter Phase W.5 (Preview 1 filesystem) similarly consumes
  the same kernel cap shape and is unblocked from the same cap-surface
  perspective; remaining W.5 work is on the wasi-host adapter side. See
  [WASI Host Adapter](wasi-host-adapter-proposal.md) Phase W.5.

Concrete work:

- [x] Add `BlockDevice`, `File`, `Directory`, and `DirEntry` to
  `schema/capos.capnp`, regenerate the checked-in capnp bindings, add the
  `BLOCKDEVICE_INTERFACE_ID` / `FILE_INTERFACE_ID` / `DIRECTORY_INTERFACE_ID`
  constants, and add a `capos-config` host roundtrip test. This was schema-only
  when it landed; kernel `CapObject` implementations followed in Phase 3
  slices 1-3 (the `Store` / `Namespace` interfaces were added in slice 3).
  `SharedBuffer` is not a separate interface -- bulk transfers reuse the
  existing `MemoryObject` capability, and the inline-`Data` `read` / `write` /
  `readBlocks` / `writeBlocks` variants are the v0 surface.
- [ ] Demo: two-process file server (in-memory File/Directory service + client)
  that the POSIX and WASI adapters can resolve preopens against

### Phase 3: RAM-backed Store (after Phase 2)

**Depends on: IPC (Stage 6) for cross-process store access.** Same downstream
blockers as Phase 2 -- the POSIX adapter v0 plan resolves `/etc` / `/lib`
under a read-only `Namespace` once this lands.

Concrete work:

- [x] Slice 1: minimal RAM-backed `File` `CapObject` (`kernel/src/cap/file.rs`).
  `FileCap` is backed by a single in-kernel `Vec<u8>` byte buffer and
  implements the inline-`Data` surface of the landed `File` interface --
  `read` / `write` / `stat` / `truncate` / `sync` / `close` -- with per-call
  payloads bounded at 64 KiB. `close()` invalidates the cap: the cap-table
  `get_slot` path consults `validate_live()` (which returns `Revoked` once
  closed), and an in-`call()` guard is the defense-in-depth backup, so a
  post-close call fails closed with an application exception. A new
  `KernelCapSource::file` grant source lets a manifest grant the cap; the
  `make run-file-server-smoke` QEMU smoke (`demos/file-server-smoke/`,
  `system-file-server-smoke.cue`) drives write/read/stat/close round-trips and
  asserts the closed-cap rejection. Bulk-buffer / `MemoryObject`-mapped
  variants are later slices.
- [x] Slice 2: minimal RAM-backed `Directory` `CapObject`
  (`kernel/src/cap/directory.rs`). `DirectoryCap` is an in-memory namespace
  (`BTreeMap<String, DirectoryEntry>`, where each entry is a `FileCap` or a
  sub-`DirectoryCap`) implementing the landed `Directory` interface --
  `open` / `list` / `mkdir` / `remove` / `sub`. `open` / `mkdir` / `sub`
  mint a `File` / `Directory` result capability through the existing IPC
  result-cap transfer machinery (no new transfer authority); file read/write
  goes through the transferred `File` caps, never through the `Directory`.
  `remove` deletes an entry and `revoke()`s the backing object so every cap
  already handed out for it fails closed on its next dispatch, and refuses a
  non-empty sub-directory; `close()` invalidates the cap and recursively
  revokes the subtree. `sub()` has no attenuation beyond the structural
  scoping every sub-`Directory` already has -- per-method read-only
  attenuation is deferred. A new `KernelCapSource::directory` grant source
  lets a manifest grant the cap; the `make run-directory-server-smoke` QEMU
  smoke (`demos/directory-server-smoke/`,
  `system-directory-server-smoke.cue`) drives open/list/mkdir/remove/sub
  with cap transfer and asserts the post-remove fail-closed rejection.
- [x] Slice 3: `Store` and `Namespace` interfaces in `schema/capos.capnp`
  plus minimal RAM-backed `Store` / `Namespace` kernel `CapObject`s
  (`kernel/src/cap/store.rs`, `kernel/src/cap/namespace.rs`). The schema
  additions are purely additive (`Store` / `Namespace` interfaces and the
  `store @34` / `namespace @35` `KernelCapSource` ordinals); the
  `STORE_INTERFACE_ID` / `NAMESPACE_INTERFACE_ID` constants and a
  `capos-config` host roundtrip test landed alongside. `StoreCap` is a
  content-addressed blob store (`BTreeMap<[u8; 32], Vec<u8>>` keyed by the
  SHA-256 content hash from `capos_lib::content_hash`) implementing
  `put` / `get` / `has` / `delete`; `put` is idempotent for identical
  content, blob and count bounds keep one `Store` from ballooning the kernel
  heap, and `delete` is kept on the base interface for this focused proof
  (the `StoreAdmin` split and a GC-verified delete remain deferred -- see the
  `delete` note above). `NamespaceCap` is a name->hash binding map
  (`BTreeMap<String, Vec<u8>>` for bindings plus a `BTreeMap<String,
  Arc<NamespaceCap>>` of `sub` children) implementing
  `resolve` / `bind` / `list` / `sub`; `bind` overwrites an existing name
  (mutable references are the point), `sub(prefix)` mints a structurally
  scoped child node and transfers it through the existing IPC result-cap
  machinery (no new transfer authority, idempotent for a repeated prefix),
  and the parent->child recursive `revoke()` reuses the same finite-tree
  lock-ordering invariant `DirectoryCap` documents. The bindings are opaque
  hash bytes -- a `NamespaceCap` does not hold a `StoreCap` reference or
  verify the hash names a live blob in this slice. New
  `KernelCapSource::store` / `KernelCapSource::namespace` grant sources let a
  manifest grant the caps; the `make run-store-namespace-smoke` QEMU smoke
  (`demos/store-namespace-smoke/`, `system-store-namespace-smoke.cue`) drives
  `Store` put/has/get/delete and `Namespace` bind/resolve/list/sub with cap
  transfer and asserts two fail-closed rejections (a `Store.get` of an
  unknown hash and a `Namespace.resolve` of an unbound name).
- [x] Implement `Store` as a userspace service over an exported `Endpoint`,
  moving it out of the kernel data path: a two-process provider->consumer demo
  (`demos/store-service/`, `system-userspace-store-smoke.cue`,
  `make run-userspace-store-smoke`) serves `put`/`get`/`has`/`delete` from an
  in-RAM `BTreeMap<[u8;32], Vec<u8>>` -- no kernel `Store` cap in the data path.
  It mirrors the kernel `StoreCap` blob-count bound and publishes a narrower
  4 KiB service-specific inline blob limit because the endpoint-framed request
  must fit in the service receive buffer; the smoke proves the largest accepted
  inline blob and the first rejected over-limit blob. The client uses the stock
  `capos-rt` `StoreClient` over the service endpoint relabelled to
  `STORE_INTERFACE_ID` via the manifest `expectedInterfaceId`. Still RAM, not
  yet a real store.
- [x] Implement a persistent `Store` + `Namespace` userspace service backed by
  a granted `BlockDevice`, moving the durable serve boundary out of the kernel:
  a three-process demo (`demos/storage-persist-service/`,
  `system-storage-persist-service.cue`, `make run-storage-persist-service`)
  serves `Store` (`put`/`get`/`has`/`delete`/`list`) and `Namespace`
  (`resolve`/`bind`/`list`/`sub`) from a single service that owns the on-disk
  `CAPOSUS1` whole-state snapshot over a virtio-blk `BlockDevice` -- no kernel
  `Store`/`Namespace` cap in the data path. The snapshot stores content-addressed
  blob bytes (keys recomputed and re-verified on load) and name->hash bindings;
  a superblock names the live snapshot length, its content hash, and a
  monotonic generation, and every mutation writes the new payload fully into
  the standby of two alternating A/B payload regions (selected by generation
  parity) and FLUSHes it before the single-sector superblock write flips the
  generation, so the previously committed snapshot survives a crash at any
  write boundary. `Namespace.sub` returns a scoped `Namespace` cap by
  pre-minting a bounded pool of `Namespace`-typed service-object facets of the
  service's own namespace endpoint (each a distinct receiver cookie, minted
  through a spawned sub-helper) and transferring one through the IPC result-cap
  path; scoped calls route back to the same endpoint by cookie. The client
  reaches both interfaces through manifest-granted service caps relabelled to
  `STORE_INTERFACE_ID` / `NAMESPACE_INTERFACE_ID`, and the two-boot
  `make run-storage-persist-service` proves the marker and note objects and
  their bindings survive a reboot (the service reloads them before the second
boot writes anything) even after the harness corrupts the standby payload
  region between the boots, simulating a commit interrupted mid payload write
  (torn-commit recovery proof).
- [x] Serve the result-cap-returning `Directory` + `File` filesystem
  interfaces from userspace: a three-process demo
  (`demos/storage-fs-service/`, `system-storage-fs-service.cue`,
  `make run-userspace-directory-file-smoke`) runs a service (the init process)
  that owns an in-memory filesystem tree and serves `Directory`
  (`open`/`list`/`mkdir`/`remove`/`sub`/`create`/`rename`) and `File`
  (`read`/`write`/`stat`/`truncate`/`sync`/`close`) over a single endpoint,
  dispatched by the call's stamped interface id and receiver-cookie badge -- no
  kernel `readonly_fs`/`writable_fs`/`installable_image` cap in the data path.
  `Directory.open` (`-> File`), `mkdir`/`sub` (`-> Directory`) transfer result
  caps from bounded pools of pre-minted typed service-object facets of the same
  endpoint (minted through the spawned sub-helper, each a distinct cookie). The
  client reaches the tree through a writable root (a `Directory` client-endpoint
  facet) and a read-only root (a `Directory` service-object facet over the same
  tree); read-only attenuation is structural -- the read-only root and the
  read-only `File` handles it returns fail mutation methods closed by routing on
  the cookie, not a rights flag. The proof drives the positive surface plus
  fail-closed cases (closed/stale `File` handle, path traversal via `..`/`/`,
  absent paths, read-only mutation, oversize writes). The existing kernel-backed
  WASI filesystem smoke (`make run-wasi-fs`) stays green as the explicitly
  fixture-labeled kernel `Directory`/`File` path. The follow-up cleanup retiring
  the kernel storage cap backers as production routes has landed -- see
  [Kernel Storage Cap Backers Are Fixtures](#kernel-storage-cap-backers-are-fixtures)
  below.
- [x] Backed by RAM (no disk driver yet, data lost on reboot)
- [x] Backed by a real store (persistent userspace service over `BlockDevice`,
  survives reboot)
- [x] Services can store and retrieve capnp objects at runtime
- [x] Demonstrate the naming model with a userspace `Namespace` service
- [x] `Namespace.sub()` returns new caps via IPC cap transfer
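
The A/B commit ordering described for the persistent `Store` + `Namespace` service above can be sketched in memory as follows. This is not the `CAPOSUS1` on-disk format: the `Disk` and `Superblock` types are hypothetical, the FNV-1a checksum stands in for the real content hash, and the flush is only a comment.

```rust
/// Stand-in checksum (FNV-1a, 64-bit); the real service uses a content hash.
fn checksum(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

pub struct Superblock {
    pub generation: u64,
    pub len: usize,
    pub checksum: u64,
}

/// Two alternating payload regions plus one superblock.
pub struct Disk {
    pub regions: [Vec<u8>; 2],
    pub superblock: Superblock,
}

/// The live region is selected by generation parity; the standby is the other.
pub fn standby_region(generation: u64) -> usize {
    ((generation + 1) % 2) as usize
}

impl Disk {
    pub fn new() -> Self {
        Disk {
            regions: [Vec::new(), Vec::new()],
            superblock: Superblock { generation: 0, len: 0, checksum: checksum(&[]) },
        }
    }

    /// Commit a new snapshot: payload first, superblock flip last.
    pub fn commit(&mut self, payload: &[u8]) {
        let next = self.superblock.generation + 1;
        // 1. Write the whole payload into the standby region.
        self.regions[standby_region(self.superblock.generation)] = payload.to_vec();
        // 2. A real device would FLUSH here, before touching the superblock.
        // 3. A single superblock write makes the new snapshot live.
        self.superblock = Superblock {
            generation: next,
            len: payload.len(),
            checksum: checksum(payload),
        };
    }

    /// Load the live snapshot, verifying length and checksum.
    pub fn load(&self) -> Option<&[u8]> {
        let live = &self.regions[(self.superblock.generation % 2) as usize];
        let bytes = live.get(..self.superblock.len)?;
        (checksum(bytes) == self.superblock.checksum).then_some(bytes)
    }
}
```

Because the superblock is written last, a crash before step 3 leaves the old generation in place, and its live region was never touched by the interrupted commit.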

### Kernel Storage Cap Backers Are Fixtures

The kernel `Store`, `Namespace`, `File`, `Directory`, `readOnlyFsRoot`,
`persistentStore`, and `writableFsRoot` grant sources were the proof paths that
landed the typed storage interfaces. Now that the userspace services above own
the production serve boundary -- the RAM `Store` service
(`demos/store-service`, `make run-userspace-store-smoke`), the disk-backed
`Store` + `Namespace` service (`demos/storage-persist-service`,
`make run-storage-persist-service`), and the `Directory` + `File` filesystem
service (`demos/storage-fs-service`, `make run-userspace-directory-file-smoke`)
-- the kernel backers are explicitly **proof/fixture surface, not production
storage routes**. Production storage is userspace-served; no production manifest
grants ownership of kernel-owned storage state (the default `system.cue` boot
grants none).

The kernel grant sources are gated accordingly:

- The RAM-backed `file` / `directory` / `store` / `namespace` sources are gated
  behind the `qemu` feature in both the bootstrap cap-table builder
  (`kernel/src/cap/mod.rs`) and the `ProcessSpawner` spawn-grant path
  (`kernel/src/cap/process_spawner.rs`). The default non-`qemu` production kernel
  fails closed on these sources. They remain available only as the in-RAM
  pseudo-fs backing for the qemu interface proofs (`make run-store-namespace-smoke`,
  `make run-file-server-smoke`, `make run-directory-server-smoke`,
  `make run-storage-naming`) and for the POSIX/WASI/dash adapter smokes (`make
  run-posix-*`, `make run-wasi-fs`).
- The disk-backed virtio `read_only_fs_root` / `persistent_store` /
  `writable_fs_root` sources (`kernel/src/cap/readonly_fs.rs`,
  `persistent_store.rs`, `writable_fs.rs`) were already gated behind `qemu`
  (with `storage_fat_read` / `cloud_*_over_nvme_proof` variants for the FAT and
  NVMe proof arms) and fail closed in the default production kernel. They back
  the storage regression proofs `make run-storage-fs`, `make run-storage-persist`,
  and `make run-storage-writable` (plus the FAT and NVMe proof targets), which
  stay green as explicitly fixture-labeled kernel paths.

In short: the kernel keeps these backers only as named qemu/cloud-proof
fixtures; a default production build has no kernel storage grant route, so the
typed storage interfaces are served from userspace.
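
The fail-closed gating can be pictured with a small sketch. The `GrantSource` enum, the `Console` stand-in, and `resolve_grant` are hypothetical names, and the `qemu_enabled` parameter only models the compile-time `qemu` feature so that both outcomes can be exercised in one build; kernel code would more plausibly use `#[cfg(feature = "qemu")]`.

```rust
/// Hypothetical names for the kernel grant sources listed above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum GrantSource {
    File,
    Directory,
    Store,
    Namespace,
    ReadOnlyFsRoot,
    PersistentStore,
    WritableFsRoot,
    /// Stand-in for any non-storage source that is not gated.
    Console,
}

#[derive(Debug, PartialEq)]
enum GrantError {
    /// The source exists only as a proof fixture in this build.
    FixtureOnly(GrantSource),
}

/// True for the kernel storage backers that are fixtures, not production routes.
fn is_storage_fixture(src: GrantSource) -> bool {
    !matches!(src, GrantSource::Console)
}

/// `qemu_enabled` models the compile-time `qemu` feature (an assumption of
/// this sketch; the kernel gates at compile time, not with a runtime flag).
fn resolve_grant(src: GrantSource, qemu_enabled: bool) -> Result<GrantSource, GrantError> {
    if is_storage_fixture(src) && !qemu_enabled {
        // Default production build: no kernel storage grant route.
        return Err(GrantError::FixtureOnly(src));
    }
    Ok(src)
}
```

With `qemu_enabled == false` every storage source is rejected before any backer is constructed, which is the property the gating is meant to give a default production kernel.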

### Phase 4: BlockDevice Drivers and Filesystem (after virtio infrastructure)

- virtio-blk driver (userspace, reuses the virtqueue infrastructure from the
  networking smoke test)
- `BlockDevice` trait implementation
- FAT filesystem service: wraps BlockDevice, exports Directory/File caps
- SharedBuffer integration for bulk reads (depends on Stage 6 MemoryObject)
- [x] Store service uses BlockDevice for persistence (the persistent userspace
  `Store` + `Namespace` service above, `make run-storage-persist-service`)
- [x] System state survives reboot via the persistent userspace store
  (`make run-storage-persist-service`); manifest restore hints remain future work
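
As a rough illustration of a store layered on a block device, the sketch below defines a hypothetical `BlockDevice` trait and a one-record-per-block store. The trait shape, the 512-byte block size, the `[len: u16 LE][payload]` layout, and the `RamDisk` stand-in are all assumptions; the real interface is defined by the capnp schema and the real service uses its own on-disk format.

```rust
const BLOCK_SIZE: usize = 512; // assumed block size for this sketch

#[derive(Debug, PartialEq)]
enum BlockError {
    OutOfRange,
    TooLarge,
    Corrupt,
}

/// Hypothetical minimal block-device trait.
trait BlockDevice {
    fn read_block(&self, lba: usize, buf: &mut [u8; BLOCK_SIZE]) -> Result<(), BlockError>;
    fn write_block(&mut self, lba: usize, buf: &[u8; BLOCK_SIZE]) -> Result<(), BlockError>;
}

/// Stand-in for a disk whose contents outlive the store service.
struct RamDisk {
    blocks: Vec<[u8; BLOCK_SIZE]>,
}

impl RamDisk {
    fn new(block_count: usize) -> Self {
        RamDisk { blocks: vec![[0u8; BLOCK_SIZE]; block_count] }
    }
}

impl BlockDevice for RamDisk {
    fn read_block(&self, lba: usize, buf: &mut [u8; BLOCK_SIZE]) -> Result<(), BlockError> {
        let block = self.blocks.get(lba).ok_or(BlockError::OutOfRange)?;
        *buf = *block;
        Ok(())
    }

    fn write_block(&mut self, lba: usize, buf: &[u8; BLOCK_SIZE]) -> Result<(), BlockError> {
        let block = self.blocks.get_mut(lba).ok_or(BlockError::OutOfRange)?;
        *block = *buf;
        Ok(())
    }
}

/// One record per block: [len: u16 little-endian][payload].
struct BlockStore<D: BlockDevice> {
    dev: D,
}

impl<D: BlockDevice> BlockStore<D> {
    fn put(&mut self, slot: usize, data: &[u8]) -> Result<(), BlockError> {
        if data.len() > BLOCK_SIZE - 2 {
            return Err(BlockError::TooLarge);
        }
        let mut buf = [0u8; BLOCK_SIZE];
        buf[..2].copy_from_slice(&(data.len() as u16).to_le_bytes());
        buf[2..2 + data.len()].copy_from_slice(data);
        self.dev.write_block(slot, &buf)
    }

    fn get(&self, slot: usize) -> Result<Vec<u8>, BlockError> {
        let mut buf = [0u8; BLOCK_SIZE];
        self.dev.read_block(slot, &mut buf)?;
        let len = u16::from_le_bytes([buf[0], buf[1]]) as usize;
        if len > BLOCK_SIZE - 2 {
            return Err(BlockError::Corrupt);
        }
        Ok(buf[2..2 + len].to_vec())
    }

    /// Give the device back, modelling the service exiting at shutdown.
    fn into_device(self) -> D {
        self.dev
    }
}
```

Dropping the `BlockStore` with `into_device()` and wrapping the same device in a new `BlockStore` models a reboot: the store holds no state of its own, so everything it can return afterwards comes from the device.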

### Phase 5: Network Store (after networking)

- Store service can replicate to or fetch from a remote store
- Capability references transparently span machines
- Directory cap backed by a remote filesystem (9P-style)
- Managed cloud bridges can back selected Store/Namespace or app-specific
  SaveStore capabilities without changing caller authority. First target:
  GCP-backed profile/ledger/snapshot storage for the adventure demo, with local
  fake-cloud tests and no provider credentials in ordinary clients.
- User-owned browser transport can store encrypted save capsules in Google Drive
  `appDataFolder` or Firebase user documents. This is for private backup/sync,
  not authoritative shared state.

## Relationship to Other Proposals

- **Networking proposal** — the NIC driver and net stack are services
  described in the manifest, not hardcoded. The store could be backed by
  network storage once networking works. A remote Directory cap (9P over
  capnp) reuses the same File/Directory interfaces.
- **Service architecture proposal** — the manifest replaces code-as-config
  for init. ProcessSpawner, supervision, and cap export work as described
  there, but driven by manifest data instead of compiled Rust code.
  IPC Endpoints are the mechanism for service export.
- **Capability model** — IPC cap transfer (Endpoint + RETURN SQE) is the
  mechanism that makes `open()` and `resolve()` work. SharedBuffer is the
  bulk data path that makes file I/O practical. Both are tracked in
  `docs/roadmap.md` Stage 6.
- **[POSIX Adapter](posix-adapter-proposal.md)** — Phase P1.4 (vendored
  `dash` port) consumes the `Namespace` + `File` + `Directory` cap surface
  defined here; that surface landed as RAM-backed kernel `CapObject`s in
  Phase 3 slices 1-3 and is the v0 backing for the dash smoke's read-only
  in-rodata pseudo-fs. P1.3 (recording-shim pipe + fork-for-exec) has
  already landed without storage caps, so P1.4 is the next adapter
  consumer. The POSIX path resolver,
  `open`/`read`/`write`/`stat`/`unlink`, `/etc` and `/lib` preopen scoping,
  and the dash port itself all sit on this proposal's Phase 2/3 schema.
- **[WASI Host Adapter](wasi-host-adapter-proposal.md)** — Phase W.5 (Preview
  1 filesystem: `fd_read`/`fd_write`/`fd_seek`/`fd_pread`/`fd_pwrite`/
  `fd_filestat_get`/`path_open`/`path_filestat_get`/`path_unlink_file`)
  consumes the same cap shape and is unblocked from the cap-surface side
  (Phase 3 slices 1-3 landed the RAM-backed `Directory` / `Namespace` /
  `File` caps). Preopened-dir fds map to `Namespace` caps from the
  manifest; `path_open` resolves through that namespace's
  `Store` / `File` capability. Phases W.2/W.3/W.4 (stdout, argv-grant,
  `random_get`) shipped without storage caps, so W.5 is the next adapter
  consumer alongside POSIX P1.4.
- **[Userspace Binaries](userspace-binaries-proposal.md) Parts 4 and 5** —
  the POSIX adapter (Part 4) and the WASI host adapter (Part 5) both describe
  their filesystem stories as translations onto this proposal's `Namespace` /
  `Directory` / `File` / `Store` surface. Part 4 sketches the
  `Namespace`-rooted POSIX fd table and the `Namespace + Store -> file I/O`
  translation; Part 5 maps each preopened-dir fd to a `Namespace` cap.
- **Adventure game proposal** — profile, expedition, ledger, and content
  persistence use application-level save records through Store/Namespace or an
  app-specific cloud bridge. The game should not persist by snapshotting a live
  process or exposing provider credentials to clients.
- **Cryptography/key-management and volume-encryption proposals** — the
  Cloud KMS path uses envelope encryption. KMS wraps data-encryption keys
  (DEKs) under key-encryption keys (KEKs); capOS services use local
  `SymmetricKey` authority for plaintext operations.
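
The preopen mapping mentioned for the WASI and POSIX adapters can be sketched as a table from directory fd to a granted namespace. This is only an illustration under stated assumptions: `Namespace` is modelled as a plain map from name to bytes, `Preopens` and `OpenError` are hypothetical names, and the adapters' real resolution goes through capability calls rather than a local lookup.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a `Namespace` cap: name -> object bytes.
type Namespace = HashMap<String, Vec<u8>>;

#[derive(Debug, PartialEq)]
enum OpenError {
    BadFd,
    Escapes,
    NotFound,
}

/// Preopen table: directory fd -> the namespace granted for it.
struct Preopens {
    dirs: HashMap<u32, Namespace>,
}

impl Preopens {
    /// Resolve `path` strictly inside the namespace bound to `dirfd`.
    /// There is no global root to fall back to, so absolute paths and
    /// `..` components are rejected rather than reinterpreted.
    fn path_open(&self, dirfd: u32, path: &str) -> Result<&[u8], OpenError> {
        let ns = self.dirs.get(&dirfd).ok_or(OpenError::BadFd)?;
        if path.starts_with('/') || path.split('/').any(|c| c == "..") {
            return Err(OpenError::Escapes);
        }
        ns.get(path).map(|v| v.as_slice()).ok_or(OpenError::NotFound)
    }
}
```

A guest that was granted only fd `3` can name nothing outside that namespace: an unknown fd, an absolute path, and a `..` component all fail before any lookup happens.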

## Open Questions

1. **Manifest validation.** How much can the build tool verify statically?
   Cap export names depend on runtime behavior of services. Should services
   declare their exports in their own metadata (like a package manifest)?

2. **Schema evolution.** When a service's capnp interface changes, stored
   objects referencing the old schema need migration. Cap'n Proto has
   backwards-compatible schema evolution, but breaking changes need a story.

3. **Garbage collection.** Content-addressed store accumulates unreferenced
   objects. Who GCs? A separate service with `Store` read + delete authority?
   Reference counting in the namespace layer?

4. **Large objects.** Storing multi-megabyte binaries as single capnp `Data`
   fields is wasteful (a capnp `Data` field is allocated contiguously within
   one message segment). SharedBuffer partially
   addresses this for I/O, but the Store's `put`/`get` interface still takes
   `Data`. Options: chunked storage (Merkle tree of hashes), a streaming
   `Blob` interface, or SharedBuffer-aware Store methods.

5. **Trust model for the manifest.** The boot manifest has full authority
   to define the system. Who signs it? How do you prevent a tampered ISO
   from granting excessive caps? Secure boot integration?

6. **File locking and concurrent access.** Multiple processes opening the
   same file through the same filesystem service need coordination.
   Options: mandatory locking in the filesystem service (rejects conflicting
   opens), advisory locking via a separate Lock capability, or
   single-writer enforcement at the Directory level (open with exclusive
   flag).

7. **RETURN+RECV atomicity.** When a server posts a RETURN SQE followed by
   a RECV SQE, there must be no window where a client call can arrive but the
   server isn't listening. SQE LINK chaining (RETURN → RECV) should provide
   this atomicity — the kernel processes both SQEs as a unit.
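
For open question 4, the chunked-storage option can be sketched as follows. This is a simplification under stated assumptions: it uses a flat list of chunk hashes rather than a full Merkle tree, `ChunkStore`, `put_large`, and `get_large` are hypothetical names, and the standard library's `DefaultHasher` is only a non-cryptographic stand-in for whatever content hash the Store uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in content hash. NOT cryptographic: a real content-addressed
/// store would need a collision-resistant hash.
fn content_hash(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

/// Content-addressed chunk store: hash -> chunk bytes.
#[derive(Default)]
struct ChunkStore {
    chunks: HashMap<u64, Vec<u8>>,
}

impl ChunkStore {
    /// Split `data` into fixed-size chunks, store each chunk once, and
    /// return the ordered list of chunk hashes (the object's manifest).
    fn put_large(&mut self, data: &[u8], chunk_size: usize) -> Vec<u64> {
        assert!(chunk_size > 0, "chunk_size must be nonzero");
        data.chunks(chunk_size)
            .map(|c| {
                let h = content_hash(c);
                // Identical chunks share one entry.
                self.chunks.entry(h).or_insert_with(|| c.to_vec());
                h
            })
            .collect()
    }

    /// Reassemble an object from its manifest; `None` if any chunk is missing.
    fn get_large(&self, manifest: &[u64]) -> Option<Vec<u8>> {
        let mut out = Vec::new();
        for h in manifest {
            out.extend_from_slice(self.chunks.get(h)?);
        }
        Some(out)
    }
}
```

No single `put` or `get` in this shape has to carry more than `chunk_size` bytes, and repeated chunks are stored once. The cost is that the manifest becomes another stored object, which feeds back into the garbage-collection question above.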
