Proposal: Storage, Naming, and Persistence

What replaces the filesystem in a capability OS where Cap’n Proto is the universal wire format.

The Problem with Filesystems

In Unix, the filesystem is the universal namespace. Everything is a path: /dev/sda, /etc/config, /proc/self/fd/3, /run/dbus/system_bus_socket. Paths are ambient authority — any process can open /etc/passwd if the permission bits allow. The filesystem conflates naming, access control, persistence, and device abstraction into one mechanism.

capOS has capabilities instead of paths. Access control is structural (you can only use what you were granted), not advisory (permission bits checked at open time). This means:

  • No global namespace needed — each process sees only its granted caps
  • No path-based access control — the cap IS the access
  • No distinction between “file”, “device”, “socket” — everything is a typed capability interface

A traditional VFS would reintroduce ambient authority through the back door. Instead, capOS needs a storage and naming model native to capabilities and Cap’n Proto.

Core Insight: Cap’n Proto Everywhere

Cap’n Proto is already used in capOS for:

  • Interface definitions: .capnp schemas define capability contracts
  • IPC messages: capability invocations are capnp messages
  • Serialization: capnp wire format crosses process boundaries

If we extend this to storage, then:

  • Stored objects are capnp messages
  • Configuration is capnp structs
  • Binary images are capnp-wrapped blobs
  • The boot manifest is a capnp message describing the initial capability graph

No format conversion anywhere. The same tools (schema compiler, serializer, validator) work for IPC, storage, config, and network transfer.

Architecture

Three Layers

Target architecture after the manifest executor and process-spawner work:

Boot Image (read-only, baked into ISO)
  │
  │  capnp-encoded manifest + binaries
  │
  v
Kernel (creates initial caps from manifest)
  │
  │  grants caps to init
  │
  v
Init (builds live capability graph)
  │
  ├──> Filesystem services (FAT, ext4 — wrap BlockDevice as Directory/File)
  │
  ├──> Store service (capability-native content-addressed storage)
  │      backed by: virtio-blk, RAM, or network
  │
  └──> All other services (receive Directory, Store, or Namespace caps)

Layer 1: Boot Image

The boot image (ISO/disk) contains a capnp-encoded system manifest loaded as a Limine module alongside the kernel. The manifest describes:

struct SystemManifest {
    # Manifest schema version, validated before other fields
    schemaVersion @0 :UInt32;
    # Binaries available at boot, keyed by name
    binaries @1 :List(NamedBlob);
    # Init's config blob: first-process metadata plus service graph
    initConfig @2 :CueValue;
    # Kernel boot parameters
    kernelParams @3 :SystemConfig;
}

struct NamedBlob {
    name @0 :Text;
    data @1 :Data;
}

struct CueValue {
    union {
        null @0 :Void;
        boolean @1 :Bool;
        intValue @2 :Int64;
        uintValue @3 :UInt64;
        text @4 :Text;
        bytes @5 :Data;
        list @6 :List(CueValue);
        fields @7 :List(CueField);
    }
}

struct CueField {
    name @0 :Text;
    value @1 :CueValue;
}

Capability source identity is already structured in the bootstrap manifest, so source selection does not depend on parsing authority strings:

{
    name:                "client"
    expectedInterfaceId: 0xacf0c15a7b2e0041
    source: service: {
        service: "endpoint-server"
        export:  "client"
    }
}

Kernel and service source objects inside initConfig select the authority to grant. The expectedInterfaceId field carries the generated Cap’n Proto interface TYPE_ID and only checks that the granted object speaks the expected schema. It cannot replace source identity: many different objects may expose the same interface while representing different authority.
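
The split between source identity and the interface check can be sketched in a few lines of Rust. This is a minimal illustration, not the capos-config implementation: the Export record, GrantError, and resolve_service_grant are hypothetical names. The structured source picks the object, and expectedInterfaceId is applied only afterwards as a schema check.

```rust
/// One capability exported by a running service (hypothetical record shape).
pub struct Export {
    pub service: &'static str,
    pub export: &'static str,
    pub interface_id: u64,
    pub cap_id: u32,
}

#[derive(Debug, PartialEq)]
pub enum GrantError {
    UnknownSource,
    InterfaceMismatch { expected: u64, actual: u64 },
}

/// Select the granted object by structured source identity first, then
/// verify that it speaks the expected schema. The interface id never
/// participates in choosing which object is granted.
pub fn resolve_service_grant(
    exports: &[Export],
    service: &str,
    export: &str,
    expected_interface_id: u64,
) -> Result<u32, GrantError> {
    let found = exports
        .iter()
        .find(|e| e.service == service && e.export == export)
        .ok_or(GrantError::UnknownSource)?;
    if found.interface_id != expected_interface_id {
        return Err(GrantError::InterfaceMismatch {
            expected: expected_interface_id,
            actual: found.interface_id,
        });
    }
    Ok(found.cap_id)
}
```

Two services exporting the same interface resolve to different caps here, which is why the interface id alone cannot stand in for source identity.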

The build system (Makefile) generates this manifest from a human-authored description and packs it into the ISO as manifest.bin. Current code embeds every SystemManifest.binaries entry into that manifest as NamedBlob data, including the release-built init and smoke-demo ELFs. The kernel now boots only initConfig.init; focused init-executor manifests expose the manifest to the separate init binary as a read-only BootPackage capability, while default shell-led manifests boot capos-shell directly without a BootPackage executor. Remaining cleanup is to narrow the long-term boot package shape after the single-init split.

Using a CueValue tree instead of AnyPointer keeps the manifest directly decodable in no_std userspace without depending on Cap’n Proto reflection.
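
To show why a plain value tree is easy to consume without reflection, here is a hedged Rust mirror of the CueValue union. The enum, its accessor names, and sample_config are illustrative assumptions rather than the generated capnp reader API; the sketch uses std for brevity, where a no_std build would take String and Vec from alloc.

```rust
/// Hypothetical Rust mirror of the CueValue schema union.
#[derive(Debug, Clone, PartialEq)]
pub enum CueValue {
    Null,
    Boolean(bool),
    Int(i64),
    Uint(u64),
    Text(String),
    Bytes(Vec<u8>),
    List(Vec<CueValue>),
    Fields(Vec<(String, CueValue)>),
}

impl CueValue {
    /// Look up a named field; non-struct values have no fields.
    pub fn field(&self, name: &str) -> Option<&CueValue> {
        match self {
            CueValue::Fields(fields) => fields
                .iter()
                .find(|(key, _)| key.as_str() == name)
                .map(|(_, value)| value),
            _ => None,
        }
    }

    /// Return the text payload, if this value is text.
    pub fn as_str(&self) -> Option<&str> {
        match self {
            CueValue::Text(text) => Some(text.as_str()),
            _ => None,
        }
    }
}

/// A tiny tree shaped like initConfig.init (field names are illustrative).
pub fn sample_config() -> CueValue {
    CueValue::Fields(vec![(
        "init".to_string(),
        CueValue::Fields(vec![(
            "binary".to_string(),
            CueValue::Text("init".to_string()),
        )]),
    )])
}
```

Walking the tree is ordinary pattern matching, so the decoder needs the CueValue schema only, not the schema of whatever the config describes.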

Transitional Schema Note

ServiceEntry, CapSource::Service, and ServiceEntry.exports are no longer kernel schema fields. ProcessSpawner, copy/move cap transfer, focused init-owned generic manifest execution, the default standalone-init service graph, focused shell-led login smokes, and the 15.4 initConfig schema split are implemented. The current boot manifest shape is:

struct SystemManifest {
    # Manifest schema version, validated before other fields
    schemaVersion @0 :UInt32;
    # Binaries available at boot, keyed by name
    binaries @1 :List(NamedBlob);
    # Init's config blob (replaces the service graph)
    initConfig @2 :CueValue;
    # Kernel boot parameters (serial policy, shell MOTD, feature flags)
    kernelParams @3 :SystemConfig;
}

ServiceEntry / CapRef disappeared from the schema and became plain CUE fields inside initConfig.services. Init reads them at runtime and calls ProcessSpawner directly. validate_manifest_graph, validate_bootstrap_cap_sources, and the remaining transitional service-graph schema are no longer kernel bootstrap checks. They remain in capos-config for mkmanifest and the focused init executor while that executor still accepts the transitional service graph. Kernel bootstrap already uses a first-service cap-table builder rather than the old multi-service resolver. See docs/proposals/service-architecture-proposal.md — “Legacy Manifest Fields After Stage 6” for the deprecation plan.

During the current transition, initConfig.init is still per-manifest launch metadata: it selects the single boot process binary and the kernel-sourced caps for that process. initConfig.services, cross-service cap sources, exports, and restart policy are init-owned configuration for focused executor manifests. Focused harnesses that boot a demo as init keep using that first-process cap bundle until those smokes are migrated behind a fixed generic init.

Layer 2: Kernel Bootstrap

Target design for the kernel’s boot role:

  1. Parse the system manifest (read-only capnp message from Limine module).
  2. Hash the embedded binaries for optional measured-boot attestation.
  3. Create kernel-provided capabilities: Console, Timer, DeviceManager, ProcessSpawner, FrameAllocator, VirtualMemory (per-process), and a read-only BootPackage cap exposing SystemManifest.binaries and initConfig.
  4. Spawn init — exactly one userspace process — with that cap bundle.

Current boot has reached the single-init split and the initConfig schema split. system.cue puts the standalone init binary in initConfig.init for the default service-graph process; init reads BootPackage and starts the shell, remote-session CapSet gateway, and resident services from initConfig.services. Focused shell-led manifests such as system-smoke.cue still put capos-shell in initConfig.init for narrow login proofs. Focused init-executor manifests such as system-spawn.cue also put the separate init binary in initConfig.init; that binary reads BootPackage and spawns the focused demo graph from initConfig.services through ProcessSpawner. The unused kernel resolver has been retired. The remaining cleanup is replacing per-manifest init bundles with a fixed generic-init bootstrap ABI.

Layer 3: Init and the Live Capability Graph

Target init reads initConfig from the BootPackage cap and executes it:

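// Target-model sketch. InitError is a placeholder error type, and the
// BTreeMap import assumes an alloc-based no_std userspace.
use alloc::collections::BTreeMap;

fn main(caps: CapSet) -> Result<(), InitError> {
    let spawner = caps.get::<ProcessSpawner>("spawner");
    let boot = caps.get::<BootPackage>("boot");
    let config = boot.init_config()?;  // CueValue

    // Services started so far, keyed by name, so later entries can
    // resolve caps exported by earlier ones.
    let mut running_services = BTreeMap::new();

    // Walk service entries from the config and spawn in dependency order
    for entry in config.field("services")?.iter()? {
        let binary = boot.binary(entry.field("binary")?.as_str()?)?;
        let granted = resolve_caps(entry.field("caps")?, &running_services, &caps);
        let handle = spawner.spawn(binary, granted, entry.field("restart")?.into())?;
        running_services.insert(entry.field("name")?.as_str()?.into(), handle);
    }

    supervisor_loop(&running_services);
    Ok(())
}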

In this target model, init is a generic manifest executor rather than a hardcoded service graph. The system topology is defined in the boot package’s initConfig, not in init’s source code. Changing what services run means rebuilding the boot image with a different config blob, not recompiling init. Manifest graph resolution stops being a kernel concern.

The current transition uses initConfig.services as the service graph; init reads the BootPackage manifest, validates a metadata-only ManifestBootstrapPlan, resolves kernel and service cap sources, records exported caps, spawns children in manifest order, and waits for their ProcessHandles.

Two Storage Models

capOS supports two complementary storage models, both exposed as typed capabilities:

Filesystem Capabilities (Directory, File)

For accessing traditional block-based filesystems (FAT, ext4, ISO9660) and for POSIX compatibility. A filesystem service wraps a BlockDevice and exports Directory/File capabilities.

BlockDevice (raw sectors)
    │
    └──> Filesystem service (FAT, ext4, ...)
              │
              ├──> Directory caps (namespace over files)
              └──> File caps (read/write byte streams)

This model maps naturally to USB flash drives, NVMe partitions, and network-mounted filesystems. The open() and sub() operations return new capabilities via IPC cap transfer (see “IPC and Capability Transfer” below).

Capability-Native Store (Store, Namespace)

For capOS-native data: configuration, service state, content-addressed object storage. A store service wraps a BlockDevice and exports Store/Namespace capabilities.

BlockDevice (raw sectors)
    │
    └──> Store service
              │
              ├──> Store cap (content-addressed put/get/list inventory)
              └──> Namespace caps (mutable name→hash mappings)

Content-addressing provides automatic deduplication, verifiable integrity, and immutable references. Store.list returns the live inventory of content hashes in that Store, so holders that need crash/reboot recovery can rediscover stored content without a separate mutable root pointer. Namespaces add mutable bindings on top when callers need stable names rather than inventory scans.

Bridging the Two Models

The models are composable. An adapter service can bridge between them:

  • FsStore adapter: exposes a Directory tree as a content-addressed Store (hash each file’s contents, directory listings become capnp-encoded objects)
  • StoreFS adapter: exposes Store/Namespace as a Directory tree (each name maps to a File whose contents are the stored object)
  • Import/export: a utility service reads files from a Directory and stores them in a Store, or materializes Store objects as files in a Directory

In every case the adapter is a userspace service holding caps to both subsystems. No kernel mechanism is needed, just capability composition.

File I/O Interfaces

Directory, File, Store, and Namespace caps may be scoped to a user session, guest profile, anonymous request, or service identity, but the cap remains the authority. POSIX ownership metadata is compatibility data inside these services, not a system-wide authorization channel. See User Identity and Policy.

BlockDevice

Raw sector access, served by device drivers (virtio-blk, NVMe, USB mass storage). The driver receives hardware capabilities (MMIO, IRQ, FrameAllocator for DMA) and exports a BlockDevice cap.

interface BlockDevice {
    readBlocks  @0 (startLba :UInt64, count :UInt32) -> (data :Data);
    writeBlocks @1 (startLba :UInt64, count :UInt32, data :Data) -> ();
    info        @2 () -> (blockSize :UInt32, blockCount :UInt64, readOnly :Bool);
    flush       @3 () -> ();
}

For bulk transfers, readBlocks/writeBlocks accept a SharedBuffer capability instead of inline Data (see “Shared Memory for Bulk Data” below). The inline-Data variants work for metadata reads and small operations; the SharedBuffer variants avoid copies for large I/O.
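
As a rough illustration of the inline-Data variants, the following Rust sketch models a RAM-backed device with the same three concerns as the interface: LBA range checks, a read-only bit, and whole-block payloads. RamBlockDevice and BlockError are hypothetical names, and the error cases are assumptions about how a driver might fail closed.

```rust
#[derive(Debug, PartialEq)]
pub enum BlockError {
    OutOfRange,
    ReadOnly,
    BadLength,
}

/// RAM-backed stand-in for a BlockDevice server (illustrative only).
pub struct RamBlockDevice {
    block_size: u32,
    data: Vec<u8>,
    read_only: bool,
}

impl RamBlockDevice {
    pub fn new(block_size: u32, block_count: u64, read_only: bool) -> Self {
        assert!(block_size > 0, "block size must be non-zero");
        let len = block_size as usize * block_count as usize;
        RamBlockDevice { block_size, data: vec![0; len], read_only }
    }

    pub fn block_count(&self) -> u64 {
        (self.data.len() / self.block_size as usize) as u64
    }

    /// Translate (startLba, count) into a byte range, rejecting any
    /// request that overflows or runs past the end of the device.
    fn byte_range(&self, start_lba: u64, count: u32) -> Result<std::ops::Range<usize>, BlockError> {
        let end_lba = start_lba
            .checked_add(count as u64)
            .ok_or(BlockError::OutOfRange)?;
        if end_lba > self.block_count() {
            return Err(BlockError::OutOfRange);
        }
        let bs = self.block_size as usize;
        Ok(start_lba as usize * bs..end_lba as usize * bs)
    }

    pub fn read_blocks(&self, start_lba: u64, count: u32) -> Result<Vec<u8>, BlockError> {
        let range = self.byte_range(start_lba, count)?;
        Ok(self.data[range].to_vec())
    }

    pub fn write_blocks(&mut self, start_lba: u64, count: u32, data: &[u8]) -> Result<(), BlockError> {
        if self.read_only {
            return Err(BlockError::ReadOnly);
        }
        let range = self.byte_range(start_lba, count)?;
        if data.len() != range.len() {
            return Err(BlockError::BadLength);
        }
        self.data[range].copy_from_slice(data);
        Ok(())
    }
}
```

The copy in read_blocks is exactly the cost the SharedBuffer variants are meant to avoid for large transfers.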

File

Byte-stream access to a single file. Served by filesystem services. Created dynamically when a client calls Directory.open() — the filesystem service creates a File CapObject for the opened file and transfers it to the caller via IPC cap transfer.

interface File {
    read     @0 (offset :UInt64, length :UInt32) -> (data :Data);
    write    @1 (offset :UInt64, data :Data) -> (written :UInt32);
    stat     @2 () -> (size :UInt64, created :UInt64, modified :UInt64);
    truncate @3 (length :UInt64) -> ();
    sync     @4 () -> ();
    close    @5 () -> ();
}

close releases the server-side state for this file (open cluster chain cache, dirty buffers). The kernel-side CapTable entry is removed by the system transport via CAP_OP_RELEASE when the local holder releases it; capos-rt owned handles queue local releases on final drop and expose explicit release flushing for ordinary userspace. CapabilityManager is management-only (list(), later grant()); it does not expose a drop() method because ordinary handle lifetime belongs to the transport, not to an application call on the same table that dispatches it.

Attenuation: a read-only File wraps the original and rejects write, truncate, sync calls. An append-only File rejects write at offsets other than the current size.
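
Attenuation by wrapping can be sketched as two small Rust wrappers around a shared trait. The File trait below is a simplified, hypothetical mirror of the schema (no stat or close), and MemFile is an in-memory stand-in for the server-side object.

```rust
#[derive(Debug, PartialEq)]
pub enum FileError {
    ReadOnly,
    AppendOnly,
}

/// Simplified mirror of the File interface (illustrative).
pub trait File {
    fn read(&self, offset: u64, length: u32) -> Result<Vec<u8>, FileError>;
    fn write(&mut self, offset: u64, data: &[u8]) -> Result<u32, FileError>;
    fn truncate(&mut self, length: u64) -> Result<(), FileError>;
    fn sync(&mut self) -> Result<(), FileError>;
    fn size(&self) -> u64;
}

pub struct MemFile {
    pub bytes: Vec<u8>,
}

impl File for MemFile {
    fn read(&self, offset: u64, length: u32) -> Result<Vec<u8>, FileError> {
        let start = (offset as usize).min(self.bytes.len());
        let end = start.saturating_add(length as usize).min(self.bytes.len());
        Ok(self.bytes[start..end].to_vec())
    }
    fn write(&mut self, offset: u64, data: &[u8]) -> Result<u32, FileError> {
        let start = offset as usize;
        let end = start + data.len();
        if self.bytes.len() < end {
            self.bytes.resize(end, 0);
        }
        self.bytes[start..end].copy_from_slice(data);
        Ok(data.len() as u32)
    }
    fn truncate(&mut self, length: u64) -> Result<(), FileError> {
        self.bytes.resize(length as usize, 0);
        Ok(())
    }
    fn sync(&mut self) -> Result<(), FileError> {
        Ok(())
    }
    fn size(&self) -> u64 {
        self.bytes.len() as u64
    }
}

/// Read-only attenuation: forwards reads, rejects write, truncate, sync.
pub struct ReadOnlyFile<F: File>(pub F);

impl<F: File> File for ReadOnlyFile<F> {
    fn read(&self, offset: u64, length: u32) -> Result<Vec<u8>, FileError> {
        self.0.read(offset, length)
    }
    fn write(&mut self, _offset: u64, _data: &[u8]) -> Result<u32, FileError> {
        Err(FileError::ReadOnly)
    }
    fn truncate(&mut self, _length: u64) -> Result<(), FileError> {
        Err(FileError::ReadOnly)
    }
    fn sync(&mut self) -> Result<(), FileError> {
        Err(FileError::ReadOnly)
    }
    fn size(&self) -> u64 {
        self.0.size()
    }
}

/// Append-only attenuation: writes must start at the current size.
/// Rejecting truncate here is an assumption of this sketch.
pub struct AppendOnlyFile<F: File>(pub F);

impl<F: File> File for AppendOnlyFile<F> {
    fn read(&self, offset: u64, length: u32) -> Result<Vec<u8>, FileError> {
        self.0.read(offset, length)
    }
    fn write(&mut self, offset: u64, data: &[u8]) -> Result<u32, FileError> {
        if offset != self.0.size() {
            return Err(FileError::AppendOnly);
        }
        self.0.write(offset, data)
    }
    fn truncate(&mut self, _length: u64) -> Result<(), FileError> {
        Err(FileError::AppendOnly)
    }
    fn sync(&mut self) -> Result<(), FileError> {
        self.0.sync()
    }
    fn size(&self) -> u64 {
        self.0.size()
    }
}
```

Because the wrapper implements the same trait, a holder cannot tell an attenuated File from a full one except by the calls that fail.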

Directory

Namespace over files on a filesystem. Served by filesystem services. open() and sub() return new capabilities via IPC cap transfer.

interface Directory {
    open    @0 (name :Text, flags :UInt32) -> (file :File);
    list    @1 () -> (entries :List(DirEntry));
    mkdir   @2 (name :Text) -> (dir :Directory);
    remove  @3 (name :Text) -> ();
    sub     @4 (name :Text) -> (dir :Directory);
    create  @5 (name :Text) -> ();
    rename  @6 (from :Text, to :Text) -> ();
}

struct DirEntry {
    name  @0 :Text;
    size  @1 :UInt64;
    isDir @2 :Bool;
}

sub() returns a Directory scoped to a subdirectory — the analog of chroot. The caller cannot traverse upward or see the parent directory. open() with create flags creates a new file if it doesn’t exist.

The flags field in open() is a bitmask: CREATE = 1, TRUNCATE = 2, APPEND = 4. No READ/WRITE flags — those are determined by the Directory cap’s attenuation (a read-only Directory returns read-only Files).
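
A small Rust sketch of decoding that bitmask follows. The flag values come from the text above; the OpenFlags struct, decode_open_flags, and the choice to reject unknown bits are assumptions of this example.

```rust
pub const CREATE: u32 = 1;
pub const TRUNCATE: u32 = 2;
pub const APPEND: u32 = 4;
const KNOWN_FLAGS: u32 = CREATE | TRUNCATE | APPEND;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct OpenFlags {
    pub create: bool,
    pub truncate: bool,
    pub append: bool,
}

/// Decode the open() bitmask. Unknown bits are rejected (fail closed),
/// which is an assumption of this sketch rather than documented behavior.
pub fn decode_open_flags(flags: u32) -> Option<OpenFlags> {
    if (flags & !KNOWN_FLAGS) != 0 {
        return None;
    }
    Some(OpenFlags {
        create: (flags & CREATE) != 0,
        truncate: (flags & TRUNCATE) != 0,
        append: (flags & APPEND) != 0,
    })
}

impl OpenFlags {
    /// A plain open carries no mutating intent (flags == 0).
    pub fn is_plain_read(&self) -> bool {
        !(self.create || self.truncate || self.append)
    }
}
```

There is deliberately nothing to decode for read versus write access: that decision was made when the Directory cap was attenuated.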

Writable Directory Mutations and the Single-Writer Policy

create @5 makes a new empty file and rename @6 renames an entry within the same parent. Both have additive ordinals so the read-only Directory implementations stay wire-compatible — they simply reject the mutating methods (mkdir/remove/sub/create/rename) fail-closed, the way a read-only File rejects write. Unlike open with CREATE, create fails closed if the name already exists; rename fails closed if the source is absent or the destination already exists, and does not support cross-directory moves.

The first writable filesystem service adopts a fail-closed single-writer policy: a writable filesystem tree admits one writer at a time. The first granted cap to perform a mutation claims the writer slot; a mutation through any other concurrently granted cap fails closed with a typed Failed exception ("writable filesystem rejects a second concurrent writer (single-writer policy)") rather than racing. There is no lease/release lifecycle (the first writer keeps the slot), and list/sub reads are allowed for any holder. This deliberately closes the milestone’s concurrent-writer-policy decision without expanding scope to advisory locks, lock leases, or multi-writer coordination (see Open Question 6).

The implementation (kernel/src/cap/writable_fs.rs, proof make run-storage-writable) is now disk-backed: it mounts a CAPOSWF1 sub-volume (a flat node-record array with parent pointers plus a bump-allocated data region) over the kernel-owned virtio-blk driver, keeps the RAM tree as the working copy, and write-through-commits every directory/file mutation in the order data sector → node-record sector → superblock (the ordering commit point), mirroring the disk-backed Store.

The persistent Store CAPOSST1 sub-volume co-locates on the same disk image (at LBA 0; the filesystem superblock sits at a fixed higher LBA), so filesystem mutations and store object writes/deletes survive a reboot together: make run-storage-writable boots QEMU twice against one combined image, and phase 2 verifies every surviving name, size, content, directory entry, and store object plus the deleted object’s absence.
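
The single-writer policy described above reduces to a small state machine. The following Rust sketch is not the code in kernel/src/cap/writable_fs.rs; WritableTree, FsError, and the flat name set are hypothetical, and it assumes a mutation claims the slot before its other checks run.

```rust
use std::collections::BTreeSet;

pub type CapId = u32;

#[derive(Debug, PartialEq)]
pub enum FsError {
    /// A different cap already holds the writer slot.
    SecondWriter,
    /// create() fails closed when the name already exists.
    Exists,
}

/// Flat stand-in for a writable tree with one filesystem-wide writer slot.
#[derive(Default)]
pub struct WritableTree {
    writer: Option<CapId>,
    names: BTreeSet<String>,
}

impl WritableTree {
    /// The first mutating cap claims the slot and keeps it; there is no release.
    fn claim_writer(&mut self, cap: CapId) -> Result<(), FsError> {
        match self.writer {
            None => {
                self.writer = Some(cap);
                Ok(())
            }
            Some(owner) if owner == cap => Ok(()),
            Some(_) => Err(FsError::SecondWriter),
        }
    }

    /// Mutation: must hold (or claim) the writer slot.
    pub fn create(&mut self, cap: CapId, name: &str) -> Result<(), FsError> {
        self.claim_writer(cap)?;
        if !self.names.insert(name.to_string()) {
            return Err(FsError::Exists);
        }
        Ok(())
    }

    /// Read: allowed for any holder, never touches the writer slot.
    pub fn list(&self, _cap: CapId) -> Vec<String> {
        self.names.iter().cloned().collect()
    }
}
```

A second cap can still list the tree after losing the race; only its mutations are rejected.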

Unclean-shutdown recovery is proven by make run-storage-writable-recovery. A slot becomes live on the next mount only once the superblock’s bumped node_count is observed, so a forced poweroff in the window between a node record’s durable write and that commit leaves an orphan slot the next mount ignores: the interrupted allocation is atomically absent, never a torn or half-live entry. The proof builds the kernel with the proof-only storage_writable_recovery feature, which arms an induced forced poweroff in exactly that window (recovery_crash_after_record); pass 1 commits durable mutations and a Store survivor and then triggers the window (the harness kill -9s QEMU after the kernel marker), and pass 2 re-mounts and verifies recovery to a consistent tree with the committed state intact, the interrupted allocation absent, no torn record, and a usable post-recovery write. The proof is bounded to that single record-vs-commit window under host-page-cache durability (the virtio driver negotiates no VIRTIO_BLK_F_FLUSH, and a kill -9 preserves the host page cache); it proves the superblock-commit ordering invariant, not a general media crash-consistency guarantee against host power loss or a lost write-back cache. The co-located CAPOSST1 Store now has bounded tombstone reclamation through make run-storage-persist; this does not add a new media power-loss guarantee or reclaim writable-file extents.
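
The record-then-commit ordering can be modeled in a few lines of Rust. This is a toy model of the invariant only, not the CAPOSWF1 on-disk format: Disk, allocate, and the crash_after_record switch are hypothetical, and reusing the orphan slot for the next allocation is an assumption.

```rust
/// Toy disk: a node-record array plus the superblock's committed count.
#[derive(Default)]
pub struct Disk {
    slots: Vec<String>,
    node_count: usize,
}

impl Disk {
    /// Allocate a node. Step 1 writes the record into the first slot past
    /// the committed count; step 2 bumps node_count (the commit point).
    /// `crash_after_record` simulates a forced poweroff between the two.
    /// Returns true only if the allocation was committed.
    pub fn allocate(&mut self, name: &str, crash_after_record: bool) -> bool {
        let slot = self.node_count;
        if slot < self.slots.len() {
            // Overwrite an orphan left behind by an interrupted allocation.
            self.slots[slot] = name.to_string();
        } else {
            self.slots.push(name.to_string());
        }
        if crash_after_record {
            return false;
        }
        self.node_count = slot + 1;
        true
    }

    /// Mount sees only slots covered by the committed count, so a record
    /// written without its commit is ignored rather than half-live.
    pub fn mount(&self) -> Vec<String> {
        self.slots[..self.node_count].to_vec()
    }
}
```

The model makes the atomicity argument visible: there is no state in which mount can observe a record that was written but not committed.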

Writable File content paths layer onto the same tree. open with the CREATE/TRUNCATE/APPEND flags (or a write through the returned File) claims the same filesystem-wide writer slot, so file writes obey the single writer policy alongside directory mutations; a plain (flags == 0) open and the read/stat methods are reads allowed for any holder. write @1 overwrites or extends at the supplied offset, zero-filling any gap; a handle opened APPEND lands every write at end-of-file regardless of the offset argument. truncate @3 shrinks (discards the tail) or extends (zero-fills) the file, and close @5 releases only that handle — the file survives in the directory until Directory.remove, which marks the file node so any outstanding File cap fails closed. File content is bounded by MAX_FILE_BYTES (64 KiB) and persists to a bump-allocated disk extent on each mutation; a rewrite that outgrows the current extent allocates a fresh one and leaks the old (file-extent compaction deferred). Because each write/truncate already wrote through the block device (the virtio driver negotiates no VIRTIO_BLK_F_FLUSH, so there is no separate media barrier to issue), sync @4 succeeds as an honest write-side no-op (a read-only File still rejects it). Crash consistency rests on the superblock-commit ordering rather than a media barrier: an interrupted allocation is atomically absent on remount (proven by make run-storage-writable-recovery, above). A post-write media-durability flush against a write-back cache (for host power loss, not the guest-side forced poweroff that proof exercises) remains future hardening, not claimed here.
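
The write and truncate rules above can be sketched against an in-memory buffer. FileContent and ContentError are hypothetical names; MAX_FILE_BYTES is the 64 KiB bound from the text, and returning an error when a mutation would exceed it is an assumption about how the bound is enforced.

```rust
pub const MAX_FILE_BYTES: usize = 64 * 1024;

#[derive(Debug, PartialEq)]
pub enum ContentError {
    TooLarge,
}

/// In-memory stand-in for one open file handle's content.
pub struct FileContent {
    pub bytes: Vec<u8>,
    /// True when the handle was opened with APPEND.
    pub append: bool,
}

impl FileContent {
    /// Overwrite or extend at `offset`, zero-filling any gap. An APPEND
    /// handle ignores `offset` and always lands at end-of-file.
    pub fn write(&mut self, offset: u64, data: &[u8]) -> Result<u32, ContentError> {
        let start = if self.append { self.bytes.len() } else { offset as usize };
        let end = start.checked_add(data.len()).ok_or(ContentError::TooLarge)?;
        if end > MAX_FILE_BYTES {
            return Err(ContentError::TooLarge);
        }
        if end > self.bytes.len() {
            self.bytes.resize(end, 0);
        }
        self.bytes[start..end].copy_from_slice(data);
        Ok(data.len() as u32)
    }

    /// Shrink (discard the tail) or extend (zero-fill) to `length`.
    pub fn truncate(&mut self, length: u64) -> Result<(), ContentError> {
        let length = length as usize;
        if length > MAX_FILE_BYTES {
            return Err(ContentError::TooLarge);
        }
        self.bytes.resize(length, 0);
        Ok(())
    }
}
```

Persisting to a disk extent on each mutation is omitted here; the sketch covers only the content arithmetic.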

Syscall Trace: Reading a File from a FAT USB Drive

Four userspace processes: App, FAT service, USB mass storage, xHCI driver.

With promise pipelining (one submission):

Cap’n Proto promise pipelining lets the App chain dependent calls without waiting for intermediate results. The App submits a single pipelined request: “open this file, then read from the result”:

# Single pipelined submission (SQEs with PIPELINE flag):
#   call 0: dir.open("report.pdf")         → answer_id=200, user_data=100
#   call 1: answer 200 result_cap[0].read(offset=0, len=4096)

cap_submit([
    {cap=2, method=OPEN, answer=200, user_data=100, params={"report.pdf", flags=0}},
    {cap=PIPELINE(answer=200, result_cap=0), method=READ, user_data=101, params={offset:0, length:4096}},
])
  → kernel routes call 0 to FAT service via Endpoint
  → FAT service reads directory entry from BlockDevice
  → FAT service creates FileCapObject, replies with File cap as result cap 0
  → kernel sees pipelined call 1 targeting the File cap from call 0
  → kernel dispatches call 1 to the same FAT service (or direct-invokes
    the new File CapObject if it's a local endpoint)
  → FAT service maps offset → cluster chain → LBA
  → FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
      → USB mass storage → xHCI → hardware → back up
  ← completion: {data: [4096 bytes]}, File cap installed as cap_id=5

One app-to-kernel transition. The kernel resolves the pipeline dependency internally through the sideband CapTransferResult record at index 0; it does not inspect the Cap’n Proto result payload. The App never needs a userspace round trip for the intermediate File cap, though the cap is installed and usable afterward.

This is a core Cap’n Proto feature: by expressing “call method on the not-yet-resolved result of another call,” the client avoids a round-trip for each link in the chain. For deeper chains (e.g., dir.sub("a").sub("b").open("file").read(0, 4096)), the savings compound: one submission instead of four sequential syscalls.

The capability-ring version should follow the Cap’n Proto/CapTP prior-art shape captured in “Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web” and “Spritely, OCapN, and CapTP”: pipelined targets live in answer/result-cap namespaces, not in caller-selected global ids; result-cap metadata stays outside the Cap’n Proto payload; broken answers propagate failure to dependent calls; and answer slots, queued dependent calls, queued bytes, and remote references are charged to bounded resource ledgers. This is design grounding, not an OCapN or Cap’n Web wire-compatibility target.

Without pipelining (two sequential ring submissions):

Without promise pipelining, the App submits two separate CALL SQEs via the ring, blocking on each completion before submitting the next:

# 1. Open file (App holds Directory cap, cap_id=2)
# App writes CALL SQE: {cap=2, method=OPEN, params={"report.pdf", flags=0}}
cap_enter(min_complete=1, timeout=MAX)
  → kernel routes CALL to FAT service via Endpoint
  → FAT service reads directory entry from BlockDevice
  → FAT service creates FileCapObject for this file
  → FAT service posts RETURN SQE with [FileCapObject] in xfer_caps
  → kernel installs File cap in App's table → cap_id=5
  ← App reads CQE: result={file: cap_index=0}, new_caps=[5]

# 2. Read 4096 bytes from offset 0
# App writes CALL SQE: {cap=5, method=READ, params={offset:0, length:4096}}
cap_enter(min_complete=1, timeout=MAX)
  → kernel routes CALL to FAT service
  → FAT service maps offset → cluster chain → LBA
  → FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
      → kernel routes to USB mass storage
      → mass storage submits CALL SQE: {cap=usb_cap, method=BULK_TRANSFER, params={scsi_cmd}}
          → kernel routes to xHCI driver
          → xHCI programs TRBs, waits for interrupt
          ← returns raw sector data
      ← returns sector data
  ← FAT service extracts file bytes, posts RETURN SQE with {data: [4096 bytes]}

This works but costs two round-trips where pipelining needs one. The synchronous path is useful for simple cases and bootstrapping; pipelining is the intended steady-state model.

In both cases, the intermediate IPC hops (FAT → USB mass storage → xHCI) are invisible to the App.

Capability-Native Store

The Store Capability

Once the system is running, persistent storage is provided by a userspace service — the store. It’s backed by a block device (virtio-blk), and exposes a content-addressed object store where objects are capnp messages.

interface Store {
    # Store a capnp message, returns its content hash
    put @0 (data :Data) -> (hash :Data);
    # Retrieve by hash
    get @1 (hash :Data) -> (data :Data);
    # Check existence
    has @2 (hash :Data) -> (exists :Bool);
    # Delete (if caller has authority — see note below)
    delete @3 (hash :Data) -> ();
}

Note on delete: In a content-addressed store, deleting a hash can break references from other namespaces pointing to the same object. delete on the base Store interface is dangerously broad — a StoreAdmin interface (separate from Store) may be more appropriate, with delete restricted to a GC service that can verify no live references exist. Open Question #3 (GC) should be resolved before implementing delete. The attenuation table below lists Store (full) as “Read, write, delete any object” — in practice, most callers should receive a Store attenuated to put/get/has only.

Content-addressed means:

  • Deduplication is automatic (same content = same hash)
  • Integrity is verifiable (hash the data, compare)
  • References between objects are just hashes embedded in capnp messages
  • No mutable paths — “updating a file” means storing a new version and updating the reference
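
The properties in this list can be demonstrated with a small in-memory store. This Rust sketch is illustrative only: MemStore is a hypothetical name, and it uses 64-bit FNV-1a purely as a dependency-free stand-in. A real store would use a cryptographic hash and return it as Data, as in the interface above.

```rust
use std::collections::HashMap;

/// 64-bit FNV-1a. NOT cryptographic; a placeholder so the sketch has no
/// external dependencies.
pub fn content_hash(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for byte in data {
        hash ^= *byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// In-memory content-addressed store keyed by content hash.
#[derive(Default)]
pub struct MemStore {
    objects: HashMap<u64, Vec<u8>>,
}

impl MemStore {
    /// Store the bytes and return their hash. Identical content maps to
    /// the same key, so a repeated put stores nothing new.
    pub fn put(&mut self, data: &[u8]) -> u64 {
        let hash = content_hash(data);
        self.objects.entry(hash).or_insert_with(|| data.to_vec());
        hash
    }

    pub fn get(&self, hash: u64) -> Option<&[u8]> {
        self.objects.get(&hash).map(|v| v.as_slice())
    }

    pub fn has(&self, hash: u64) -> bool {
        self.objects.contains_key(&hash)
    }

    pub fn object_count(&self) -> usize {
        self.objects.len()
    }
}

/// Integrity check: rehash the bytes and compare with the claimed hash.
pub fn verify(hash: u64, data: &[u8]) -> bool {
    content_hash(data) == hash
}
```

An "update" in this model is a second put followed by rebinding whatever reference pointed at the old hash; the old object is untouched.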

Mutable References: Namespaces

A Namespace capability provides mutable name-to-hash mappings on top of the immutable store:

interface Namespace {
    # Resolve a name to a store hash
    resolve @0 (name :Text) -> (hash :Data);
    # Bind a name to a hash (if caller has write authority)
    bind @1 (name :Text, hash :Data) -> ();
    # List names (if caller has list authority)
    list @2 () -> (names :List(Text));
    # Get a sub-namespace (attenuated — restricted to a prefix)
    sub @3 (prefix :Text) -> (ns :Namespace);
}

A Namespace cap scoped to "config/" can only see and modify names under that prefix. This is the analog of a chroot — but structural, not a kernel hack. The sub() method returns a new Namespace cap via IPC cap transfer.
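
Prefix scoping can be sketched as a handle that shares the backing map but always prepends its own prefix. The Rust below is a single-process illustration with hypothetical names; in capOS the sub-namespace would be a separate capability served over IPC, and hashes are shortened to small byte vectors.

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;
use std::rc::Rc;

type Hash = Vec<u8>;

/// A view onto shared name→hash bindings, restricted to `prefix`.
#[derive(Clone)]
pub struct Namespace {
    bindings: Rc<RefCell<BTreeMap<String, Hash>>>,
    prefix: String,
}

impl Namespace {
    pub fn root() -> Self {
        Namespace {
            bindings: Rc::new(RefCell::new(BTreeMap::new())),
            prefix: String::new(),
        }
    }

    /// Every name is interpreted relative to this handle's prefix, so a
    /// holder has no way to spell a name outside its scope.
    fn full_name(&self, name: &str) -> String {
        format!("{}{}", self.prefix, name)
    }

    pub fn resolve(&self, name: &str) -> Option<Hash> {
        self.bindings.borrow().get(&self.full_name(name)).cloned()
    }

    pub fn bind(&self, name: &str, hash: Hash) {
        self.bindings.borrow_mut().insert(self.full_name(name), hash);
    }

    /// List names under this prefix, with the prefix stripped.
    pub fn list(&self) -> Vec<String> {
        self.bindings
            .borrow()
            .keys()
            .filter(|key| key.starts_with(self.prefix.as_str()))
            .map(|key| key[self.prefix.len()..].to_string())
            .collect()
    }

    /// Attenuate: the new handle's prefix extends this one's.
    pub fn sub(&self, prefix: &str) -> Namespace {
        Namespace {
            bindings: Rc::clone(&self.bindings),
            prefix: self.full_name(prefix),
        }
    }
}
```

Names are flat keys in this model, so there is no ".." to resolve: the restriction holds by construction rather than by path sanitization.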

Future: union composition. The research survey recommends extending Namespace with Plan 9-inspired union semantics — a union(other, mode) method that merges two namespaces with before/after/replace ordering. This adds composability without a global mount table. See research survey §6.

IPC and Capability Transfer

Several storage operations return new capabilities: Directory.open() returns a File, Directory.sub() returns a Directory, Namespace.sub() returns a Namespace. This requires dynamic capability management — the kernel must install new capabilities in a process’s CapTable at runtime as part of IPC.

The Capability Ring

All kernel-userspace interaction goes through a shared-memory ring pair (submission queue + completion queue), inspired by io_uring. SQE opcodes map to capnp-rpc Level 1 message types. The ring is allocated per-process at spawn time and mapped into the process’s address space.

Syscall surface: 2 syscalls. New capabilities, operations, and transfer mechanisms are expressed as new SQE opcodes instead of expanding the syscall ABI.

#  Syscall                              Purpose
1  exit(code)                           Terminate current thread; process exits after its last live thread
2  cap_enter(min_complete, timeout_ns)  Process pending SQEs, then wait until enough CQEs exist or the timeout expires

Writing SQEs is syscall-free, but ordinary capability CALLs make progress through cap_enter. Timer polling handles non-CALL ring work and only CALL targets that explicitly opt into interrupt-context dispatch. cap_enter flushes pending SQEs and can block the process until min_complete completions are available or a finite timeout expires. An indefinite wait uses timeout_ns = u64::MAX; timeout_ns = 0 keeps the call non-blocking. A future SQPOLL-style worker can reintroduce a zero-syscall CALL-completion hot path without running arbitrary capability methods from timer interrupt context.

The ring structs and synchronous CALL dispatch are implemented and working. See capos-config/src/ring.rs for the shared ring structs and kernel/src/cap/ring.rs for kernel-side processing.

Ring Layout

One 4 KiB page per process, mapped into both kernel (HHDM) and user space:

┌─────────────────────────┐  offset 0
│ Ring Header              │  SQ/CQ head, tail, mask, flags
├─────────────────────────┤  offset 128
│ SQE Array (16 × 64B)    │  submission queue entries
├─────────────────────────┤  offset 1152
│ CQE Array (32 × 32B)    │  completion queue entries
└─────────────────────────┘

SQ: userspace owns tail (producer), kernel owns head (consumer)
CQ: kernel owns tail (producer), userspace owns head (consumer)
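
The layout arithmetic can be checked with a few constants. The sizes and offsets below come from the diagram; the function names are hypothetical, and the sketch assumes free-running 32-bit head/tail counters reduced with the power-of-two mask, which the header's mask field suggests but the text does not spell out.

```rust
pub const PAGE_SIZE: usize = 4096;
pub const HEADER_SIZE: usize = 128;
pub const SQ_ENTRIES: usize = 16;
pub const SQE_SIZE: usize = 64;
pub const CQ_ENTRIES: usize = 32;
pub const CQE_SIZE: usize = 32;

pub const SQ_OFFSET: usize = HEADER_SIZE;
pub const CQ_OFFSET: usize = SQ_OFFSET + SQ_ENTRIES * SQE_SIZE;
pub const RING_END: usize = CQ_OFFSET + CQ_ENTRIES * CQE_SIZE;

/// Byte offset of the SQE slot for a free-running index (assumed scheme).
pub fn sqe_offset(index: u32) -> usize {
    SQ_OFFSET + (index as usize & (SQ_ENTRIES - 1)) * SQE_SIZE
}

/// Byte offset of the CQE slot for a free-running index (assumed scheme).
pub fn cqe_offset(index: u32) -> usize {
    CQ_OFFSET + (index as usize & (CQ_ENTRIES - 1)) * CQE_SIZE
}

/// Free submission slots as seen by the producer. Wrapping subtraction
/// keeps the count correct when the counters overflow.
pub fn sq_free_slots(head: u32, tail: u32) -> u32 {
    SQ_ENTRIES as u32 - tail.wrapping_sub(head)
}
```

Both arrays end at byte 2176, so the ring uses a little over half of its 4 KiB page.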

SQE Opcodes

Six opcodes handle everything: client calls, server dispatch, capability transfer, pipelining, lifecycle, and timeouts:

Opcode   capnp-rpc analog  Purpose
CALL     Call              Invoke method on a capability
RETURN   Return            Respond to incoming call (server side)
RECV     (implicit)        Wait for incoming calls on Endpoint
RELEASE  Release           Drop a capability reference
FINISH   Finish            Release pipeline answer state
TIMEOUT  -                 Post a CQE after N nanoseconds (io_uring-inspired)

TIMEOUT is an alternative to the timeout_ns argument on cap_enter: it works with zero-syscall polling (kernel fires the CQE on a timer tick) and composes with LINK/DRAIN for deadline-based chains.

SQE flags: PIPELINE (cap_id is a promise reference), LINK (chain to next SQE), MULTISHOT (keep generating CQEs), DRAIN (barrier).

Promise Pipelining

A CALL SQE can target either a concrete CapId or a PromisedAnswer reference (via the PIPELINE flag + pipeline_dep/pipeline_field fields). pipeline_dep names the earlier answer and pipeline_field is a zero-based CapTransferResult record index in that answer’s sideband result-cap list, not a Cap’n Proto schema field. The kernel resolves the dependency chain internally:

SQE[0]: CALL dir.open("report.pdf")        → answer_id=200, user_data=100
SQE[1]: CALL [PIPELINE: dep=200, result_cap=0].read(0, 4096)  → user_data=101

One cap_enter call. The kernel dispatches SQE[0], resolves result cap record 0 from the completion sideband, and dispatches SQE[1] against it without returning to userspace between steps or parsing the result payload.
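
The resolution step can be sketched as a lookup in a per-process answer table. This Rust is a hedged model, not the kernel/src/cap/ring.rs implementation: the Answer enum, the error names, and the integer widths are assumptions. It shows the two properties the text relies on: the target comes from the sideband record index, and a broken answer fails its dependents.

```rust
use std::collections::HashMap;

/// Sideband record describing one transferred result capability.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CapTransferResult {
    pub cap_id: u32,
    pub interface_id: u64,
}

/// State of one answer slot (hypothetical representation).
pub enum Answer {
    Pending,
    Ready(Vec<CapTransferResult>),
    Broken,
}

#[derive(Debug, PartialEq)]
pub enum PipelineError {
    UnknownAnswer,
    NotReady,
    BrokenAnswer,
    NoSuchResultCap,
}

/// Resolve a PIPELINE target: `pipeline_dep` names the earlier answer and
/// `pipeline_field` indexes its sideband result-cap list. The Cap'n Proto
/// result payload is never inspected.
pub fn resolve_pipeline_target(
    answers: &HashMap<u32, Answer>,
    pipeline_dep: u32,
    pipeline_field: u32,
) -> Result<u32, PipelineError> {
    match answers.get(&pipeline_dep) {
        None => Err(PipelineError::UnknownAnswer),
        Some(Answer::Pending) => Err(PipelineError::NotReady),
        Some(Answer::Broken) => Err(PipelineError::BrokenAnswer),
        Some(Answer::Ready(caps)) => caps
            .get(pipeline_field as usize)
            .map(|record| record.cap_id)
            .ok_or(PipelineError::NoSuchResultCap),
    }
}
```

In a real dispatcher NotReady would queue the dependent call rather than fail it; the sketch returns an error only to keep the model stateless.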

The Endpoint Kernel Object

For cross-process IPC, an Endpoint connects client-side proxy caps to a server’s receive loop:

Client's CapTable                                   Server's CapTable
┌─────────────────┐                                 ┌──────────────────┐
│ cap 2: Proxy     │                                 │ cap 0: Endpoint   │
│   → endpoint ────────── Endpoint ◄──── RECV SQE ──│                  │
│   badge: 42      │     (kernel obj)                │                  │
└─────────────────┘                                 └──────────────────┘

The server posts a RECV SQE (with MULTISHOT flag). Incoming calls appear as CQEs with badge, interface_id, method_id, and a kernel-assigned call_id. The server responds by posting a RETURN SQE referencing the call_id.

interface_id is the transported schema ID for the interface being invoked. It should equal the generated TYPE_ID for that capnp interface. cap_id is the authority-bearing table handle; interface_id is only the protocol tag. The target capability entry owns one public interface; method_id selects a method inside that interface, while cap_id identifies the object being invoked. If the same backing state needs another interface, the transport should mint a separate capability entry for that interface rather than letting one handle accept multiple unrelated interface_id values.

Direct-Switch IPC

When a client’s CALL targets a cap served by a blocked server (waiting on RECV), the kernel marks that server as the direct IPC handoff target so the next context-switch path runs the callee before unrelated round-robin work. The current implementation still uses the ordinary saved-context restore path; small-message register transfer remains a future fastpath after measurement. See research survey §2.

Capability Transfer via Ring

Capabilities travel as sideband arrays (CapTransferDescriptor) alongside capnp message bytes:

  • CALL params: params buffer contains the capnp message bytes followed by xfer_cap_count transfer descriptors packed at addr + len, which must be aligned to CAP_TRANSFER_DESCRIPTOR_ALIGNMENT.
  • RETURN results: server result buffers carry the capnp reply bytes and may carry return transfer descriptors on addr + len; the kernel inserts destination capability records in the caller’s result buffer after the normal result bytes. Count is reported in CQE cap_count and those records are written as CapTransferResult { cap_id, interface_id } values at result_addr + result. The requested result buffer (result_len) must be large enough for both normal reply bytes and all appended cap_count records.

xfer_cap_count > 0 with malformed descriptor metadata (bad mode bits, reserved bits, _reserved0, or misalignment) fails closed as CAP_ERR_INVALID_TRANSFER_DESCRIPTOR. Kernels that have not yet enabled transfer handling should return CAP_ERR_TRANSFER_NOT_SUPPORTED for transfer-bearing SQEs.
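The placement and sizing rules above reduce to a little address arithmetic. The following sketch assumes an 8-byte descriptor alignment and a 16-byte `CapTransferResult` record; both values, and the function names, are assumptions made for illustration rather than the real ABI constants.

```rust
/// Assumed value for illustration; the real constant lives in the capOS ABI.
pub const CAP_TRANSFER_DESCRIPTOR_ALIGNMENT: u64 = 8;
/// Assumed size of one CapTransferResult { cap_id, interface_id } record.
pub const CAP_TRANSFER_RESULT_SIZE: u64 = 16;

#[derive(Debug, PartialEq, Eq)]
pub enum TransferError {
    InvalidTransferDescriptor,
}

/// Address of the first transfer descriptor: packed directly after the
/// message bytes at `addr + len`. Returns `None` when nothing is transferred.
pub fn descriptor_base(
    addr: u64,
    len: u64,
    xfer_cap_count: u32,
) -> Result<Option<u64>, TransferError> {
    if xfer_cap_count == 0 {
        return Ok(None);
    }
    let base = addr
        .checked_add(len)
        .ok_or(TransferError::InvalidTransferDescriptor)?;
    if base % CAP_TRANSFER_DESCRIPTOR_ALIGNMENT != 0 {
        // Misaligned descriptor area fails closed.
        return Err(TransferError::InvalidTransferDescriptor);
    }
    Ok(Some(base))
}

/// Bytes the caller's result buffer must hold: the normal reply bytes plus
/// one appended record per returned capability. `None` on overflow.
pub fn required_result_len(reply_len: u64, cap_count: u32) -> Option<u64> {
    u64::from(cap_count)
        .checked_mul(CAP_TRANSFER_RESULT_SIZE)?
        .checked_add(reply_len)
}
```

A caller can use `required_result_len` to size `result_len` before submitting, so a reply that carries capabilities is not rejected for lack of room.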

The capnp wire format’s WirePointerKind::Other encodes capability indices in messages. The sideband arrays map these indices to actual CapIds. The kernel does not parse capnp messages — it transfers a list of caps alongside the opaque message bytes.

Dynamic Capability Management

Every open(), sub(), or resolve() creates and transfers a new capability at runtime. The kernel’s CapTable insert() and remove() are the primitives. Capabilities flow through RETURN SQE sideband arrays (and through the manifest at boot). No separate cap_grant mechanism is needed: authority flow follows the ring’s IPC graph.

The CapTable generation counter handles stale references: when a File cap is closed (slot freed, generation bumps), any cached CapId returns StaleGeneration instead of accidentally hitting a new occupant.
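A generation-checked slot table is a small data structure. The sketch below shows the idea under simplifying assumptions: the `CapId` layout (index plus generation), the error names other than `StaleGeneration`, and the first-free-slot reuse policy are all illustrative, not the kernel's actual definitions.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CapId {
    pub index: u32,
    pub generation: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CapError {
    StaleGeneration,
    InvalidIndex,
}

struct Slot<T> {
    generation: u32,
    value: Option<T>,
}

pub struct CapTable<T> {
    slots: Vec<Slot<T>>,
}

impl<T> CapTable<T> {
    pub fn new() -> Self {
        Self { slots: Vec::new() }
    }

    /// Insert into the first free slot, reusing its current generation.
    pub fn insert(&mut self, value: T) -> CapId {
        if let Some(index) = self.slots.iter().position(|s| s.value.is_none()) {
            let slot = &mut self.slots[index];
            slot.value = Some(value);
            return CapId { index: index as u32, generation: slot.generation };
        }
        self.slots.push(Slot { generation: 0, value: Some(value) });
        CapId { index: (self.slots.len() - 1) as u32, generation: 0 }
    }

    /// Free the slot and bump its generation so old handles go stale.
    pub fn remove(&mut self, id: CapId) -> Result<T, CapError> {
        let slot = self
            .slots
            .get_mut(id.index as usize)
            .ok_or(CapError::InvalidIndex)?;
        if slot.generation != id.generation {
            return Err(CapError::StaleGeneration);
        }
        let value = slot.value.take().ok_or(CapError::StaleGeneration)?;
        slot.generation = slot.generation.wrapping_add(1);
        Ok(value)
    }

    pub fn get(&self, id: CapId) -> Result<&T, CapError> {
        let slot = self
            .slots
            .get(id.index as usize)
            .ok_or(CapError::InvalidIndex)?;
        if slot.generation != id.generation {
            return Err(CapError::StaleGeneration);
        }
        slot.value.as_ref().ok_or(CapError::StaleGeneration)
    }
}
```

After a remove, a new insert may land in the same slot index, but the cached handle from before the remove carries the old generation and is rejected.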

Shared Memory for Bulk Data

Copying file data through capnp Data fields works for metadata and small reads, but is impractical for anything above a few KB. A 1 MB read through a capability CALL copies data four times: device → driver heap → capnp message → kernel buffer → client buffer.

SharedBuffer Capability

SharedBuffer is the service-facing name this proposal uses for bulk-transfer buffers. The implemented kernel/user substrate is MemoryObject: a capability backed by physical pages that can be mapped into multiple address spaces simultaneously, so bulk data moves between processes without being copied.

interface MemoryObject {
    # Size and page count of the backing object.
    info @0 () -> (pageCount :UInt32, sizeBytes :UInt64);
    # Map a page-aligned object range into the caller's address space.
    map @1 (hint :UInt64, offset :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
    # Unmap a caller-local borrowed mapping backed by this object.
    unmap @2 (addr :UInt64, size :UInt64) -> ();
    # Update caller-local page permissions for a borrowed mapping.
    protect @3 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}

The kernel creates MemoryObjects through the existing FrameAllocator capability. Held MemoryObject caps charge the holder’s frame-grant quota; mapped address-space pages are tracked as borrowed pages and keep the same backing alive until unmapped or process teardown. A later SharedBuffer alias or allocator may wrap this ABI for storage/network interfaces, but current code should use MemoryObject directly.
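The schema comment on map requires a page-aligned object range. A sketch of the argument check a caller-side wrapper might run before issuing the call follows; the 4 KiB page size, the rejection of empty ranges, and the names `check_map_range` and `MapError` are assumptions for illustration.

```rust
/// Assumed page size for illustration.
pub const PAGE_SIZE: u64 = 4096;

#[derive(Debug, PartialEq, Eq)]
pub enum MapError {
    Empty,
    Unaligned,
    OutOfRange,
}

/// Validate a requested (offset, size) range against an object of
/// `object_size` bytes and return the number of pages it covers.
pub fn check_map_range(object_size: u64, offset: u64, size: u64) -> Result<u64, MapError> {
    if size == 0 {
        return Err(MapError::Empty);
    }
    if offset % PAGE_SIZE != 0 || size % PAGE_SIZE != 0 {
        return Err(MapError::Unaligned);
    }
    // checked_add guards against wrap-around on hostile inputs.
    let end = offset.checked_add(size).ok_or(MapError::OutOfRange)?;
    if end > object_size {
        return Err(MapError::OutOfRange);
    }
    Ok(size / PAGE_SIZE)
}
```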

File I/O with SharedBuffer

File and BlockDevice interfaces support both inline-Data and SharedBuffer modes:

# Small read (< ~4 KB): inline in capnp message
file.read(offset=0, length=256) → {data: [256 bytes]}

# Large read: caller provides SharedBuffer, server fills it
let buf = frame_alloc.allocContiguous(256);  # 1 MB MemoryObject / SharedBuffer
file.readBuf(offset=0, buf, length=1048576) → {bytesRead: 1048576}
# Data is now in buf's mapped pages — no copy through kernel

Extended File interface with SharedBuffer support:

interface File {
    read      @0 (offset :UInt64, length :UInt32) -> (data :Data);
    write     @1 (offset :UInt64, data :Data) -> (written :UInt32);
    readBuf   @2 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (bytesRead :UInt32);
    writeBuf  @3 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (written :UInt32);
    stat      @4 () -> (size :UInt64, created :UInt64, modified :UInt64);
    truncate  @5 (length :UInt64) -> ();
    sync      @6 () -> ();
    close     @7 () -> ();
}

The readBuf/writeBuf methods accept a SharedBuffer cap, currently a MemoryObject cap transferred via IPC. The server maps the buffer, performs DMA or memory copies into it, then returns. The caller reads directly from the mapped pages.

For BlockDevice, the same pattern applies — the driver maps the SharedBuffer, programs DMA descriptors pointing to its physical pages, and the device writes directly into the shared memory.

When to Use Each Mode

| Scenario | Mechanism | Why |
| --- | --- | --- |
| Reading a 64-byte config value | File.read() inline Data | Copy overhead negligible |
| Reading a 10 MB binary | File.readBuf() SharedBuffer | Avoids 4× copy overhead |
| FAT directory entry (32 bytes) | BlockDevice.readBlocks() inline | Small metadata read |
| Streaming video frames | File.readBuf() + ring of SharedBuffers | Continuous zero-copy |
| Network packet buffers | SharedBuffer ring between NIC driver and net stack | DMA-capable pages |
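A client library could encode this choice as a simple size threshold. The sketch below assumes a 4 KiB cut-over (taken from the "< ~4 KB" note in the earlier example) and a 4 KiB page size; `ReadMode` and `choose_read_mode` are hypothetical names.

```rust
/// Assumed cut-over between inline Data and a shared buffer.
pub const INLINE_READ_LIMIT: u32 = 4096;
/// Assumed page size for illustration.
pub const PAGE_SIZE: u64 = 4096;

#[derive(Debug, PartialEq, Eq)]
pub enum ReadMode {
    /// Use File.read(): the bytes travel inside the capnp reply.
    Inline,
    /// Use File.readBuf() with a buffer of this many pages.
    SharedBuffer { pages: u64 },
}

pub fn choose_read_mode(length: u32) -> ReadMode {
    if length < INLINE_READ_LIMIT {
        ReadMode::Inline
    } else {
        // Round up so a partial trailing page is still covered.
        let pages = (u64::from(length) + PAGE_SIZE - 1) / PAGE_SIZE;
        ReadMode::SharedBuffer { pages }
    }
}
```

For the 1 MB read in the earlier example this yields 256 pages, matching the allocContiguous(256) call.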

Attenuation

Storage services mint restricted capabilities using wrapper CapObjects:

| Capability | Authority |
| --- | --- |
| Directory (full) | Open, list, mkdir, remove, sub |
| Directory (read-only) | Open (returns read-only Files), list, sub only |
| File (full) | Read, write, truncate, sync |
| File (read-only) | Read and stat only |
| File (append-only) | Read, stat, write at end only |
| Store (full) | Read, write, delete any object |
| Store (read-only) | Get and has only |
| Namespace (full) | Resolve, bind, list under prefix |
| Namespace (read-only) | Resolve and list only |
| Blob (single object) | Read one specific hash |
| SharedBuffer (read-only) | Map as read-only (page table: R, no W) |

An application that only needs to read its config gets a read-only Directory scoped to its config path. It can’t write, can’t see other apps’ directories, can’t access the raw BlockDevice.
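The wrapper pattern is easy to see with a reduced File trait. This is a sketch under stated assumptions: the trait below has only read, write, and size, and `RamFile`, `ReadOnly`, and `AppendOnly` are hypothetical types, not the capOS CapObject implementations.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum FileError {
    PermissionDenied,
    OutOfRange,
}

/// Reduced, illustrative File surface.
pub trait File {
    fn read(&self, offset: usize, length: usize) -> Result<Vec<u8>, FileError>;
    fn write(&mut self, offset: usize, data: &[u8]) -> Result<usize, FileError>;
    fn size(&self) -> usize;
}

#[derive(Default)]
pub struct RamFile {
    bytes: Vec<u8>,
}

impl File for RamFile {
    fn read(&self, offset: usize, length: usize) -> Result<Vec<u8>, FileError> {
        if offset > self.bytes.len() {
            return Err(FileError::OutOfRange);
        }
        let end = offset.saturating_add(length).min(self.bytes.len());
        Ok(self.bytes[offset..end].to_vec())
    }
    fn write(&mut self, offset: usize, data: &[u8]) -> Result<usize, FileError> {
        if offset > self.bytes.len() {
            return Err(FileError::OutOfRange);
        }
        let end = offset + data.len();
        if end > self.bytes.len() {
            self.bytes.resize(end, 0);
        }
        self.bytes[offset..end].copy_from_slice(data);
        Ok(data.len())
    }
    fn size(&self) -> usize {
        self.bytes.len()
    }
}

/// Read-only attenuation: every mutation fails closed.
pub struct ReadOnly<F: File>(pub F);

impl<F: File> File for ReadOnly<F> {
    fn read(&self, offset: usize, length: usize) -> Result<Vec<u8>, FileError> {
        self.0.read(offset, length)
    }
    fn write(&mut self, _offset: usize, _data: &[u8]) -> Result<usize, FileError> {
        Err(FileError::PermissionDenied)
    }
    fn size(&self) -> usize {
        self.0.size()
    }
}

/// Append-only attenuation: writes are accepted only at the current end.
pub struct AppendOnly<F: File>(pub F);

impl<F: File> File for AppendOnly<F> {
    fn read(&self, offset: usize, length: usize) -> Result<Vec<u8>, FileError> {
        self.0.read(offset, length)
    }
    fn write(&mut self, offset: usize, data: &[u8]) -> Result<usize, FileError> {
        if offset != self.0.size() {
            return Err(FileError::PermissionDenied);
        }
        self.0.write(offset, data)
    }
    fn size(&self) -> usize {
        self.0.size()
    }
}
```

Because the wrappers implement the same trait, they compose: wrapping an append-only file in `ReadOnly` yields a handle that can only read.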

Naming Without Paths

Traditional OS: process opens /var/lib/myapp/data.db — a global path.

capOS: process receives a Directory or Namespace cap at spawn time, opens "data.db" within it. The process has no idea where on disk this lives. It can’t traverse upward. There is no global root.

# Traditional: global path namespace
/
├── etc/
│   └── myapp/
│       └── config.toml
├── var/
│   └── lib/
│       └── myapp/
│           └── data.db
└── sbin/
    └── myapp

# capOS: per-process capability set (no global namespace)
Process "myapp" sees:
  "config" → Directory(read-only, scoped to myapp's config files)
  "data"   → Directory(read-write, scoped to myapp's data files)
  "state"  → Namespace(read-write, scoped to myapp's store objects)
  "log"    → Console cap
  "api"    → HttpEndpoint cap

The process doesn’t know or care about the backing storage layout. It just uses the capabilities it was granted.
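The "no upward traversal" property follows from the shape of the API: names are single components and a sub-directory handle has no parent pointer. The sketch below illustrates this with an in-memory tree; `Directory`, `Node`, and `OpenError` are hypothetical types, and borrowed references stand in for transferred capabilities.

```rust
use std::collections::BTreeMap;

pub enum Node {
    File(Vec<u8>),
    Dir(Directory),
}

#[derive(Default)]
pub struct Directory {
    entries: BTreeMap<String, Node>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum OpenError {
    InvalidName,
    NotFound,
    NotAFile,
    NotADirectory,
}

/// A name is one component: no separators and no `.` or `..`.
fn valid_name(name: &str) -> bool {
    !name.is_empty() && name != "." && name != ".." && !name.contains('/')
}

impl Directory {
    pub fn insert_file(&mut self, name: &str, bytes: Vec<u8>) -> Result<(), OpenError> {
        if !valid_name(name) {
            return Err(OpenError::InvalidName);
        }
        self.entries.insert(name.to_string(), Node::File(bytes));
        Ok(())
    }

    pub fn mkdir(&mut self, name: &str) -> Result<&mut Directory, OpenError> {
        if !valid_name(name) {
            return Err(OpenError::InvalidName);
        }
        let node = self
            .entries
            .entry(name.to_string())
            .or_insert_with(|| Node::Dir(Directory::default()));
        match node {
            Node::Dir(dir) => Ok(dir),
            Node::File(_) => Err(OpenError::NotADirectory),
        }
    }

    pub fn open(&self, name: &str) -> Result<&[u8], OpenError> {
        if !valid_name(name) {
            return Err(OpenError::InvalidName);
        }
        match self.entries.get(name) {
            Some(Node::File(bytes)) => Ok(bytes.as_slice()),
            Some(Node::Dir(_)) => Err(OpenError::NotAFile),
            None => Err(OpenError::NotFound),
        }
    }

    /// The returned handle is scoped to the child; it cannot reach the parent.
    pub fn sub(&self, name: &str) -> Result<&Directory, OpenError> {
        if !valid_name(name) {
            return Err(OpenError::InvalidName);
        }
        match self.entries.get(name) {
            Some(Node::Dir(dir)) => Ok(dir),
            Some(Node::File(_)) => Err(OpenError::NotADirectory),
            None => Err(OpenError::NotFound),
        }
    }
}
```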

Configuration

Build-Time Config (Boot Manifest)

The system manifest is authored at build time. The human-writable source could be any format — TOML, CUE, or even a Makefile target that generates the capnp binary. What matters is that it compiles to a SystemManifest capnp message baked into the ISO.

Example source (TOML, compiled to capnp by a build tool):

[services.virtio-net]
binary = "virtio-net"
restart = "always"
caps = [
    { name = "device_mmio", source = { kernel = "device_mmio" } },
    { name = "interrupt", source = { kernel = "interrupt" } },
    { name = "log", source = { kernel = "console" } },
]
exports = ["nic"]

[services.net-stack]
binary = "net-stack"
restart = "always"
caps = [
    { name = "nic", source = { service = { service = "virtio-net", export = "nic" } } },
    { name = "timer", source = { kernel = "timer" } },
    { name = "log", source = { kernel = "console" } },
]
exports = ["net"]

[services.fat-fs]
binary = "fat-fs"
restart = "always"
caps = [
    { name = "blk", source = { service = { service = "usb-storage", export = "block-device" } } },
    { name = "log", source = { kernel = "console" } },
]
exports = ["root-dir"]

[services.my-app]
binary = "my-app"
restart = "on-failure"
caps = [
    { name = "api", source = { service = { service = "http-service", export = "api" } } },
    { name = "docs", source = { service = { service = "fat-fs", export = "root-dir" } } },
    { name = "data", source = { service = { service = "store", export = "namespace" } } },
    { name = "log", source = { kernel = "console" } },
]

A build tool validates this against the capnp schemas (does virtio-net actually export "nic"? does http-service support endpoint() minting?) and produces the binary manifest.
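The export check is the simplest part of that validation to illustrate. The sketch below is not the real build tool: `Service`, `Source`, `ManifestError`, and `validate` are hypothetical, and the manifest is modelled as plain Rust structs rather than parsed TOML or capnp.

```rust
use std::collections::BTreeMap;

pub enum Source {
    /// Cap minted by the kernel, e.g. "console".
    Kernel(String),
    /// Cap exported by another service in the same manifest.
    Service { service: String, export: String },
}

pub struct Service {
    pub caps: Vec<(String, Source)>,
    pub exports: Vec<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ManifestError {
    UnknownService { consumer: String, cap: String, service: String },
    MissingExport { consumer: String, cap: String, service: String, export: String },
}

/// Check that every service-sourced cap names a declared service and one of
/// that service's exports. Kernel sources are not checked in this sketch.
pub fn validate(services: &BTreeMap<String, Service>) -> Vec<ManifestError> {
    let mut errors = Vec::new();
    for (consumer, svc) in services {
        for (cap, source) in &svc.caps {
            if let Source::Service { service, export } = source {
                match services.get(service) {
                    None => errors.push(ManifestError::UnknownService {
                        consumer: consumer.clone(),
                        cap: cap.clone(),
                        service: service.clone(),
                    }),
                    Some(provider) if !provider.exports.iter().any(|e| e == export) => {
                        errors.push(ManifestError::MissingExport {
                            consumer: consumer.clone(),
                            cap: cap.clone(),
                            service: service.clone(),
                            export: export.clone(),
                        })
                    }
                    Some(_) => {}
                }
            }
        }
    }
    errors
}
```

Collecting all errors instead of stopping at the first lets the build report every dangling reference in one pass.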

Runtime Config (via Store)

Once the store service is running, configuration can be stored there and updated without rebuilding the ISO. The store is just another capability — a config-management service could watch for changes and signal services to reload.

Connection to Network Transparency

If capabilities are the only abstraction, and capnp is the only wire format, then the transport is irrelevant:

  • Local IPC: capnp message copied between address spaces by kernel
  • Local store: capnp message written to block device
  • Remote IPC: capnp message sent over TCP to another machine
  • Remote store: capnp message fetched from a remote store service

A capability reference doesn’t encode where the backing service lives. The kernel (or a proxy) handles routing. This means:

  • A Directory cap could be backed by local FAT or a remote 9P server
  • A Namespace cap could be backed by local storage or a remote store
  • A Fetch cap could route through a local HTTP service or a remote proxy
  • A ProcessSpawner cap could spawn locally or on a remote machine

The system manifest could describe services that run on different machines, and the capability graph spans the network. This is the “network transparency” item in the roadmap — it falls out naturally from the model.

Persistence of the Capability Graph

The live capability graph (which process holds which caps) is ephemeral — it exists in kernel memory and is lost on reboot. The system manifest describes the intended graph, and init rebuilds it on each boot.

For true persistence (resume after reboot without re-initializing):

  1. Each service serializes its state to the store before shutdown
  2. On next boot, the manifest includes “restore from store hash X” hints
  3. Services read their saved state from the store and resume

This is application-level persistence, not kernel-level. The kernel doesn’t snapshot the capability graph — services are responsible for their own state. This avoids the complexity of EROS-style transparent persistence while still allowing stateful services.

Managed Cloud Backing

The local Store/Namespace interfaces define capOS persistence semantics. A cloud backend must be an adapter behind those interfaces, not a new ambient authority path. Services such as the adventure profile, expedition, and ledger services should serialize bounded Cap’n Proto records to a store capability; the caller should not know whether that store is backed by RAM, local disk, or a managed cloud service.

For cloud-first application data, use a narrow bridge service:

capOS service -> Store/Namespace or app-specific SaveStore cap -> Cloud bridge
              -> provider APIs

The bridge owns provider credentials and exposes only typed save/load/append operations. Ordinary clients never receive provider credentials, bucket names, database document paths, or broad write authority.

Recommended GCP mapping for game/profile style state:

  • Firestore Native mode for small mutable indexes and profile summaries that need transactional compare-and-set behavior.
  • Cloud Storage for larger immutable snapshots, evidence blobs, exports, and content-addressed objects. Object versioning and lifecycle policy should bound accidental overwrite recovery and storage growth.
  • Cloud Run for a small HTTPS or capnp-over-HTTP bridge endpoint when capOS cannot yet link provider SDKs directly.
  • Secret Manager for bridge-side service credentials and rotation; secrets do not enter ordinary capOS game clients.

Provider-specific records must still carry capOS-level schema version, content hash or release id, profile/tenant id, monotonic version, size limit, and migration policy. Writes that race on the same mutable profile or checkpoint must use an explicit version precondition and fail closed when stale. Append-only ledgers should append new records with previous-record hashes rather than rewriting history. Local QEMU tests should use a fake cloud bridge that enforces the same stale-write, append-only, wrong-profile, and size-bound rules before any real provider integration is accepted.
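A fake bridge for local tests could enforce the stale-write, wrong-profile, and size-bound rules roughly as follows. This is a sketch only: `FakeBridge`, `SaveError`, the 64 KiB bound, and the convention that an absent key has version 0 are assumptions, and the append-only ledger rule is left out.

```rust
use std::collections::HashMap;

/// Assumed per-record size bound for illustration.
pub const MAX_RECORD_BYTES: usize = 64 * 1024;

#[derive(Debug, PartialEq, Eq)]
pub enum SaveError {
    StaleVersion { current: u64 },
    WrongProfile,
    TooLarge,
}

pub struct Record {
    pub version: u64,
    pub bytes: Vec<u8>,
}

/// In-memory stand-in for a cloud bridge scoped to one profile.
pub struct FakeBridge {
    profile_id: String,
    records: HashMap<String, Record>,
}

impl FakeBridge {
    pub fn new(profile_id: &str) -> Self {
        Self { profile_id: profile_id.to_string(), records: HashMap::new() }
    }

    /// Compare-and-set write. `expected_version` must equal the stored
    /// version (0 for a key that does not exist yet), otherwise the write
    /// fails closed and nothing is modified.
    pub fn save(
        &mut self,
        profile_id: &str,
        key: &str,
        expected_version: u64,
        bytes: &[u8],
    ) -> Result<u64, SaveError> {
        if profile_id != self.profile_id {
            return Err(SaveError::WrongProfile);
        }
        if bytes.len() > MAX_RECORD_BYTES {
            return Err(SaveError::TooLarge);
        }
        let current = self.records.get(key).map_or(0, |r| r.version);
        if expected_version != current {
            return Err(SaveError::StaleVersion { current });
        }
        let next = current + 1;
        self.records
            .insert(key.to_string(), Record { version: next, bytes: bytes.to_vec() });
        Ok(next)
    }

    pub fn load(&self, key: &str) -> Option<&Record> {
        self.records.get(key)
    }
}
```

Two writers that both read version 1 and race to save will see exactly one success; the loser gets `StaleVersion` and must reload before retrying.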

User-Owned Browser Transport

Some user data should be portable without giving the capOS service operator a database role over it. For private player backup/sync, a browser can act as the transport to user-owned storage:

capOS save service -> encrypted save capsule -> browser
browser OAuth/Firebase session -> Google Drive appDataFolder or Firebase user doc

This is not the same as the managed cloud bridge above. In the browser-transport model, the user grants Drive/Firebase access to the web app, the browser writes opaque encrypted capsules, and capOS never receives the provider tokens. The encryption key follows the storage domain. Local capOS storage uses local capOS-host key material, while GCP-backed game-world state uses Cloud KMS envelope encryption, in which a per-world or per-shard KMS KEK wraps service-owned DEKs. Google Drive’s appDataFolder is a good fit for app-private backup files because it is hidden from ordinary Drive views and can use the narrow drive.appdata scope. Firebase/Firestore can also carry per-user encrypted capsule documents and provide offline cache/sync behavior, but the backend cannot validate encrypted game semantics beyond metadata and access rules.

Treat user-owned blobs as backup material, not authority:

  • The service validates signatures, profile id, content hash, schema version, monotonic version, previous hash, and size bounds before import.
  • Append-only ledgers, reward witness records, market receipts, and multiplayer outcomes remain service-owned or cloud-bridge-owned authoritative records.
  • A user may delete, duplicate, or roll back private blobs; restore code must handle that as an expected input, not as trusted history.
  • Game-world key capabilities, DEKs, and KMS decrypt/unwrap grants should not be exposed to the browser. For GCP-backed worlds, DEK unwrap and plaintext use are KMS/IAM-backed authority granted to the relevant game-world service. For local capOS storage, local key backup/recovery is a separate local-host policy.

For GCP-backed game-world state, provision one Cloud KMS key ring and symmetric CryptoKey KEK per world instance or shard. This follows the CloudKmsKeySource envelope model from the cryptography/key-management and volume-encryption proposals: Cloud KMS wraps or unwraps DEKs, and the game-world service uses the unwrapped DEK internally as service authority, modeled as a SymmetricKey capability. Grant Cloud KMS roles at the CryptoKey level where possible: roles/cloudkms.cryptoKeyEncrypter for encrypt-only writers that wrap new DEKs, roles/cloudkms.cryptoKeyDecrypter for restore or migration paths that unwrap existing DEKs, and roles/cloudkms.cryptoKeyEncrypterDecrypter only for the narrow game-world service that genuinely needs both operations. Do not model browser OAuth identities, Drive/Firebase handles, or capOS clients as holders of DEKs or KMS decrypt/unwrap grants, and do not rely on per-key-version IAM for this design.

Key rotation and world retirement are service operations, not browser-vault features. Rotation creates new Cloud KMS KEK versions for future DEK wrapping but does not re-encrypt existing capsules, rewrite wrapped DEK blobs, or disable/destroy old versions. Managed re-encryption or rewrapping must unwrap the old DEK while its KEK version remains usable, decrypt and validate the capsule inside the game-world service, then write a new capsule with a new DEK or a DEK rewrapped by the current primary KEK version. Old KEK versions should only be disabled or destroyed after inventory proves no accepted wrapped DEK depends on them. Retiring a world removes IAM decrypt authority first; disabling key versions can make protected capsules inaccessible, while destruction is delayed by the scheduled destruction period and irreversible once complete, so audit retention and recovery must be settled before destruction.

Phases

Phase 1: Boot Manifest (parallel with Stage 4)

  • Define SystemManifest schema in schema/
  • Build tool (tools/mkmanifest) that compiles system.cue into a capnp-encoded manifest and packs it into the ISO as a Limine module
  • Kernel parses the manifest and now creates only the initConfig.init process
  • Focused init-executor manifests pass the manifest to the separate init binary as bytes through the read-only BootPackage capability
  • The separate init binary is a generic manifest executor for the default system.cue path and focused init-executor smokes; focused shell-led smokes still use capos-shell as initConfig.init
  • No persistent storage yet — boot image is the only data source

Phase 2: File I/O Interfaces in Schema (parallel with Stage 6)

Depends on: IPC (Stage 6) for cross-process cap transfer. Endpoint, RECV, RETURN, capability transfer in CALL params, and capability transfer in RETURN results are already implemented. The BlockDevice / File / Directory / DirEntry / Store / Namespace schema has now landed in full. The File / Directory / Store / Namespace interfaces also have RAM-backed kernel CapObject implementations (Phase 3 slices 1-3); BlockDevice remains schema-only. Userspace services that export Directory / File / Store / Namespace caps over a real backing store have since landed (Phase 3 below), and the kernel RAM-backed caps are now qemu-only proof/fixture surface rather than a production persistence service – see Kernel Storage Cap Backers Are Fixtures. That history shaped two named downstream adapters:

  • POSIX adapter Phase P1.4 (vendored dash port) does not require the userspace service for its v0 smoke: the bootstrap-granted RAM-backed Directory + Namespace kernel caps from Phase 3 slices 1-3 are an adequate read-only in-rodata pseudo-fs backing, so P1.4 is now ready to start on the userspace libcapos-posix file/dir/stdio/env/printf surface and on dash vendoring; see POSIX Adapter Phase P1.4 and docs/backlog/posix-adapter-dash-port.md. P1.3 (pipe + recording ProcessSpawner-driven fork-for-exec) landed without storage caps, so P1.4 is the next surface that consumes this proposal.
  • WASI host adapter Phase W.5 (Preview 1 filesystem) similarly consumes the same kernel cap shape and is unblocked from the same cap-surface perspective; remaining W.5 work is on the wasi-host adapter side. See WASI Host Adapter Phase W.5.

Concrete work:

  • Add BlockDevice, File, Directory, and DirEntry to schema/capos.capnp, regenerate the checked-in capnp bindings, add the BLOCKDEVICE_INTERFACE_ID / FILE_INTERFACE_ID / DIRECTORY_INTERFACE_ID constants, and add a capos-config host roundtrip test. This was schema-only when it landed; kernel CapObject implementations followed in Phase 3 slices 1-3 (the Store / Namespace interfaces were added in slice 3). SharedBuffer is not a separate interface – bulk transfers reuse the existing MemoryObject capability, and the inline-Data read / write / readBlocks / writeBlocks variants are the v0 surface.
  • Demo: two-process file server (in-memory File/Directory service + client) that the POSIX and WASI adapters can resolve preopens against

Phase 3: RAM-backed Store (after Phase 2)

Depends on: IPC (Stage 6) for cross-process store access. Same downstream blockers as Phase 2 – the POSIX adapter v0 plan resolves /etc / /lib under a read-only Namespace once this lands.

Concrete work:

  • Slice 1: minimal RAM-backed File CapObject (kernel/src/cap/file.rs). FileCap is backed by a single in-kernel Vec<u8> byte buffer and implements the inline-Data surface of the landed File interface – read / write / stat / truncate / sync / close – with per-call payloads bounded at 64 KiB. close() invalidates the cap: the cap-table get_slot path consults validate_live() (which returns Revoked once closed), and an in-call() guard is the defense-in-depth backup, so a post-close call fails closed with an application exception. A new KernelCapSource::file grant source lets a manifest grant the cap; the make run-file-server-smoke QEMU smoke (demos/file-server-smoke/, system-file-server-smoke.cue) drives write/read/stat/close round-trips and asserts the closed-cap rejection. Bulk-buffer / MemoryObject-mapped variants are later slices.
  • Slice 2: minimal RAM-backed Directory CapObject (kernel/src/cap/directory.rs). DirectoryCap is an in-memory namespace (BTreeMap<String, DirectoryEntry>, where each entry is a FileCap or a sub-DirectoryCap) implementing the landed Directory interface – open / list / mkdir / remove / sub. open / mkdir / sub mint a File / Directory result capability through the existing IPC result-cap transfer machinery (no new transfer authority); file read/write goes through the transferred File caps, never through the Directory. remove deletes an entry and revoke()s the backing object so every cap already handed out for it fails closed on its next dispatch, and refuses a non-empty sub-directory; close() invalidates the cap and recursively revokes the subtree. sub() has no attenuation beyond the structural scoping every sub-Directory already has – per-method read-only attenuation is deferred. A new KernelCapSource::directory grant source lets a manifest grant the cap; the make run-directory-server-smoke QEMU smoke (demos/directory-server-smoke/, system-directory-server-smoke.cue) drives open/list/mkdir/remove/sub with cap transfer and asserts the post-remove fail-closed rejection.
  • Slice 3: Store and Namespace interfaces in schema/capos.capnp plus minimal RAM-backed Store / Namespace kernel CapObjects (kernel/src/cap/store.rs, kernel/src/cap/namespace.rs). The schema additions are purely additive (Store / Namespace interfaces and the store @34 / namespace @35 KernelCapSource ordinals); the STORE_INTERFACE_ID / NAMESPACE_INTERFACE_ID constants and a capos-config host roundtrip test landed alongside. StoreCap is a content-addressed blob store (BTreeMap<[u8; 32], Vec<u8>> keyed by the SHA-256 content hash from capos_lib::content_hash) implementing put / get / has / delete; put is idempotent for identical content, blob and count bounds keep one Store from ballooning the kernel heap, and delete is kept on the base interface for this focused proof (the StoreAdmin split and a GC-verified delete remain deferred – see the delete note above). NamespaceCap is a name->hash binding map (BTreeMap<String, Vec<u8>> for bindings plus a BTreeMap<String, Arc<NamespaceCap>> of sub children) implementing resolve / bind / list / sub; bind overwrites an existing name (mutable references are the point), sub(prefix) mints a structurally scoped child node and transfers it through the existing IPC result-cap machinery (no new transfer authority, idempotent for a repeated prefix), and the parent->child recursive revoke() reuses the same finite-tree lock-ordering invariant DirectoryCap documents. The bindings are opaque hash bytes – a NamespaceCap does not hold a StoreCap reference or verify the hash names a live blob in this slice. New KernelCapSource::store / KernelCapSource::namespace grant sources let a manifest grant the caps; the make run-store-namespace-smoke QEMU smoke (demos/store-namespace-smoke/, system-store-namespace-smoke.cue) drives Store put/has/get/delete and Namespace bind/resolve/list/sub with cap transfer and asserts two fail-closed rejections (a Store.get of an unknown hash and a Namespace.resolve of an unbound name).
  • Implement Store as a userspace service over an exported Endpoint, moving it out of the kernel data path: a two-process provider->consumer demo (demos/store-service/, system-userspace-store-smoke.cue, make run-userspace-store-smoke) serves put/get/has/delete from an in-RAM BTreeMap<[u8;32], Vec<u8>> – no kernel Store cap in the data path. It mirrors the kernel StoreCap blob-count bound and publishes a narrower 4 KiB service-specific inline blob limit because the endpoint-framed request must fit in the service receive buffer; the smoke proves the largest accepted inline blob and the first rejected over-limit blob. The client uses the stock capos-rt StoreClient over the service endpoint relabelled to STORE_INTERFACE_ID via the manifest expectedInterfaceId. Still RAM, not yet a real store.
  • Implement a persistent Store + Namespace userspace service backed by a granted BlockDevice, moving the durable serve boundary out of the kernel: a three-process demo (demos/storage-persist-service/, system-storage-persist-service.cue, make run-storage-persist-service) serves Store (put/get/has/delete/list) and Namespace (resolve/bind/list/sub) from a single service that owns the on-disk CAPOSUS1 whole-state snapshot over a virtio-blk BlockDevice – no kernel Store/Namespace cap in the data path. The snapshot stores content-addressed blob bytes (keys recomputed and re-verified on load) and name->hash bindings; a superblock names the live snapshot length, its content hash, and a monotonic generation, and every mutation writes the new payload fully into the standby of two alternating A/B payload regions (selected by generation parity) and FLUSHes it before the single-sector superblock write flips the generation, so the previously committed snapshot survives a crash at any write boundary. Namespace.sub returns a scoped Namespace cap by pre-minting a bounded pool of Namespace-typed service-object facets of the service’s own namespace endpoint (each a distinct receiver cookie, minted through a spawned sub-helper) and transferring one through the IPC result-cap path; scoped calls route back to the same endpoint by cookie. The client reaches both interfaces through manifest-granted service caps relabelled to STORE_INTERFACE_ID / NAMESPACE_INTERFACE_ID, and the two-boot make run-storage-persist-service proves the marker and note objects and their bindings survive a reboot (the service reloads them before the second boot writes anything) even after the harness garbages the standby payload region between the boots, simulating a commit interrupted mid payload write (torn-commit recovery proof).
  • Serve the result-cap-returning userspace Directory + File filesystem interfaces from userspace: a three-process demo (demos/storage-fs-service/, system-storage-fs-service.cue, make run-userspace-directory-file-smoke) runs a service (the init process) that owns an in-memory filesystem tree and serves Directory (open/list/mkdir/remove/sub/create/rename) and File (read/write/stat/truncate/sync/close) over a single endpoint, dispatched by the call’s stamped interface id and receiver-cookie badge – no kernel readonly_fs/writable_fs/installable_image cap in the data path. Directory.open (-> File), mkdir/sub (-> Directory) transfer result caps from bounded pools of pre-minted typed service-object facets of the same endpoint (minted through the spawned subhelper, each a distinct cookie). The client reaches the tree through a writable root (a Directory client-endpoint facet) and a read-only root (a Directory service-object facet over the same tree); read-only attenuation is structural – the read-only root and the read-only File handles it returns fail mutation methods closed by routing on the cookie, not a rights flag. The proof drives the positive surface plus fail-closed cases (closed/stale File handle, path traversal via ..//, absent paths, read-only mutation, oversize writes). The existing kernel-backed WASI filesystem smoke (make run-wasi-fs) stays green as the explicitly fixture-labeled kernel Directory/File path. The follow-up cleanup retiring the kernel storage cap backers as production routes has landed – see Kernel Storage Cap Backers Are Fixtures below.
  • Backed by RAM (no disk driver yet, data lost on reboot)
  • Backed by a real store (persistent userspace service over BlockDevice, survives reboot)
  • Services can store and retrieve capnp objects at runtime
  • Demonstrate the naming model with a userspace Namespace service
  • Namespace.sub() returns new caps via IPC cap transfer
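The A/B commit used by the persistent Store + Namespace service can be sketched independently of the block layer. In the model below, two byte vectors stand in for the payload regions, an `Option<Superblock>` stands in for the single-sector superblock, and FNV-1a replaces the real content hash; all type and function names are hypothetical.

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Superblock {
    pub generation: u64,
    pub len: usize,
    pub checksum: u64,
}

pub struct Disk {
    pub superblock: Option<Superblock>,
    pub regions: [Vec<u8>; 2],
}

/// Stand-in checksum (FNV-1a, 64-bit); the service described above records
/// a content hash of the snapshot instead.
pub fn checksum(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for byte in bytes {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// The payload region for a generation is selected by parity.
fn region_for(generation: u64) -> usize {
    (generation % 2) as usize
}

impl Disk {
    pub fn new() -> Self {
        Self { superblock: None, regions: [Vec::new(), Vec::new()] }
    }

    /// Step 1: write the new payload into the standby region. The live
    /// region and the superblock are untouched.
    pub fn write_standby(&mut self, payload: &[u8]) -> Superblock {
        let next = self.superblock.as_ref().map_or(1, |sb| sb.generation + 1);
        self.regions[region_for(next)] = payload.to_vec();
        Superblock { generation: next, len: payload.len(), checksum: checksum(payload) }
    }

    /// Step 2 (after the payload is flushed): the superblock write flips
    /// the generation and makes the new snapshot live.
    pub fn commit(&mut self, superblock: Superblock) {
        self.superblock = Some(superblock);
    }

    /// Load the committed snapshot, verifying length and checksum.
    pub fn load(&self) -> Option<&[u8]> {
        let sb = self.superblock.as_ref()?;
        let payload = self.regions[region_for(sb.generation)].get(..sb.len)?;
        (checksum(payload) == sb.checksum).then_some(payload)
    }
}
```

A crash between the two steps leaves the old superblock pointing at the old, intact region, which is the property the torn-commit proof exercises by garbaging the standby region between boots.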

Kernel Storage Cap Backers Are Fixtures

The kernel Store, Namespace, File, Directory, readOnlyFsRoot, persistentStore, and writableFsRoot grant sources were the proof paths that landed the typed storage interfaces. Now that the userspace services above own the production serve boundary – the RAM Store service (demos/store-service, make run-userspace-store-smoke), the disk-backed Store + Namespace service (demos/storage-persist-service, make run-storage-persist-service), and the Directory + File filesystem service (demos/storage-fs-service, make run-userspace-directory-file-smoke) – the kernel backers are explicitly proof/fixture surface, not production storage routes. Production storage is userspace-served; no production manifest grants ownership of kernel-owned storage state (the default system.cue boot grants none).

The kernel grant sources are gated accordingly:

  • The RAM-backed file / directory / store / namespace sources are gated behind the qemu feature in both the bootstrap cap-table builder (kernel/src/cap/mod.rs) and the ProcessSpawner spawn-grant path (kernel/src/cap/process_spawner.rs). The default non-qemu production kernel fails closed on these sources. They remain available only as the in-RAM pseudo-fs backing for the qemu interface proofs (make run-store-namespace-smoke, make run-file-server-smoke, make run-directory-server-smoke, make run-storage-naming) and for the POSIX/WASI/dash adapter smokes (make run-posix-*, make run-wasi-fs).
  • The disk-backed virtio read_only_fs_root / persistent_store / writable_fs_root sources (kernel/src/cap/readonly_fs.rs, persistent_store.rs, writable_fs.rs) were already gated behind qemu (with storage_fat_read / cloud_*_over_nvme_proof variants for the FAT and NVMe proof arms) and fail closed in the default production kernel. They back the storage regression proofs make run-storage-fs, make run-storage-persist, and make run-storage-writable (plus the FAT and NVMe proof targets), which stay green as explicitly fixture-labeled kernel paths.

In short: the kernel keeps these backers only as named qemu/cloud-proof fixtures; a default production build has no kernel storage grant route, so the typed storage interfaces are served from userspace.

Phase 4: BlockDevice Drivers and Filesystem (after virtio infrastructure)

  • virtio-blk driver (userspace, reuses virtqueue infrastructure from networking smoke test)
  • BlockDevice trait implementation
  • FAT filesystem service: wraps BlockDevice, exports Directory/File caps
  • SharedBuffer integration for bulk reads (depends on Stage 6 MemoryObject)
  • Store service uses BlockDevice for persistence (the persistent userspace Store + Namespace service above, make run-storage-persist-service)
  • System state survives reboot via the persistent userspace store (make run-storage-persist-service); manifest restore hints remain future work

Phase 5: Network Store (after networking)

  • Store service can replicate to or fetch from a remote store
  • Capability references transparently span machines
  • Directory cap backed by a remote filesystem (9P-style)
  • Managed cloud bridges can back selected Store/Namespace or app-specific SaveStore capabilities without changing caller authority. First target: GCP-backed profile/ledger/snapshot storage for the adventure demo, with local fake-cloud tests and no provider credentials in ordinary clients.
  • User-owned browser transport can store encrypted save capsules in Google Drive appDataFolder or Firebase user documents. This is for private backup/sync, not authoritative shared state.

Relationship to Other Proposals

  • Networking proposal — the NIC driver and net stack are services described in the manifest, not hardcoded. The store could be backed by network storage once networking works. A remote Directory cap (9P over capnp) reuses the same File/Directory interfaces.
  • Service architecture proposal — the manifest replaces code-as-config for init. ProcessSpawner, supervision, and cap export work as described there, but driven by manifest data instead of compiled Rust code. IPC Endpoints are the mechanism for service export.
  • Capability model — IPC cap transfer (Endpoint + RETURN SQE) is the mechanism that makes open() and resolve() work. SharedBuffer is the bulk data path that makes file I/O practical. Both are tracked in docs/roadmap.md Stage 6.
  • POSIX Adapter — Phase P1.4 (vendored dash port) consumes the Namespace + File + Directory cap surface defined here; that surface landed as RAM-backed kernel CapObjects in Phase 3 slices 1-3 and is the v0 backing for the dash smoke’s read-only in-rodata pseudo-fs. P1.3 (recording-shim pipe + fork-for-exec) has already landed without storage caps, so P1.4 is the next adapter consumer. The POSIX path resolver, open/read/write/stat/unlink, /etc and /lib preopen scoping, and the dash port itself all sit on this proposal’s Phase 2/3 schema.
  • WASI Host Adapter — Phase W.5 (Preview 1 filesystem: fd_read/fd_write/fd_seek/fd_pread/fd_pwrite/fd_filestat_get/path_open/path_filestat_get/path_unlink_file) consumes the same cap shape and is unblocked from the cap-surface side (Phase 3 slices 1-3 landed the RAM-backed Directory / Namespace / File caps). Preopened-dir fds map to Namespace caps from the manifest; path_open resolves through that namespace’s Store / File capability. Phases W.2/W.3/W.4 (stdout, argv-grant, random_get) shipped without storage caps, so W.5 is the next adapter consumer alongside POSIX P1.4.
  • Userspace Binaries Parts 4 and 5 — the POSIX adapter (Part 4) and the WASI host adapter (Part 5) both describe their filesystem stories as translations onto this proposal’s Namespace / Directory / File / Store surface. Part 4 sketches the Namespace-rooted POSIX fd table and the Namespace + Store -> file I/O translation; Part 5 maps each preopened-dir fd to a Namespace cap.
  • Adventure game proposal — profile, expedition, ledger, and content persistence use application-level save records through Store/Namespace or an app-specific cloud bridge. The game should not persist by snapshotting a live process or exposing provider credentials to clients.
  • Cryptography/key-management and volume-encryption proposals — the Cloud KMS path uses envelope encryption: KMS wraps data encryption keys (DEKs) under key encryption keys (KEKs), while capOS services use local SymmetricKey authority for plaintext operations.
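The adapter bullets above both reduce to the same operation: take a POSIX-style path, find the preopened root it falls under, and walk names inside that root only. A minimal sketch of that resolution follows. `Node`, `Preopens`, and `ResolveError` are hypothetical stand-ins for the Namespace / Directory / File caps, and the in-memory map replaces real capability calls.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for what a Namespace cap resolves to.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    File(Vec<u8>),
    Dir(HashMap<String, Node>),
}

#[derive(Debug, PartialEq, Eq)]
pub enum ResolveError {
    NoPreopen,
    EscapesRoot,
    NotFound,
    NotADirectory,
}

/// Preopened roots, e.g. "/etc" and "/lib", each backed by its own namespace.
pub struct Preopens {
    roots: Vec<(String, Node)>,
}

impl Preopens {
    pub fn new(roots: Vec<(String, Node)>) -> Self {
        Preopens { roots }
    }

    /// Resolve an absolute path without ever leaving the matching preopen.
    pub fn resolve(&self, path: &str) -> Result<&Node, ResolveError> {
        // The prefix must match on a component boundary: "/etc" covers
        // "/etc/passwd" but not "/etcetera".
        let (prefix, root) = self
            .roots
            .iter()
            .find(|(p, _)| {
                path == p.as_str()
                    || path
                        .strip_prefix(p.as_str())
                        .map_or(false, |rest| rest.starts_with('/'))
            })
            .ok_or(ResolveError::NoPreopen)?;

        // Normalise "." and ".." relative to the preopen root only.
        let mut stack: Vec<&str> = Vec::new();
        for comp in path[prefix.len()..].split('/') {
            match comp {
                "" | "." => {}
                ".." => {
                    if stack.pop().is_none() {
                        return Err(ResolveError::EscapesRoot);
                    }
                }
                name => stack.push(name),
            }
        }

        // Walk the namespace one name at a time.
        let mut node = root;
        for name in stack {
            match node {
                Node::Dir(entries) => {
                    node = entries.get(name).ok_or(ResolveError::NotFound)?;
                }
                Node::File(_) => return Err(ResolveError::NotADirectory),
            }
        }
        Ok(node)
    }
}
```

With only `/etc` preopened, `/etc/passwd` resolves, `/etc/../lib/x` is rejected as an escape, and `/home/u` has no matching root. A path outside every granted namespace therefore has nothing to resolve against, which is how the adapters avoid reintroducing ambient authority.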

Open Questions

  1. Manifest validation. How much can the build tool verify statically? Cap export names depend on runtime behavior of services. Should services declare their exports in their own metadata (like a package manifest)?

  2. Schema evolution. When a service’s capnp interface changes, stored objects referencing the old schema need migration. Cap’n Proto has backwards-compatible schema evolution, but breaking changes need a story.

  3. Garbage collection. Content-addressed store accumulates unreferenced objects. Who GCs? A separate service with Store read + delete authority? Reference counting in the namespace layer?

  4. Large objects. Storing multi-megabyte binaries as single capnp Data fields is wasteful (capnp allocates contiguously). SharedBuffer partially addresses this for I/O, but the Store’s put/get interface still takes Data. Options: chunked storage (Merkle tree of hashes), a streaming Blob interface, or SharedBuffer-aware Store methods.

  5. Trust model for the manifest. The boot manifest has full authority to define the system. Who signs it? How do you prevent a tampered ISO from granting excessive caps? Secure boot integration?

  6. File locking and concurrent access. Multiple processes opening the same file through the same filesystem service need coordination. Options: mandatory locking in the filesystem service (rejects conflicting opens), advisory locking via a separate Lock capability, or single-writer enforcement at the Directory level (open with exclusive flag).

  7. RETURN+RECV atomicity. When a server posts a RETURN SQE followed by a RECV SQE, there must be no window where a client call can arrive but the server isn’t listening. SQE LINK chaining (RETURN → RECV) should provide this atomicity — the kernel processes both SQEs as a unit.
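Questions 3 and 4 interact: once large objects are split into hashed chunks, garbage collection has to follow index nodes to find live chunks. The sketch below explores that with a one-level chunk index and a mark-and-sweep pass from a set of roots. It is a toy model under stated assumptions: `Store`, `Object`, `put_chunked`, and `gc` are hypothetical names, and `DefaultHasher` is only a stand-in for the cryptographic hash a real content-addressed store would need.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

/// ASSUMPTION: 64-bit non-cryptographic id, used here for illustration only.
pub type ObjectId = u64;

/// A stored object is either a leaf chunk or an index listing child ids.
#[derive(Clone, Debug, PartialEq, Hash)]
pub enum Object {
    Chunk(Vec<u8>),
    Index(Vec<ObjectId>),
}

#[derive(Default)]
pub struct Store {
    objects: HashMap<ObjectId, Object>,
}

impl Store {
    /// Content-address an object: the id is derived from the object itself.
    pub fn put(&mut self, obj: Object) -> ObjectId {
        let mut h = DefaultHasher::new();
        obj.hash(&mut h);
        let id = h.finish();
        self.objects.insert(id, obj);
        id
    }

    /// Split `data` into fixed-size chunks and store an index over them.
    pub fn put_chunked(&mut self, data: &[u8], chunk_size: usize) -> ObjectId {
        assert!(chunk_size > 0, "chunk_size must be non-zero");
        let children: Vec<ObjectId> = data
            .chunks(chunk_size)
            .map(|c| self.put(Object::Chunk(c.to_vec())))
            .collect();
        self.put(Object::Index(children))
    }

    /// Reassemble a blob; returns None if any referenced object is missing.
    pub fn get_blob(&self, id: ObjectId) -> Option<Vec<u8>> {
        match self.objects.get(&id)? {
            Object::Chunk(bytes) => Some(bytes.clone()),
            Object::Index(children) => {
                let mut out = Vec::new();
                for child in children {
                    out.extend(self.get_blob(*child)?);
                }
                Some(out)
            }
        }
    }

    /// Mark everything reachable from `roots`, sweep the rest.
    /// Returns the number of objects removed.
    pub fn gc(&mut self, roots: &[ObjectId]) -> usize {
        let mut live = HashSet::new();
        let mut stack: Vec<ObjectId> = roots.to_vec();
        while let Some(id) = stack.pop() {
            if !live.insert(id) {
                continue; // already marked
            }
            if let Some(Object::Index(children)) = self.objects.get(&id) {
                stack.extend(children.iter().copied());
            }
        }
        let before = self.objects.len();
        self.objects.retain(|id, _| live.contains(id));
        before - self.objects.len()
    }
}
```

In this model the root set would come from the namespace layer, so the collector needs namespace read authority plus Store delete authority and nothing else. Whether that lives in a separate service or inside the Store is the open part of question 3.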