
Proposal: Storage, Naming, and Persistence

What replaces the filesystem in a capability OS where Cap’n Proto is the universal wire format.

The Problem with Filesystems

In Unix, the filesystem is the universal namespace. Everything is a path: /dev/sda, /etc/config, /proc/self/fd/3, /run/dbus/system_bus_socket. Paths are ambient authority — any process can open /etc/passwd if the permission bits allow. The filesystem conflates naming, access control, persistence, and device abstraction into one mechanism.

capOS has capabilities instead of paths. Access control is structural (you can only use what you were granted), not advisory (permission bits checked at open time). This means:

  • No global namespace needed — each process sees only its granted caps
  • No path-based access control — the cap IS the access
  • No distinction between “file”, “device”, “socket” — everything is a typed capability interface

A traditional VFS would reintroduce ambient authority through the back door. Instead, capOS needs a storage and naming model native to capabilities and Cap’n Proto.

Core Insight: Cap’n Proto Everywhere

Cap’n Proto is already used in capOS for:

  • Interface definitions — .capnp schemas define capability contracts
  • IPC messages — capability invocations are capnp messages
  • Serialization — capnp wire format crosses process boundaries

If we extend this to storage, then:

  • Stored objects are capnp messages
  • Configuration is capnp structs
  • Binary images are capnp-wrapped blobs
  • The boot manifest is a capnp message describing the initial capability graph

No format conversion anywhere. The same tools (schema compiler, serializer, validator) work for IPC, storage, config, and network transfer.

Architecture

Three Layers

Target architecture after the manifest executor and process-spawner work:

Boot Image (read-only, baked into ISO)
  │
  │  capnp-encoded manifest + binaries
  │
  v
Kernel (creates initial caps from manifest)
  │
  │  grants caps to init
  │
  v
Init (builds live capability graph)
  │
  ├──> Filesystem services (FAT, ext4 — wrap BlockDevice as Directory/File)
  │
  ├──> Store service (capability-native content-addressed storage)
  │      backed by: virtio-blk, RAM, or network
  │
  └──> All other services (receive Directory, Store, or Namespace caps)

Layer 1: Boot Image

The boot image (ISO/disk) contains a capnp-encoded system manifest loaded as a Limine module alongside the kernel. The manifest describes:

struct SystemManifest {
    # Binaries available at boot, keyed by name
    binaries @0 :List(NamedBlob);
    # Initial service graph — what to spawn and with what caps
    services @1 :List(ServiceEntry);
    # Static configuration values as an evaluated CUE-style tree
    config @2 :CueValue;
}

struct NamedBlob {
    name @0 :Text;
    data @1 :Data;
}

struct ServiceEntry {
    name @0 :Text;
    binary @1 :Text;          # references a NamedBlob by name
    caps @2 :List(CapRef);    # what caps this service receives
    restart @3 :RestartPolicy;
    exports @4 :List(Text);   # cap names this service is expected to export
}

struct CapRef {
    name @0 :Text;                 # local name in the child's cap table
    expectedInterfaceId @1 :UInt64; # generated .capnp TYPE_ID for validation
    union {
        unset @2 :Void;             # invalid; keeps omitted sources fail-closed
        kernel @3 :KernelCapSource;
        service @4 :ServiceCapSource;
    }
}

enum KernelCapSource {
    console @0;
    endpoint @1;
    frameAllocator @2;
    virtualMemory @3;
}

struct ServiceCapSource {
    service @0 :Text;
    export @1 :Text;
}

enum RestartPolicy {
    never @0;
    onFailure @1;
    always @2;
}

struct CueValue {
    union {
        null @0 :Void;
        boolean @1 :Bool;
        intValue @2 :Int64;
        uintValue @3 :UInt64;
        text @4 :Text;
        bytes @5 :Data;
        list @6 :List(CueValue);
        fields @7 :List(CueField);
    }
}

struct CueField {
    name @0 :Text;
    value @1 :CueValue;
}

Capability source identity is already structured in the bootstrap manifest (the CapRef union above), so source selection does not depend on parsing authority strings.

KernelCapSource / ServiceCapSource select the authority to grant. The expectedInterfaceId field carries the generated Cap’n Proto interface TYPE_ID and only checks that the granted object speaks the expected schema. It cannot replace source identity: many different objects may expose the same interface while representing different authority.

The build system (Makefile) generates this manifest from a human-authored description and packs it into the ISO as manifest.bin. Current code embeds every SystemManifest.binaries entry into that manifest as NamedBlob data, including the release-built init and smoke-demo ELFs. Exposing the manifest to init as a read-only BootPackage capability (rather than letting the kernel parse and act on the service graph) is the selected follow-on milestone.

Using a CueValue tree instead of AnyPointer keeps the manifest directly decodable in no_std userspace without depending on Cap’n Proto reflection.
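As an illustration of that decode path, the config walk can be sketched with a plain Rust enum mirroring the CueValue schema. The owned enum and the dotted-path lookup helper are illustrative stand-ins; real code would use the generated Cap'n Proto reader types.

```rust
// Illustrative mirror of the CueValue schema above; a real no_std decoder
// would walk the generated Cap'n Proto readers instead of an owned tree.
#[derive(Debug, Clone, PartialEq)]
enum CueValue {
    Null,
    Boolean(bool),
    Int(i64),
    Uint(u64),
    Text(String),
    Bytes(Vec<u8>),
    List(Vec<CueValue>),
    Fields(Vec<(String, CueValue)>), // CueField { name, value }
}

impl CueValue {
    /// Resolve a dotted path like "services.fat-fs.binary" against a Fields tree.
    fn lookup(&self, path: &str) -> Option<&CueValue> {
        let mut cur = self;
        for part in path.split('.') {
            match cur {
                CueValue::Fields(fields) => {
                    cur = &fields.iter().find(|(n, _)| n == part)?.1;
                }
                _ => return None,
            }
        }
        Some(cur)
    }
}

fn main() {
    let config = CueValue::Fields(vec![(
        "services".into(),
        CueValue::Fields(vec![(
            "fat-fs".into(),
            CueValue::Fields(vec![("binary".into(), CueValue::Text("fat-fs".into()))]),
        )]),
    )]);
    assert_eq!(
        config.lookup("services.fat-fs.binary"),
        Some(&CueValue::Text("fat-fs".into()))
    );
}
```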

Transitional Schema Note

ServiceEntry, CapSource::Service, and ServiceEntry.exports are transitional. ProcessSpawner and copy/move cap transfer are implemented (2026-04-22), but the default make run boot path still has the kernel spawn every declared service and wire cross-service caps. Once init owns generic manifest execution, the manifest loses the service graph entirely:

struct SystemManifest {
    # Binaries available at boot, keyed by name
    binaries @0 :List(NamedBlob);
    # Init's config blob (replaces the service graph)
    initConfig @1 :CueValue;
    # Kernel boot parameters (memory limits, feature flags)
    kernelParams @2 :CueValue;
}

ServiceEntry / CapRef disappear from the schema and become plain CUE fields inside initConfig. Init reads them at runtime and calls ProcessSpawner directly. validate_manifest_graph, validate_bootstrap_cap_sources, and create_all_service_caps all retire once that happens. See docs/proposals/service-architecture-proposal.md — “Legacy Manifest Fields After Stage 6” for the deprecation plan.

Layer 2: Kernel Bootstrap

Target design for the kernel’s boot role:

  1. Parse the system manifest (read-only capnp message from Limine module).
  2. Hash the embedded binaries for optional measured-boot attestation.
  3. Create kernel-provided capabilities: Console, Timer, DeviceManager, ProcessSpawner, FrameAllocator, VirtualMemory (per-process), and a read-only BootPackage cap exposing SystemManifest.binaries and initConfig.
  4. Spawn init — exactly one userspace process — with that cap bundle.

Current code has not reached this split for the default boot: the kernel still parses the manifest and creates one process per ServiceEntry. The transition path exists in system-spawn.cue: it sets config.initExecutesManifest, the kernel validates the full manifest but boots only init, and init spawns endpoint, IPC, VirtualMemory, and FrameAllocator cleanup demo children through ProcessSpawner. Retiring the legacy kernel resolver for default make run is the selected follow-on milestone tracked in WORKPLAN.md.

Layer 3: Init and the Live Capability Graph

Target init reads initConfig from the BootPackage cap and executes it:

fn main(caps: CapSet) {
    let spawner = caps.get::<ProcessSpawner>("spawner");
    let boot = caps.get::<BootPackage>("boot");
    let config = boot.init_config()?;  // CueValue

    // Walk service entries from the config and spawn in dependency order
    for entry in config.field("services")?.iter()? {
        let binary = boot.binary(entry.field("binary")?.as_str()?)?;
        let granted = resolve_caps(entry.field("caps")?, &running_services, &caps);
        let handle = spawner.spawn(binary, granted, entry.field("restart")?.into())?;
        running_services.insert(entry.field("name")?.as_str()?.into(), handle);
    }

    supervisor_loop(&running_services);
}

In this target model, init is a generic manifest executor rather than a hardcoded service graph. The system topology is defined in the boot package’s initConfig, not in init’s source code. Changing what services run means rebuilding the boot image with a different config blob, not recompiling init. Manifest graph resolution stops being a kernel concern.

The current transition still uses SystemManifest.services as the service graph instead of initConfig; init reads the BootPackage manifest, validates a metadata-only ManifestBootstrapPlan, resolves kernel and service cap sources, records exported caps, spawns children in manifest order, and waits for their ProcessHandles.

Two Storage Models

capOS supports two complementary storage models, both exposed as typed capabilities:

Filesystem Capabilities (Directory, File)

For accessing traditional block-based filesystems (FAT, ext4, ISO9660) and for POSIX compatibility. A filesystem service wraps a BlockDevice and exports Directory/File capabilities.

BlockDevice (raw sectors)
    │
    └──> Filesystem service (FAT, ext4, ...)
              │
              ├──> Directory caps (namespace over files)
              └──> File caps (read/write byte streams)

This model maps naturally to USB flash drives, NVMe partitions, and network-mounted filesystems. The open() and sub() operations return new capabilities via IPC cap transfer (see “IPC and Capability Transfer” below).

Capability-Native Store (Store, Namespace)

For capOS-native data: configuration, service state, content-addressed object storage. A store service wraps a BlockDevice and exports Store/Namespace capabilities.

BlockDevice (raw sectors)
    │
    └──> Store service
              │
              ├──> Store cap (content-addressed put/get)
              └──> Namespace caps (mutable name→hash mappings)

Content-addressing provides automatic deduplication, verifiable integrity, and immutable references. Namespaces add mutable bindings on top.

Bridging the Two Models

The models are composable. An adapter service can bridge between them:

  • FsStore adapter: exposes a Directory tree as a content-addressed Store (hash each file’s contents, directory listings become capnp-encoded objects)
  • StoreFS adapter: exposes Store/Namespace as a Directory tree (each name maps to a File whose contents are the stored object)
  • Import/export: a utility service reads files from a Directory and stores them in a Store, or materializes Store objects as files in a Directory

In both cases the adapter is a userspace service holding caps to both subsystems. No kernel mechanism needed — just capability composition.

File I/O Interfaces

Directory, File, Store, and Namespace caps may be scoped to a user session, guest profile, anonymous request, or service identity, but the cap remains the authority. POSIX ownership metadata is compatibility data inside these services, not a system-wide authorization channel. See user-identity-and-policy-proposal.md.

BlockDevice

Raw sector access, served by device drivers (virtio-blk, NVMe, USB mass storage). The driver receives hardware capabilities (MMIO, IRQ, FrameAllocator for DMA) and exports a BlockDevice cap.

interface BlockDevice {
    readBlocks  @0 (startLba :UInt64, count :UInt32) -> (data :Data);
    writeBlocks @1 (startLba :UInt64, count :UInt32, data :Data) -> ();
    info        @2 () -> (blockSize :UInt32, blockCount :UInt64, readOnly :Bool);
    flush       @3 () -> ();
}

For bulk transfers, readBlocks/writeBlocks accept a SharedBuffer capability instead of inline Data (see “Shared Memory for Bulk Data” below). The inline-Data variants work for metadata reads and small operations; the SharedBuffer variants avoid copies for large I/O.

File

Byte-stream access to a single file. Served by filesystem services. Created dynamically when a client calls Directory.open() — the filesystem service creates a File CapObject for the opened file and transfers it to the caller via IPC cap transfer.

interface File {
    read     @0 (offset :UInt64, length :UInt32) -> (data :Data);
    write    @1 (offset :UInt64, data :Data) -> (written :UInt32);
    stat     @2 () -> (size :UInt64, created :UInt64, modified :UInt64);
    truncate @3 (length :UInt64) -> ();
    sync     @4 () -> ();
    close    @5 () -> ();
}

close releases the server-side state for this file (open cluster chain cache, dirty buffers). The kernel-side CapTable entry is removed by the system transport via CAP_OP_RELEASE when the local holder releases it; generated capos-rt handle drop still needs RAII integration before ordinary userspace handles submit that opcode automatically. CapabilityManager is management-only (list(), later grant()); it does not expose a drop() method because ordinary handle lifetime belongs to the transport, not to an application call on the same table that dispatches it.

Attenuation: a read-only File wraps the original and rejects write, truncate, sync calls. An append-only File rejects write at offsets other than the current size.
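A minimal sketch of that attenuation pattern, using a hypothetical local File trait in place of the generated capability interface (only read/write shown for brevity):

```rust
// Hypothetical stand-in for the generated File capability interface.
trait File {
    fn read(&mut self, offset: u64, length: u32) -> Result<Vec<u8>, &'static str>;
    fn write(&mut self, offset: u64, data: &[u8]) -> Result<u32, &'static str>;
}

/// Attenuating wrapper: forwards reads, rejects all mutation.
struct ReadOnlyFile<F: File>(F);

impl<F: File> File for ReadOnlyFile<F> {
    fn read(&mut self, offset: u64, length: u32) -> Result<Vec<u8>, &'static str> {
        self.0.read(offset, length)
    }
    fn write(&mut self, _offset: u64, _data: &[u8]) -> Result<u32, &'static str> {
        Err("permission denied: read-only capability")
    }
}

// Trivial in-memory File used to exercise the wrapper.
struct MemFile(Vec<u8>);

impl File for MemFile {
    fn read(&mut self, offset: u64, length: u32) -> Result<Vec<u8>, &'static str> {
        let start = (offset as usize).min(self.0.len());
        let end = (start + length as usize).min(self.0.len());
        Ok(self.0[start..end].to_vec())
    }
    fn write(&mut self, offset: u64, data: &[u8]) -> Result<u32, &'static str> {
        let start = offset as usize;
        if self.0.len() < start + data.len() {
            self.0.resize(start + data.len(), 0);
        }
        self.0[start..start + data.len()].copy_from_slice(data);
        Ok(data.len() as u32)
    }
}

fn main() {
    let mut ro = ReadOnlyFile(MemFile(b"hello".to_vec()));
    assert_eq!(ro.read(0, 5).unwrap(), b"hello".to_vec());
    assert!(ro.write(0, b"x").is_err()); // attenuated: mutation is refused
}
```

The wrapper holds the full-authority object privately; the client only ever receives a cap to the wrapper, so the restriction is structural rather than a permission check.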

Directory

Namespace over files on a filesystem. Served by filesystem services. open() and sub() return new capabilities via IPC cap transfer.

interface Directory {
    open    @0 (name :Text, flags :UInt32) -> (file :File);
    list    @1 () -> (entries :List(DirEntry));
    mkdir   @2 (name :Text) -> (dir :Directory);
    remove  @3 (name :Text) -> ();
    sub     @4 (name :Text) -> (dir :Directory);
}

struct DirEntry {
    name  @0 :Text;
    size  @1 :UInt64;
    isDir @2 :Bool;
}

sub() returns a Directory scoped to a subdirectory — the analog of chroot. The caller cannot traverse upward or see the parent directory. open() with create flags creates a new file if it doesn’t exist.

The flags field in open() is a bitmask: CREATE = 1, TRUNCATE = 2, APPEND = 4. No READ/WRITE flags — those are determined by the Directory cap’s attenuation (a read-only Directory returns read-only Files).

Syscall Trace: Reading a File from a FAT USB Drive

Four userspace processes: App, FAT service, USB mass storage, xHCI driver.

With promise pipelining (one submission):

Cap’n Proto promise pipelining lets the App chain dependent calls without waiting for intermediate results. The App submits a single pipelined request: “open this file, then read from the result”:

# Single pipelined submission (SQEs with PIPELINE flag):
#   call 0: dir.open("report.pdf")         → promise P0
#   call 1: P0.file.read(offset=0, len=4096)  → depends on P0

cap_submit([
    {cap=2, method=OPEN, params={"report.pdf", flags=0}},
    {cap=PIPELINE(0, field=file), method=READ, params={offset:0, length:4096}},
])
  → kernel routes call 0 to FAT service via Endpoint
  → FAT service reads directory entry from BlockDevice
  → FAT service creates FileCapObject, replies with File cap
  → kernel sees pipelined call 1 targeting the File cap from call 0
  → kernel dispatches call 1 to the same FAT service (or direct-invokes
    the new File CapObject if it's a local endpoint)
  → FAT service maps offset → cluster chain → LBA
  → FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
      → USB mass storage → xHCI → hardware → back up
  ← completion: {data: [4096 bytes]}, File cap installed as cap_id=5

One app-to-kernel transition. The kernel resolves the pipeline dependency internally — the App never sees the intermediate File cap until the whole chain completes (though the cap is installed and usable afterward).

This is a core Cap’n Proto feature: by expressing “call method on the not-yet-resolved result of another call,” the client avoids a round-trip for each link in the chain. For deeper chains (e.g., dir.sub("a").sub("b").open("file").read(0, 4096)), the savings compound — one submission instead of four sequential syscalls.
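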

Without pipelining (two sequential ring submissions):

Without promise pipelining, the App submits two separate CALL SQEs via the ring, blocking on each completion before submitting the next:

# 1. Open file (App holds Directory cap, cap_id=2)
# App writes CALL SQE: {cap=2, method=OPEN, params={"report.pdf", flags=0}}
cap_enter(min_complete=1, timeout=MAX)
  → kernel routes CALL to FAT service via Endpoint
  → FAT service reads directory entry from BlockDevice
  → FAT service creates FileCapObject for this file
  → FAT service posts RETURN SQE with [FileCapObject] in xfer_caps
  → kernel installs File cap in App's table → cap_id=5
  ← App reads CQE: result={file: cap_index=0}, new_caps=[5]

# 2. Read 4096 bytes from offset 0
# App writes CALL SQE: {cap=5, method=READ, params={offset:0, length:4096}}
cap_enter(min_complete=1, timeout=MAX)
  → kernel routes CALL to FAT service
  → FAT service maps offset → cluster chain → LBA
  → FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
      → kernel routes to USB mass storage
      → mass storage submits CALL SQE: {cap=usb_cap, method=BULK_TRANSFER, params={scsi_cmd}}
          → kernel routes to xHCI driver
          → xHCI programs TRBs, waits for interrupt
          ← returns raw sector data
      ← returns sector data
  ← FAT service extracts file bytes, posts RETURN SQE with {data: [4096 bytes]}

This works but costs two round-trips where pipelining needs one. The synchronous path is useful for simple cases and bootstrapping; pipelining is the intended steady-state model.

In both cases, the intermediate IPC hops (FAT → USB mass storage → xHCI) are invisible to the App.

Capability-Native Store

The Store Capability

Once the system is running, persistent storage is provided by a userspace service — the store. It’s backed by a block device (virtio-blk), and exposes a content-addressed object store where objects are capnp messages.

interface Store {
    # Store a capnp message, returns its content hash
    put @0 (data :Data) -> (hash :Data);
    # Retrieve by hash
    get @1 (hash :Data) -> (data :Data);
    # Check existence
    has @2 (hash :Data) -> (exists :Bool);
    # Delete (if caller has authority — see note below)
    delete @3 (hash :Data) -> ();
}

Note on delete: In a content-addressed store, deleting a hash can break references from other namespaces pointing to the same object. delete on the base Store interface is dangerously broad — a StoreAdmin interface (separate from Store) may be more appropriate, with delete restricted to a GC service that can verify no live references exist. Open Question #3 (GC) should be resolved before implementing delete. The attenuation table below lists Store (full) as “Read, write, delete any object” — in practice, most callers should receive a Store attenuated to put/get/has only.

Content-addressed means:

  • Deduplication is automatic (same content = same hash)
  • Integrity is verifiable (hash the data, compare)
  • References between objects are just hashes embedded in capnp messages
  • No mutable paths — “updating a file” means storing a new version and updating the reference
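These properties can be demonstrated with a minimal in-memory sketch of the Store interface. DefaultHasher stands in for whatever cryptographic hash the real store would use; it shows the dedup mechanics, not collision resistance.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// In-memory sketch of the Store interface: put/get/has keyed by content hash.
/// DefaultHasher is a non-cryptographic stand-in used only for illustration.
struct Store {
    objects: HashMap<u64, Vec<u8>>,
}

impl Store {
    fn new() -> Self {
        Store { objects: HashMap::new() }
    }
    fn put(&mut self, data: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        data.hash(&mut h);
        let hash = h.finish();
        // Same content, same hash: insertion is idempotent (automatic dedup).
        self.objects.entry(hash).or_insert_with(|| data.to_vec());
        hash
    }
    fn get(&self, hash: u64) -> Option<&[u8]> {
        self.objects.get(&hash).map(Vec::as_slice)
    }
    fn has(&self, hash: u64) -> bool {
        self.objects.contains_key(&hash)
    }
}

fn main() {
    let mut store = Store::new();
    let h1 = store.put(b"config-v1");
    let h2 = store.put(b"config-v1"); // duplicate content
    assert_eq!(h1, h2);               // deduplicated automatically
    assert_eq!(store.objects.len(), 1);
    assert_eq!(store.get(h1), Some(&b"config-v1"[..]));

    // "Updating" means storing a new object; the old hash stays valid.
    let h3 = store.put(b"config-v2");
    assert_ne!(h1, h3);
    assert!(store.has(h1) && store.has(h3));
}
```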

Mutable References: Namespaces

A Namespace capability provides mutable name-to-hash mappings on top of the immutable store:

interface Namespace {
    # Resolve a name to a store hash
    resolve @0 (name :Text) -> (hash :Data);
    # Bind a name to a hash (if caller has write authority)
    bind @1 (name :Text, hash :Data) -> ();
    # List names (if caller has list authority)
    list @2 () -> (names :List(Text));
    # Get a sub-namespace (attenuated — restricted to a prefix)
    sub @3 (prefix :Text) -> (ns :Namespace);
}

A Namespace cap scoped to "config/" can only see and modify names under that prefix. This is the analog of a chroot — but structural, not a kernel hack. The sub() method returns a new Namespace cap via IPC cap transfer.
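The prefix attenuation can be sketched in a few lines of Rust. The shared in-memory bindings map and u64 hashes are stand-ins for the store-backed name-to-hash table.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

/// Sketch of Namespace prefix attenuation. The bindings map stands in for the
/// store-backed name→hash table; hashes are u64 for brevity.
#[derive(Clone)]
struct Namespace {
    bindings: Rc<RefCell<HashMap<String, u64>>>,
    prefix: String, // "" for the root cap
}

impl Namespace {
    fn resolve(&self, name: &str) -> Option<u64> {
        self.bindings.borrow().get(&format!("{}{}", self.prefix, name)).copied()
    }
    fn bind(&self, name: &str, hash: u64) {
        self.bindings.borrow_mut().insert(format!("{}{}", self.prefix, name), hash);
    }
    /// Attenuate: the returned cap can only see names under `prefix`.
    fn sub(&self, prefix: &str) -> Namespace {
        Namespace {
            bindings: self.bindings.clone(),
            prefix: format!("{}{}", self.prefix, prefix),
        }
    }
}

fn main() {
    let root = Namespace {
        bindings: Rc::new(RefCell::new(HashMap::new())),
        prefix: String::new(),
    };
    root.bind("config/db", 0xAB);
    root.bind("secrets/key", 0xCD);

    let config = root.sub("config/");
    assert_eq!(config.resolve("db"), Some(0xAB));
    // No upward traversal: names are flat strings, not paths, so ".." is
    // just more characters appended to the prefix.
    assert_eq!(config.resolve("../secrets/key"), None);
}
```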

Future: union composition. The research survey recommends extending Namespace with Plan 9-inspired union semantics — a union(other, mode) method that merges two namespaces with before/after/replace ordering. This adds composability without a global mount table. See research.md §6.

IPC and Capability Transfer

Several storage operations return new capabilities: Directory.open() returns a File, Directory.sub() returns a Directory, Namespace.sub() returns a Namespace. This requires dynamic capability management — the kernel must install new capabilities in a process’s CapTable at runtime as part of IPC.

The Capability Ring

All kernel-userspace interaction goes through a shared-memory ring pair (submission queue + completion queue), inspired by io_uring. SQE opcodes map to capnp-rpc Level 1 message types. The ring is allocated per-process at spawn time and mapped into the process’s address space.

Syscall surface: 2 syscalls. New capabilities, operations, and transfer mechanisms are expressed as new SQE opcodes instead of expanding the syscall ABI.

#   Syscall                               Purpose
1   exit(code)                            Terminate process
2   cap_enter(min_complete, timeout_ns)   Process pending SQEs, then wait until enough CQEs exist or the timeout expires

Writing SQEs is syscall-free, but ordinary capability CALLs make progress through cap_enter. Timer polling handles non-CALL ring work and only CALL targets that explicitly opt into interrupt-context dispatch. cap_enter flushes pending SQEs and can block the process until min_complete completions are available or a finite timeout expires. An indefinite wait uses timeout_ns = u64::MAX; timeout_ns = 0 keeps the call non-blocking. A future SQPOLL-style worker can reintroduce a zero-syscall CALL-completion hot path without running arbitrary capability methods from timer interrupt context.

The ring structs and synchronous CALL dispatch are implemented and working. See capos-config/src/ring.rs for the shared ring structs and kernel/src/cap/ring.rs for kernel-side processing.

Ring Layout

One 4 KiB page per process, mapped into both kernel (HHDM) and user space:

┌─────────────────────────┐  offset 0
│ Ring Header              │  SQ/CQ head, tail, mask, flags
├─────────────────────────┤  offset 128
│ SQE Array (16 × 64B)    │  submission queue entries
├─────────────────────────┤  offset 1152
│ CQE Array (32 × 32B)    │  completion queue entries
└─────────────────────────┘

SQ: userspace owns tail (producer), kernel owns head (consumer)
CQ: kernel owns tail (producer), userspace owns head (consumer)
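The layout arithmetic and index discipline can be sketched as follows. Atomics and memory ordering are omitted for brevity; the real structs live in capos-config/src/ring.rs.

```rust
// Layout constants from the diagram above.
const HEADER_SIZE: usize = 128;
const SQ_ENTRIES: usize = 16; // power of two → mask = entries - 1
const SQE_SIZE: usize = 64;
const CQ_ENTRIES: usize = 32;
const CQE_SIZE: usize = 32;

const SQ_OFFSET: usize = HEADER_SIZE;                       // 128
const CQ_OFFSET: usize = SQ_OFFSET + SQ_ENTRIES * SQE_SIZE; // 1152
const RING_END: usize = CQ_OFFSET + CQ_ENTRIES * CQE_SIZE;  // 2176, fits in 4 KiB

/// Single-producer/single-consumer index math: head and tail are free-running
/// counters masked into the array on access, so "full" and "empty" stay
/// distinguishable without a spare slot.
struct Queue { head: u32, tail: u32, entries: u32 }

impl Queue {
    fn is_empty(&self) -> bool { self.head == self.tail }
    fn is_full(&self) -> bool { self.tail.wrapping_sub(self.head) == self.entries }
    fn push_slot(&mut self) -> Option<usize> {
        if self.is_full() { return None; }
        let slot = (self.tail & (self.entries - 1)) as usize;
        self.tail = self.tail.wrapping_add(1);
        Some(slot)
    }
    fn pop_slot(&mut self) -> Option<usize> {
        if self.is_empty() { return None; }
        let slot = (self.head & (self.entries - 1)) as usize;
        self.head = self.head.wrapping_add(1);
        Some(slot)
    }
}

fn main() {
    assert_eq!(CQ_OFFSET, 1152);
    assert!(RING_END <= 4096);

    let mut sq = Queue { head: 0, tail: 0, entries: SQ_ENTRIES as u32 };
    for _ in 0..SQ_ENTRIES {
        assert!(sq.push_slot().is_some());
    }
    assert!(sq.push_slot().is_none()); // full: producer must wait for consumer
    assert_eq!(sq.pop_slot(), Some(0));
}
```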

SQE Opcodes

Six opcodes handle everything — client calls, server dispatch, capability transfer, pipelining, lifecycle, and timeouts:

Opcode    capnp-rpc analog   Purpose
CALL      Call               Invoke method on a capability
RETURN    Return             Respond to incoming call (server side)
RECV      (implicit)         Wait for incoming calls on Endpoint
RELEASE   Release            Drop a capability reference
FINISH    Finish             Release pipeline answer state
TIMEOUT   (none)             Post a CQE after N nanoseconds (io_uring-inspired)

TIMEOUT is an alternative to the timeout_ns argument on cap_enter: it works with zero-syscall polling (kernel fires the CQE on a timer tick) and composes with LINK/DRAIN for deadline-based chains.

SQE flags: PIPELINE (cap_id is a promise reference), LINK (chain to next SQE), MULTISHOT (keep generating CQEs), DRAIN (barrier).

Promise Pipelining

A CALL SQE can target either a concrete CapId or a PromisedAnswer reference (via the PIPELINE flag + pipeline_dep/pipeline_field fields). The kernel resolves the dependency chain internally:

SQE[0]: CALL dir.open("report.pdf")        → user_data=100
SQE[1]: CALL [PIPELINE: dep=100, field=0].read(0, 4096)  → user_data=101

One cap_enter call. The kernel dispatches SQE[0], extracts the File cap from the result, and dispatches SQE[1] against it — all without returning to userspace between steps.
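The kernel-side resolution step can be sketched as a single pass over submitted SQEs, recording each call's result cap by user_data so later PIPELINE entries chain onto it. The types here are simplified stand-ins, and dispatch is a placeholder for real method dispatch.

```rust
use std::collections::HashMap;

// Simplified SQE: target is either a concrete cap or a promise reference to
// the result of an earlier SQE, identified by that SQE's user_data.
enum Target {
    Cap(u32),
    Pipeline { dep_user_data: u64 },
}

struct Sqe { user_data: u64, target: Target }

/// Sketch of in-kernel pipeline resolution: dispatch SQEs in order, recording
/// the cap each call returns so later PIPELINE entries resolve without
/// returning to userspace between steps.
fn process(
    sqes: &[Sqe],
    dispatch: impl Fn(u32) -> u32, // stand-in: invoke method, return result cap
) -> Result<Vec<(u64, u32)>, &'static str> {
    let mut answers: HashMap<u64, u32> = HashMap::new(); // user_data → result cap
    let mut cqes = Vec::new();
    for sqe in sqes {
        let target_cap = match sqe.target {
            Target::Cap(id) => id,
            Target::Pipeline { dep_user_data } => *answers
                .get(&dep_user_data)
                .ok_or("pipeline dependency not yet resolved")?,
        };
        let result_cap = dispatch(target_cap);
        answers.insert(sqe.user_data, result_cap);
        cqes.push((sqe.user_data, result_cap));
    }
    Ok(cqes)
}

fn main() {
    // dir.open() on cap 2 yields File cap 5; read() then targets cap 5.
    let dispatch = |cap| if cap == 2 { 5 } else { cap };
    let sqes = [
        Sqe { user_data: 100, target: Target::Cap(2) },
        Sqe { user_data: 101, target: Target::Pipeline { dep_user_data: 100 } },
    ];
    let cqes = process(&sqes, dispatch).unwrap();
    assert_eq!(cqes, vec![(100, 5), (101, 5)]);
}
```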

The Endpoint Kernel Object

For cross-process IPC, an Endpoint connects client-side proxy caps to a server’s receive loop:

Client's CapTable                                   Server's CapTable
┌─────────────────┐                                 ┌──────────────────┐
│ cap 2: Proxy     │                                 │ cap 0: Endpoint   │
│   → endpoint ────────── Endpoint ◄──── RECV SQE ──│                  │
│   badge: 42      │     (kernel obj)                │                  │
└─────────────────┘                                 └──────────────────┘

The server posts a RECV SQE (with MULTISHOT flag). Incoming calls appear as CQEs with badge, interface_id, method_id, and a kernel-assigned call_id. The server responds by posting a RETURN SQE referencing the call_id.

interface_id is the transported schema ID for the interface being invoked. It should equal the generated TYPE_ID for that capnp interface. cap_id is the authority-bearing table handle; interface_id is only the protocol tag. The target capability entry owns one public interface; method_id selects a method inside that interface, while cap_id identifies the object being invoked. If the same backing state needs another interface, the transport should mint a separate capability entry for that interface rather than letting one handle accept multiple unrelated interface_id values.

Direct-Switch IPC

When a client’s CALL targets a cap served by a blocked server (waiting on RECV), the kernel marks that server as the direct IPC handoff target so the next context-switch path runs the callee before unrelated round-robin work. The current implementation still uses the ordinary saved-context restore path; small-message register transfer remains a future fastpath after measurement. See research.md §2.

Capability Transfer via Ring

Capabilities travel as sideband arrays (CapTransferDescriptor) alongside capnp message bytes:

  • CALL params: params buffer contains the capnp message bytes followed by xfer_cap_count transfer descriptors packed at addr + len, which must be aligned to CAP_TRANSFER_DESCRIPTOR_ALIGNMENT.
  • RETURN results: server result buffers carry the capnp reply bytes and may carry return transfer descriptors on addr + len; the kernel inserts destination capability records in the caller’s result buffer after the normal result bytes. Count is reported in CQE cap_count and those records are written as CapTransferResult { cap_id, interface_id } values at result_addr + result. The requested result buffer (result_len) must be large enough for both normal reply bytes and all appended cap_count records.

xfer_cap_count > 0 with malformed descriptor metadata (bad mode bits, reserved bits, _reserved0, or misalignment) fails closed as CAP_ERR_INVALID_TRANSFER_DESCRIPTOR. Kernels that have not yet enabled transfer handling should return CAP_ERR_TRANSFER_NOT_SUPPORTED for transfer-bearing SQEs.

The capnp wire format’s WirePointerKind::Other encodes capability indices in messages. The sideband arrays map these indices to actual CapIds. The kernel does not parse capnp messages — it transfers a list of caps alongside the opaque message bytes.

Dynamic Capability Management

Every open(), sub(), or resolve() creates and transfers a new capability at runtime. The kernel’s CapTable insert() and remove() are the primitives. Capabilities flow through RETURN SQE sideband arrays (and through the manifest at boot). No separate cap_grant mechanism needed — authority flow follows the ring’s IPC graph.

The CapTable generation counter handles stale references: when a File cap is closed (slot freed, generation bumps), any cached CapId returns StaleGeneration instead of accidentally hitting a new occupant.
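A minimal sketch of the generation check follows; the field names and slot packing are illustrative, not the kernel's actual layout.

```rust
/// Sketch of a generation-checked capability slot. A CapId carries both the
/// slot index and the generation it was minted under; a stale id whose
/// generation no longer matches is rejected instead of silently hitting the
/// slot's new occupant.
#[derive(Clone, Copy, Debug, PartialEq)]
struct CapId { slot: u32, generation: u32 }

struct Slot { generation: u32, object: Option<&'static str> }

struct CapTable { slots: Vec<Slot> }

#[derive(Debug, PartialEq)]
enum CapError { EmptySlot, StaleGeneration }

impl CapTable {
    fn insert(&mut self, slot: usize, object: &'static str) -> CapId {
        let s = &mut self.slots[slot];
        s.object = Some(object);
        CapId { slot: slot as u32, generation: s.generation }
    }
    fn remove(&mut self, id: CapId) {
        let s = &mut self.slots[id.slot as usize];
        s.object = None;
        s.generation += 1; // bump: all outstanding CapIds for this slot go stale
    }
    fn lookup(&self, id: CapId) -> Result<&'static str, CapError> {
        let s = &self.slots[id.slot as usize];
        if s.generation != id.generation {
            return Err(CapError::StaleGeneration);
        }
        s.object.ok_or(CapError::EmptySlot)
    }
}

fn main() {
    let mut table = CapTable {
        slots: (0..4).map(|_| Slot { generation: 0, object: None }).collect(),
    };
    let file = table.insert(1, "File");
    assert_eq!(table.lookup(file), Ok("File"));

    table.remove(file);                       // close(): slot freed, generation bumps
    let other = table.insert(1, "Directory"); // slot reused by a new occupant
    assert_eq!(table.lookup(file), Err(CapError::StaleGeneration));
    assert_eq!(table.lookup(other), Ok("Directory"));
}
```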

Shared Memory for Bulk Data

Copying file data through capnp Data fields works for metadata and small reads, but is impractical for anything above a few KB. A 1 MB read through a capability CALL copies data four times: device → driver heap → capnp message → kernel buffer → client buffer.

SharedBuffer Capability

A SharedBuffer (also called MemoryObject, listed in ROADMAP.md Stage 6) is a kernel object backed by physical pages that can be mapped into multiple address spaces simultaneously. Zero copies between processes.

interface SharedBuffer {
    # Map into caller's address space (returns virtual address and size)
    map   @0 () -> (addr :UInt64, size :UInt64);
    # Unmap from caller's address space
    unmap @1 () -> ();
    # Size of the buffer
    size  @2 () -> (bytes :UInt64);
}

The kernel creates SharedBuffer objects on request (via a kernel-provided BufferAllocator capability). The pages are reference-counted — the buffer persists as long as any process holds a cap to it.

File I/O with SharedBuffer

File and BlockDevice interfaces support both inline-Data and SharedBuffer modes:

# Small read (< ~4 KB): inline in capnp message
file.read(offset=0, length=256) → {data: [256 bytes]}

# Large read: caller provides SharedBuffer, server fills it
let buf = buf_alloc.create(1048576);   # 1 MB SharedBuffer
file.readBuf(offset=0, buf, length=1048576) → {bytesRead: 1048576}
# Data is now in buf's mapped pages — no copy through kernel

Extended File interface with SharedBuffer support:

interface File {
    read      @0 (offset :UInt64, length :UInt32) -> (data :Data);
    write     @1 (offset :UInt64, data :Data) -> (written :UInt32);
    readBuf   @2 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (bytesRead :UInt32);
    writeBuf  @3 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (written :UInt32);
    stat      @4 () -> (size :UInt64, created :UInt64, modified :UInt64);
    truncate  @5 (length :UInt64) -> ();
    sync      @6 () -> ();
    close     @7 () -> ();
}

The readBuf/writeBuf methods accept a SharedBuffer cap (transferred via IPC). The server maps the buffer, performs DMA or memory copies into it, then returns. The caller reads directly from the mapped pages.

For BlockDevice, the same pattern applies — the driver maps the SharedBuffer, programs DMA descriptors pointing to its physical pages, and the device writes directly into the shared memory.

When to Use Each Mode

Scenario                         Mechanism                                            Why
Reading a 64-byte config value   File.read() inline Data                              Copy overhead negligible
Reading a 10 MB binary           File.readBuf() SharedBuffer                          Avoids 4× copy overhead
FAT directory entry (32 bytes)   BlockDevice.readBlocks() inline                      Small metadata read
Streaming video frames           File.readBuf() + ring of SharedBuffers               Continuous zero-copy
Network packet buffers           SharedBuffer ring between NIC driver and net stack   DMA-capable pages

Attenuation

Storage services mint restricted capabilities using wrapper CapObjects:

Capability                 Authority
Directory (full)           Open, list, mkdir, remove, sub
Directory (read-only)      Open (returns read-only Files), list, sub only
File (full)                Read, write, truncate, sync
File (read-only)           Read and stat only
File (append-only)         Read, stat, write at end only
Store (full)               Read, write, delete any object
Store (read-only)          Get and has only
Namespace (full)           Resolve, bind, list under prefix
Namespace (read-only)      Resolve and list only
Blob (single object)       Read one specific hash
SharedBuffer (read-only)   Map as read-only (page table: R, no W)

An application that only needs to read its config gets a read-only Directory scoped to its config path. It can’t write, can’t see other apps’ directories, can’t access the raw BlockDevice.

Naming Without Paths

Traditional OS: process opens /var/lib/myapp/data.db — a global path.

capOS: process receives a Directory or Namespace cap at spawn time, opens "data.db" within it. The process has no idea where on disk this lives. It can’t traverse upward. There is no global root.

# Traditional: global path namespace
/
├── etc/
│   └── myapp/
│       └── config.toml
├── var/
│   └── lib/
│       └── myapp/
│           └── data.db
└── sbin/
    └── myapp

# capOS: per-process capability set (no global namespace)
Process "myapp" sees:
  "config" → Directory(read-only, scoped to myapp's config files)
  "data"   → Directory(read-write, scoped to myapp's data files)
  "state"  → Namespace(read-write, scoped to myapp's store objects)
  "log"    → Console cap
  "api"    → HttpEndpoint cap

The process doesn’t know or care about the backing storage layout. It just uses the capabilities it was granted.
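A minimal in-process model of that scoped view, purely for illustration (real capabilities are kernel objects, not structs in the client's address space):

```rust
use std::collections::HashMap;

/// Illustrative model of a scoped Directory capability. There is
/// deliberately no parent() or root() method: the holder can only
/// reach names inside the subtree the cap was minted for.
pub struct Directory {
    entries: HashMap<String, Vec<u8>>,
}

impl Directory {
    pub fn new(entries: HashMap<String, Vec<u8>>) -> Self {
        Directory { entries }
    }

    pub fn open(&self, name: &str) -> Option<&[u8]> {
        // Names are plain map keys, not paths: "../etc/passwd" is just
        // a (nonexistent) entry name, not an upward traversal.
        self.entries.get(name).map(|v| v.as_slice())
    }
}
```

Because there is no global root and no parent reference, "escaping" the directory is not an error case to defend against; it simply cannot be expressed.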

Configuration

Build-Time Config (Boot Manifest)

The system manifest is authored at build time. The human-writable source could be any format — TOML, CUE, or even a Makefile target that generates the capnp binary. What matters is that it compiles to a SystemManifest capnp message baked into the ISO.

Example source (TOML, compiled to capnp by a build tool):

[services.virtio-net]
binary = "virtio-net"
restart = "always"
caps = [
    { name = "device_mmio", source = { kernel = "device_mmio" } },
    { name = "interrupt", source = { kernel = "interrupt" } },
    { name = "log", source = { kernel = "console" } },
]
exports = ["nic"]

[services.net-stack]
binary = "net-stack"
restart = "always"
caps = [
    { name = "nic", source = { service = { service = "virtio-net", export = "nic" } } },
    { name = "timer", source = { kernel = "timer" } },
    { name = "log", source = { kernel = "console" } },
]
exports = ["net"]

[services.fat-fs]
binary = "fat-fs"
restart = "always"
caps = [
    { name = "blk", source = { service = { service = "usb-storage", export = "block-device" } } },
    { name = "log", source = { kernel = "console" } },
]
exports = ["root-dir"]

[services.my-app]
binary = "my-app"
restart = "on-failure"
caps = [
    { name = "api", source = { service = { service = "http-service", export = "api" } } },
    { name = "docs", source = { service = { service = "fat-fs", export = "root-dir" } } },
    { name = "data", source = { service = { service = "store", export = "namespace" } } },
    { name = "log", source = { kernel = "console" } },
]

A build tool validates this against the capnp schemas (does virtio-net actually export "nic"? does http-service support endpoint() minting?) and produces the binary manifest.
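The export check in particular is cheap to do statically. A sketch of that validation pass follows; the types and function names are assumptions for illustration, not the real tools/mkmanifest code:

```rust
use std::collections::HashMap;

/// One service entry from the parsed manifest (simplified).
pub struct Service {
    pub exports: Vec<String>,
    /// (cap name, source service, source export)
    pub service_caps: Vec<(String, String, String)>,
}

/// Verify every `source = { service = ... }` reference names an export
/// that the target service actually declares.
pub fn validate_exports(manifest: &HashMap<String, Service>) -> Result<(), String> {
    for (name, svc) in manifest {
        for (cap, src_service, src_export) in &svc.service_caps {
            let target = manifest.get(src_service).ok_or_else(|| {
                format!("{name}: cap '{cap}' references unknown service '{src_service}'")
            })?;
            if !target.exports.iter().any(|e| e == src_export) {
                return Err(format!(
                    "{name}: cap '{cap}' wants export '{src_export}' \
                     which '{src_service}' does not declare"
                ));
            }
        }
    }
    Ok(())
}
```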

Runtime Config (via Store)

Once the store service is running, configuration can be stored there and updated without rebuilding the ISO. The store is just another capability — a config-management service could watch for changes and signal services to reload.

Connection to Network Transparency

If capabilities are the only abstraction, and capnp is the only wire format, then the transport is irrelevant:

  • Local IPC: capnp message copied between address spaces by kernel
  • Local store: capnp message written to block device
  • Remote IPC: capnp message sent over TCP to another machine
  • Remote store: capnp message fetched from a remote store service

A capability reference doesn’t encode where the backing service lives. The kernel (or a proxy) handles routing. This means:

  • A Directory cap could be backed by local FAT or a remote 9P server
  • A Namespace cap could be backed by local storage or a remote store
  • A Fetch cap could route through a local HTTP service or a remote proxy
  • A ProcessSpawner cap could spawn locally or on a remote machine

The system manifest could describe services that run on different machines, and the capability graph spans the network. This is the “network transparency” item in the roadmap — it falls out naturally from the model.
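One way to picture the transport independence: client code is written against the capability interface, and a transport object decides whether an invocation is a local IPC send or a network round trip. The trait and type names below are invented for this sketch:

```rust
/// Transport-agnostic capability invocation: the client sees only
/// "send a capnp-encoded request, get a capnp-encoded reply".
pub trait Transport {
    fn call(&self, request: &[u8]) -> Vec<u8>;
}

/// Stand-in for a local kernel IPC endpoint: the message is simply
/// copied between address spaces (echoed here for the sketch).
pub struct LocalEcho;

impl Transport for LocalEcho {
    fn call(&self, request: &[u8]) -> Vec<u8> {
        request.to_vec()
    }
}

/// Client code is generic over the transport, so the same code works
/// whether the backing service is local or on another machine.
pub fn invoke<T: Transport>(cap: &T, msg: &[u8]) -> Vec<u8> {
    cap.call(msg)
}
```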

Persistence of the Capability Graph

The live capability graph (which process holds which caps) is ephemeral — it exists in kernel memory and is lost on reboot. The system manifest describes the intended graph, and init rebuilds it on each boot.

For true persistence (resume after reboot without re-initializing):

  1. Each service serializes its state to the store before shutdown
  2. On next boot, the manifest includes “restore from store hash X” hints
  3. Services read their saved state from the store and resume

This is application-level persistence, not kernel-level. The kernel doesn’t snapshot the capability graph — services are responsible for their own state. This avoids the complexity of EROS-style transparent persistence while still allowing stateful services.
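A toy content-addressed store shows the save/restore shape. Everything here is illustrative: the hash is FNV-1a only to keep the sketch dependency-free, where a real store would use a cryptographic hash.

```rust
use std::collections::HashMap;

/// Toy content-addressed store: put() returns the hash, get() retrieves.
#[derive(Default)]
pub struct Store {
    objects: HashMap<u64, Vec<u8>>,
}

/// FNV-1a, chosen only to avoid dependencies in this sketch.
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x0100_0000_01b3);
    }
    h
}

impl Store {
    /// Shutdown path: a service serializes its state and stores it;
    /// the returned hash becomes the manifest's "restore from" hint.
    pub fn put(&mut self, state: &[u8]) -> u64 {
        let hash = fnv1a(state);
        self.objects.insert(hash, state.to_vec());
        hash
    }

    /// Next-boot path: the manifest hint names the hash to restore.
    pub fn get(&self, hash: u64) -> Option<&[u8]> {
        self.objects.get(&hash).map(|v| v.as_slice())
    }
}
```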

Phases

Phase 1: Boot Manifest (parallel with Stage 4)

  • Define SystemManifest schema in schema/
  • Build tool (tools/mkmanifest) that compiles system.cue into a capnp-encoded manifest and packs it into the ISO as a Limine module
  • Kernel currently parses the manifest and creates one process per ServiceEntry
  • Transition: kernel passes the manifest to init as bytes or a Manifest capability, without interpreting the child service graph
  • Init becomes a generic manifest executor instead of a demo parser for system-spawn.cue
  • No persistent storage yet — boot image is the only data source

Phase 2: File I/O Interfaces in Schema (parallel with Stage 6)

Depends on: IPC (Stage 6) for cross-process cap transfer.

  • Add BlockDevice, File, Directory, DirEntry, SharedBuffer to schema/capos.capnp
  • Implement kernel Endpoint and RECV/RETURN SQE opcodes
  • Capability transfer in IPC replies (RETURN SQE xfer_caps installs caps in caller’s table)
  • Demo: two-process file server (in-memory File/Directory service + client)

Phase 3: RAM-backed Store (after Phase 2)

Depends on: IPC (Stage 6) for cross-process store access.

  • Implement Store and Namespace as a userspace service
  • Backed by RAM (no disk driver yet, data lost on reboot)
  • Services can store and retrieve capnp objects at runtime
  • Demonstrates the naming model without requiring a block device driver
  • Namespace.sub() returns new caps via IPC cap transfer

Phase 4: BlockDevice Drivers and Filesystem (after virtio infrastructure)

  • virtio-blk driver (userspace, reuses virtqueue infrastructure from networking smoke test)
  • BlockDevice trait implementation
  • FAT filesystem service: wraps BlockDevice, exports Directory/File caps
  • SharedBuffer integration for bulk reads (depends on Stage 6 MemoryObject)
  • Store service uses BlockDevice for persistence
  • System state survives reboot via store + manifest restore hints

Phase 5: Network Store (after networking)

  • Store service can replicate to or fetch from a remote store
  • Capability references transparently span machines
  • Directory cap backed by a remote filesystem (9P-style)

Relationship to Other Proposals

  • Networking proposal — the NIC driver and net stack are services described in the manifest, not hardcoded. The store could be backed by network storage once networking works. A remote Directory cap (9P over capnp) reuses the same File/Directory interfaces.
  • Service architecture proposal — the manifest replaces code-as-config for init. ProcessSpawner, supervision, and cap export work as described there, but driven by manifest data instead of compiled Rust code. IPC Endpoints are the mechanism for service export.
  • Capability model — IPC cap transfer (Endpoint + RETURN SQE) is the mechanism that makes open() and resolve() work. SharedBuffer is the bulk data path that makes file I/O practical. Both are tracked in ROADMAP.md Stage 6.

Open Questions

  1. Manifest validation. How much can the build tool verify statically? Cap export names depend on runtime behavior of services. Should services declare their exports in their own metadata (like a package manifest)?

  2. Schema evolution. When a service’s capnp interface changes, stored objects referencing the old schema need migration. Cap’n Proto has backwards-compatible schema evolution, but breaking changes need a story.

  3. Garbage collection. Content-addressed store accumulates unreferenced objects. Who GCs? A separate service with Store read + delete authority? Reference counting in the namespace layer?

  4. Large objects. Storing multi-megabyte binaries as single capnp Data fields is wasteful (capnp allocates contiguously). SharedBuffer partially addresses this for I/O, but the Store’s put/get interface still takes Data. Options: chunked storage (Merkle tree of hashes), a streaming Blob interface, or SharedBuffer-aware Store methods.

  5. Trust model for the manifest. The boot manifest has full authority to define the system. Who signs it? How do you prevent a tampered ISO from granting excessive caps? Secure boot integration?

  6. File locking and concurrent access. Multiple processes opening the same file through the same filesystem service need coordination. Options: mandatory locking in the filesystem service (rejects conflicting opens), advisory locking via a separate Lock capability, or single-writer enforcement at the Directory level (open with exclusive flag).

  7. RETURN+RECV atomicity. When a server posts a RETURN SQE followed by a RECV SQE, there must be no window where a client call can arrive but the server isn’t listening. SQE LINK chaining (RETURN → RECV) should provide this atomicity — the kernel processes both SQEs as a unit.