Proposal: Storage, Naming, and Persistence
What replaces the filesystem in a capability OS where Cap’n Proto is the universal wire format.
The Problem with Filesystems
In Unix, the filesystem is the universal namespace. Everything is a path:
/dev/sda, /etc/config, /proc/self/fd/3, /run/dbus/system_bus_socket.
Paths are ambient authority — any process can open /etc/passwd if the
permission bits allow. The filesystem conflates naming, access control,
persistence, and device abstraction into one mechanism.
capOS has capabilities instead of paths. Access control is structural (you can only use what you were granted), not advisory (permission bits checked at open time). This means:
- No global namespace needed — each process sees only its granted caps
- No path-based access control — the cap IS the access
- No distinction between “file”, “device”, “socket” — everything is a typed capability interface
A traditional VFS would reintroduce ambient authority through the back door. Instead, capOS needs a storage and naming model native to capabilities and Cap’n Proto.
Core Insight: Cap’n Proto Everywhere
Cap’n Proto is already used in capOS for:
- Interface definitions: .capnp schemas define capability contracts
- IPC messages: capability invocations are capnp messages
- Serialization: capnp wire format crosses process boundaries
If we extend this to storage, then:
- Stored objects are capnp messages
- Configuration is capnp structs
- Binary images are capnp-wrapped blobs
- The boot manifest is a capnp message describing the initial capability graph
No format conversion anywhere. The same tools (schema compiler, serializer, validator) work for IPC, storage, config, and network transfer.
Architecture
Three Layers
Target architecture after the manifest executor and process-spawner work:
Boot Image (read-only, baked into ISO)
│
│ capnp-encoded manifest + binaries
│
v
Kernel (creates initial caps from manifest)
│
│ grants caps to init
│
v
Init (builds live capability graph)
│
├──> Filesystem services (FAT, ext4 — wrap BlockDevice as Directory/File)
│
├──> Store service (capability-native content-addressed storage)
│ backed by: virtio-blk, RAM, or network
│
└──> All other services (receive Directory, Store, or Namespace caps)
Layer 1: Boot Image
The boot image (ISO/disk) contains a capnp-encoded system manifest loaded as a Limine module alongside the kernel. The manifest describes:
struct SystemManifest {
    # Manifest schema version, validated before other fields
    schemaVersion @0 :UInt32;
    # Binaries available at boot, keyed by name
    binaries @1 :List(NamedBlob);
    # Init's config blob: first-process metadata plus service graph
    initConfig @2 :CueValue;
    # Kernel boot parameters
    kernelParams @3 :SystemConfig;
}
struct NamedBlob {
    name @0 :Text;
    data @1 :Data;
}
struct CueValue {
    union {
        null @0 :Void;
        boolean @1 :Bool;
        intValue @2 :Int64;
        uintValue @3 :UInt64;
        text @4 :Text;
        bytes @5 :Data;
        list @6 :List(CueValue);
        fields @7 :List(CueField);
    }
}
struct CueField {
    name @0 :Text;
    value @1 :CueValue;
}
Capability source identity is already structured in the bootstrap manifest, so source selection does not depend on parsing authority strings:
{
name: "client"
expectedInterfaceId: 0xacf0c15a7b2e0041
source: service: {
service: "endpoint-server"
export: "client"
}
}
Kernel and service source objects inside initConfig select the authority to grant. The
expectedInterfaceId field carries the generated Cap’n Proto interface
TYPE_ID and only checks that the granted object speaks the expected schema.
It cannot replace source identity: many different objects may expose the same
interface while representing different authority.
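The split between source identity and interface checking can be sketched in Rust. This is a minimal illustration, not the capOS resolver: the Cap record, the Exports map, and resolve_service_cap are hypothetical names, and the object_id field merely stands in for "which object this is".

```rust
use std::collections::HashMap;

/// Hypothetical capability record, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct Cap {
    /// Stands in for the specific object (and so the authority) behind the cap.
    pub object_id: u32,
    /// Generated Cap'n Proto interface TYPE_ID the object speaks.
    pub interface_id: u64,
}

#[derive(Debug, PartialEq)]
pub enum ResolveError {
    UnknownSource,
    InterfaceMismatch { expected: u64, actual: u64 },
}

/// Exported caps keyed by the structured (service, export) source pair.
pub type Exports = HashMap<(String, String), Cap>;

/// Select the cap by source identity first, then verify the schema it speaks.
pub fn resolve_service_cap(
    exports: &Exports,
    service: &str,
    export: &str,
    expected_interface_id: u64,
) -> Result<Cap, ResolveError> {
    let key = (service.to_string(), export.to_string());
    let cap = exports.get(&key).ok_or(ResolveError::UnknownSource)?;
    if cap.interface_id != expected_interface_id {
        return Err(ResolveError::InterfaceMismatch {
            expected: expected_interface_id,
            actual: cap.interface_id,
        });
    }
    Ok(cap.clone())
}
```

Two exports that share one interface ID still resolve to different objects, because the lookup key is the source pair; the interface ID only guards against wiring a cap of the wrong type.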
The build system (Makefile) generates this manifest from a human-authored
description and packs it into the ISO as manifest.bin. Current code embeds
every SystemManifest.binaries entry into that manifest as NamedBlob data,
including the release-built init and smoke-demo ELFs. The kernel now boots only
initConfig.init; focused init-executor manifests expose the manifest to the
separate init binary as a read-only BootPackage capability, while default
shell-led manifests boot capos-shell directly without a BootPackage executor.
Remaining cleanup is to narrow the long-term boot package shape after the
single-init split.
Using a CueValue tree instead of AnyPointer keeps the manifest directly
decodable in no_std userspace without depending on Cap’n Proto reflection.
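A minimal Rust sketch of walking such a tree without reflection, assuming a hypothetical in-memory CueValue enum that mirrors the schema union above. The enum, its accessor names (field, as_str, items), and service_binaries are illustrative and are not the capOS decoder; the sketch uses std collections for brevity, where a no_std build would use alloc.

```rust
/// Hypothetical in-memory mirror of the CueValue schema union.
#[derive(Debug, Clone, PartialEq)]
pub enum CueValue {
    Null,
    Boolean(bool),
    Int(i64),
    Uint(u64),
    Text(String),
    Bytes(Vec<u8>),
    List(Vec<CueValue>),
    /// Ordered (name, value) pairs, mirroring List(CueField).
    Fields(Vec<(String, CueValue)>),
}

impl CueValue {
    /// Look up a named field; None if this is not a Fields value or the name is absent.
    pub fn field(&self, name: &str) -> Option<&CueValue> {
        match self {
            CueValue::Fields(fields) => fields.iter().find(|(n, _)| n == name).map(|(_, v)| v),
            _ => None,
        }
    }

    /// Borrow the text payload, if this is a Text value.
    pub fn as_str(&self) -> Option<&str> {
        match self {
            CueValue::Text(s) => Some(s.as_str()),
            _ => None,
        }
    }

    /// Borrow the list items; any non-list value yields an empty slice.
    pub fn items(&self) -> &[CueValue] {
        match self {
            CueValue::List(items) => items.as_slice(),
            _ => &[],
        }
    }
}

/// Collect the binary name of every entry under the services field.
pub fn service_binaries(config: &CueValue) -> Vec<&str> {
    config
        .field("services")
        .into_iter()
        .flat_map(|services| services.items())
        .filter_map(|entry| entry.field("binary")?.as_str())
        .collect()
}
```

Every access is a plain pattern match on a closed set of variants, which is the property the CueValue tree buys over AnyPointer.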
Transitional Schema Note
ServiceEntry, CapSource::Service, and ServiceEntry.exports are no longer
kernel schema fields. ProcessSpawner, copy/move cap transfer, focused
init-owned generic manifest execution, the default standalone-init service
graph, focused shell-led login smokes, and the 15.4 initConfig schema split
are implemented. The current boot manifest shape is:
struct SystemManifest {
    # Manifest schema version, validated before other fields
    schemaVersion @0 :UInt32;
    # Binaries available at boot, keyed by name
    binaries @1 :List(NamedBlob);
    # Init's config blob (replaces the service graph)
    initConfig @2 :CueValue;
    # Kernel boot parameters (serial policy, shell MOTD, feature flags)
    kernelParams @3 :SystemConfig;
}
ServiceEntry / CapRef disappeared from the schema and became plain CUE
fields inside initConfig.services. Init reads them at runtime and calls
ProcessSpawner directly. validate_manifest_graph,
validate_bootstrap_cap_sources, and the remaining transitional service-graph
schema are no longer kernel bootstrap checks. They remain in capos-config for
mkmanifest and the focused init executor while that executor still accepts the
transitional service graph. Kernel bootstrap already uses a first-service
cap-table builder rather than the old multi-service resolver. See
docs/proposals/service-architecture-proposal.md — “Legacy Manifest Fields
After Stage 6” for the deprecation plan.
During the current transition, initConfig.init is still per-manifest launch
metadata: it selects the single boot process binary and the kernel-sourced caps
for that process. initConfig.services, cross-service cap sources, exports,
and restart policy are init-owned configuration for focused executor manifests.
Focused harnesses that boot a demo as init keep using that first-process cap
bundle until those smokes are migrated behind a fixed generic init.
Layer 2: Kernel Bootstrap
Target design for the kernel’s boot role:
- Parse the system manifest (read-only capnp message from Limine module).
- Hash the embedded binaries for optional measured-boot attestation.
- Create kernel-provided capabilities: Console, Timer, DeviceManager, ProcessSpawner, FrameAllocator, VirtualMemory (per-process), and a read-only BootPackage cap exposing SystemManifest.binaries and initConfig.
- Spawn init (exactly one userspace process) with that cap bundle.
Current boot has reached the single-init split and the initConfig schema
split. system.cue puts the standalone init binary in initConfig.init for
the default service-graph process; init reads BootPackage and starts the
shell, remote-session CapSet gateway, and resident services from
initConfig.services.
Focused shell-led manifests such as system-smoke.cue still put
capos-shell in initConfig.init for narrow login proofs. Focused
init-executor manifests such as system-spawn.cue also put the separate
init binary in initConfig.init; that binary reads BootPackage and spawns
the focused demo graph from initConfig.services through ProcessSpawner.
The unused kernel resolver has been retired. The remaining cleanup is replacing
per-manifest init bundles with a fixed generic-init bootstrap ABI.
Layer 3: Init and the Live Capability Graph
Target init reads initConfig from the BootPackage cap and executes it:
fn main(caps: CapSet) {
    let spawner = caps.get::<ProcessSpawner>("spawner");
    let boot = caps.get::<BootPackage>("boot");
    let config = boot.init_config()?; // CueValue
    // Handles of services started so far, keyed by name, so later entries can
    // resolve caps exported by earlier ones (the map type is illustrative)
    let mut running_services = BTreeMap::new();
    // Walk service entries from the config and spawn in dependency order
    for entry in config.field("services")?.iter()? {
        let binary = boot.binary(entry.field("binary")?.as_str()?)?;
        let granted = resolve_caps(entry.field("caps")?, &running_services, &caps);
        let handle = spawner.spawn(binary, granted, entry.field("restart")?.into())?;
        running_services.insert(entry.field("name")?.as_str()?.into(), handle);
    }
    supervisor_loop(&running_services);
}
In this target model, init is a generic manifest executor rather than a
hardcoded service graph. The system topology is defined in the boot
package’s initConfig, not in init’s source code. Changing what services
run means rebuilding the boot image with a different config blob, not
recompiling init. Manifest graph resolution stops being a kernel concern.
The current transition uses initConfig.services as the service graph; init
reads the BootPackage manifest, validates a metadata-only
ManifestBootstrapPlan, resolves kernel and service cap sources, records
exported caps, spawns children in manifest order, and waits for their
ProcessHandles.
Two Storage Models
capOS supports two complementary storage models, both exposed as typed capabilities:
Filesystem Capabilities (Directory, File)
For accessing traditional block-based filesystems (FAT, ext4, ISO9660) and
for POSIX compatibility. A filesystem service wraps a BlockDevice and
exports Directory/File capabilities.
BlockDevice (raw sectors)
│
└──> Filesystem service (FAT, ext4, ...)
│
├──> Directory caps (namespace over files)
└──> File caps (read/write byte streams)
This model maps naturally to USB flash drives, NVMe partitions, and
network-mounted filesystems. The open() and sub() operations return new
capabilities via IPC cap transfer (see “IPC and Capability Transfer” below).
Capability-Native Store (Store, Namespace)
For capOS-native data: configuration, service state, content-addressed object
storage. A store service wraps a BlockDevice and exports Store/Namespace
capabilities.
BlockDevice (raw sectors)
│
└──> Store service
│
├──> Store cap (content-addressed put/get/list inventory)
└──> Namespace caps (mutable name→hash mappings)
Content-addressing provides automatic deduplication, verifiable integrity,
and immutable references. Store.list returns the live inventory of content
hashes in that Store, so holders that need crash/reboot recovery can rediscover
stored content without a separate mutable root pointer. Namespaces add mutable
bindings on top when callers need stable names rather than inventory scans.
Bridging the Two Models
The models are composable. An adapter service can bridge between them:
- FsStore adapter: exposes a Directory tree as a content-addressed Store (hash each file’s contents, directory listings become capnp-encoded objects)
- StoreFS adapter: exposes Store/Namespace as a Directory tree (each name maps to a File whose contents are the stored object)
- Import/export: a utility service reads files from a Directory and stores them in a Store, or materializes Store objects as files in a Directory
In every case the adapter is a userspace service holding caps to both subsystems. No kernel mechanism is needed, just capability composition.
File I/O Interfaces
Directory, File, Store, and Namespace caps may be scoped to a user session, guest profile, anonymous request, or service identity, but the cap remains the authority. POSIX ownership metadata is compatibility data inside these services, not a system-wide authorization channel. See User Identity and Policy.
BlockDevice
Raw sector access, served by device drivers (virtio-blk, NVMe, USB mass
storage). The driver receives hardware capabilities (MMIO, IRQ,
FrameAllocator for DMA) and exports a BlockDevice cap.
interface BlockDevice {
    readBlocks @0 (startLba :UInt64, count :UInt32) -> (data :Data);
    writeBlocks @1 (startLba :UInt64, count :UInt32, data :Data) -> ();
    info @2 () -> (blockSize :UInt32, blockCount :UInt64, readOnly :Bool);
    flush @3 () -> ();
}
For bulk transfers, readBlocks/writeBlocks accept a SharedBuffer
capability instead of inline Data (see “Shared Memory for Bulk Data”
below). The inline-Data variants work for metadata reads and small
operations; the SharedBuffer variants avoid copies for large I/O.
File
Byte-stream access to a single file. Served by filesystem services. Created
dynamically when a client calls Directory.open() — the filesystem service
creates a File CapObject for the opened file and transfers it to the
caller via IPC cap transfer.
interface File {
    read @0 (offset :UInt64, length :UInt32) -> (data :Data);
    write @1 (offset :UInt64, data :Data) -> (written :UInt32);
    stat @2 () -> (size :UInt64, created :UInt64, modified :UInt64);
    truncate @3 (length :UInt64) -> ();
    sync @4 () -> ();
    close @5 () -> ();
}
close releases the server-side state for this file (open cluster chain
cache, dirty buffers). The kernel-side CapTable entry is removed by the system
transport via CAP_OP_RELEASE when the local holder releases it; capos-rt
owned handles queue local releases on final drop and expose explicit release
flushing for ordinary userspace. CapabilityManager is
management-only (list(), later grant()); it does not expose a drop()
method because ordinary handle lifetime belongs to the transport, not to an
application call on the same table that dispatches it.
Attenuation: a read-only File wraps the original and rejects write,
truncate, sync calls. An append-only File rejects write at offsets
other than the current size.
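Attenuation by wrapping can be sketched in Rust as follows. This is a simplified illustration under stated assumptions: the File trait here models only read, write, and size (truncate, sync, and close are omitted), and MemFile, ReadOnly, AppendOnly, and FileError are hypothetical names rather than capOS types.

```rust
#[derive(Debug, PartialEq)]
pub enum FileError {
    ReadOnly,
    AppendOnly,
}

/// Reduced stand-in for the File interface (illustrative subset).
pub trait File {
    fn read(&self, offset: u64, length: u32) -> Vec<u8>;
    fn write(&mut self, offset: u64, data: &[u8]) -> Result<u32, FileError>;
    fn size(&self) -> u64;
}

/// Unattenuated in-memory file used as the wrapped original.
pub struct MemFile {
    bytes: Vec<u8>,
}

impl MemFile {
    pub fn new(bytes: &[u8]) -> Self {
        MemFile { bytes: bytes.to_vec() }
    }
}

impl File for MemFile {
    fn read(&self, offset: u64, length: u32) -> Vec<u8> {
        let start = (offset as usize).min(self.bytes.len());
        let end = start.saturating_add(length as usize).min(self.bytes.len());
        self.bytes[start..end].to_vec()
    }
    fn write(&mut self, offset: u64, data: &[u8]) -> Result<u32, FileError> {
        let start = offset as usize;
        let end = start + data.len();
        if self.bytes.len() < end {
            self.bytes.resize(end, 0);
        }
        self.bytes[start..end].copy_from_slice(data);
        Ok(data.len() as u32)
    }
    fn size(&self) -> u64 {
        self.bytes.len() as u64
    }
}

/// Read-only attenuation: forwards reads, rejects every write.
pub struct ReadOnly<F: File>(pub F);

impl<F: File> File for ReadOnly<F> {
    fn read(&self, offset: u64, length: u32) -> Vec<u8> {
        self.0.read(offset, length)
    }
    fn write(&mut self, _offset: u64, _data: &[u8]) -> Result<u32, FileError> {
        Err(FileError::ReadOnly)
    }
    fn size(&self) -> u64 {
        self.0.size()
    }
}

/// Append-only attenuation: a write is accepted only at the current size.
pub struct AppendOnly<F: File>(pub F);

impl<F: File> File for AppendOnly<F> {
    fn read(&self, offset: u64, length: u32) -> Vec<u8> {
        self.0.read(offset, length)
    }
    fn write(&mut self, offset: u64, data: &[u8]) -> Result<u32, FileError> {
        if offset != self.0.size() {
            return Err(FileError::AppendOnly);
        }
        self.0.write(offset, data)
    }
    fn size(&self) -> u64 {
        self.0.size()
    }
}
```

The holder of a wrapper has no path to the inner object, so the restriction is enforced by construction rather than by a permission check.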
Directory
Namespace over files on a filesystem. Served by filesystem services.
open() and sub() return new capabilities via IPC cap transfer.
interface Directory {
    open @0 (name :Text, flags :UInt32) -> (file :File);
    list @1 () -> (entries :List(DirEntry));
    mkdir @2 (name :Text) -> (dir :Directory);
    remove @3 (name :Text) -> ();
    sub @4 (name :Text) -> (dir :Directory);
    create @5 (name :Text) -> ();
    rename @6 (from :Text, to :Text) -> ();
}
struct DirEntry {
    name @0 :Text;
    size @1 :UInt64;
    isDir @2 :Bool;
}
sub() returns a Directory scoped to a subdirectory — the analog of chroot.
The caller cannot traverse upward or see the parent directory. open() with
create flags creates a new file if it doesn’t exist.
The flags field in open() is a bitmask: CREATE = 1, TRUNCATE = 2,
APPEND = 4. No READ/WRITE flags — those are determined by the
Directory cap’s attenuation (a read-only Directory returns read-only Files).
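A small Rust sketch of decoding that bitmask, using the three flag values given above. The OpenFlags struct and parse_open_flags are hypothetical helpers, and rejecting unknown bits is an assumption of this sketch; the text does not say how undefined bits are treated.

```rust
pub const CREATE: u32 = 1;
pub const TRUNCATE: u32 = 2;
pub const APPEND: u32 = 4;

/// All bits the text defines; anything else is treated as invalid here.
const KNOWN_FLAGS: u32 = CREATE | TRUNCATE | APPEND;

#[derive(Debug, PartialEq)]
pub struct OpenFlags {
    pub create: bool,
    pub truncate: bool,
    pub append: bool,
}

/// Decode the open() bitmask. Returns None for undefined bits
/// (fail-closed handling is an assumption of this sketch).
pub fn parse_open_flags(flags: u32) -> Option<OpenFlags> {
    if flags & !KNOWN_FLAGS != 0 {
        return None;
    }
    Some(OpenFlags {
        create: flags & CREATE != 0,
        truncate: flags & TRUNCATE != 0,
        append: flags & APPEND != 0,
    })
}
```

A plain read-style open passes 0; there is deliberately no bit that could request write access the Directory cap does not already carry.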
Writable Directory Mutations and the Single-Writer Policy
create @5 makes a new empty file and rename @6 renames an entry within the
same parent. Both have additive ordinals so the read-only Directory
implementations stay wire-compatible — they simply reject the mutating methods
(mkdir/remove/sub/create/rename) fail-closed, the way a read-only
File rejects write. Unlike open with CREATE, create fails closed if the
name already exists; rename fails closed if the source is absent or the
destination already exists, and does not support cross-directory moves.
The first writable filesystem service adopts a fail-closed single-writer
policy: a writable filesystem tree admits one writer at a time. The first
granted cap to perform a mutation claims the writer slot; a mutation through any
other concurrently granted cap fails closed with a typed Failed exception
("writable filesystem rejects a second concurrent writer (single-writer policy)") rather than racing. There is no lease/release lifecycle — the first
writer keeps the slot — and list/sub reads are allowed for any holder. This
deliberately closes the milestone’s concurrent-writer-policy decision without
expanding scope to advisory locks, lock leases, or multi-writer coordination
(see Open Question 6). The implementation (kernel/src/cap/writable_fs.rs, proof
make run-storage-writable) is now disk-backed: it mounts a CAPOSWF1
sub-volume (a flat node-record array with parent pointers plus a bump-allocated
data region) over the kernel-owned virtio-blk driver, keeps the RAM tree as the
working copy, and write-through-commits every directory/file mutation in the
order data sector → node-record sector → superblock (the ordering commit point),
mirroring the disk-backed Store. The persistent Store CAPOSST1 sub-volume
co-locates on the same disk image (at LBA 0; the filesystem superblock sits at a
fixed higher LBA), so filesystem mutations and store object writes/deletes
survive a reboot together — make run-storage-writable boots QEMU twice against
one combined image and phase 2 verifies every surviving name, size, content,
directory entry, and store object plus the deleted object’s absence.
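The writer-slot rule described above can be sketched in a few lines of Rust. WriterSlot, FsError, and the numeric CapId are hypothetical names for illustration; the real implementation lives in kernel/src/cap/writable_fs.rs and may be structured differently.

```rust
pub type CapId = u32;

#[derive(Debug, PartialEq)]
pub enum FsError {
    /// A different cap already holds the filesystem-wide writer slot.
    SecondWriter,
}

/// Filesystem-wide writer slot: claimed by the first mutating cap, never released.
#[derive(Default)]
pub struct WriterSlot {
    owner: Option<CapId>,
}

impl WriterSlot {
    /// Call before applying any mutation. Reads never call this.
    pub fn claim(&mut self, cap: CapId) -> Result<(), FsError> {
        match self.owner {
            None => {
                self.owner = Some(cap);
                Ok(())
            }
            Some(owner) if owner == cap => Ok(()),
            Some(_) => Err(FsError::SecondWriter),
        }
    }
}
```

Because there is no release path, the check is a single comparison and cannot race with a lease expiring.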
Unclean-shutdown recovery is proven by make run-storage-writable-recovery. A
slot becomes live on the next mount only once the superblock’s bumped
node_count is observed, so a forced poweroff in the window between a node
record’s durable write and that commit leaves an orphan slot the next mount
ignores: the interrupted allocation is atomically absent, never a torn or
half-live entry. The proof builds the kernel with the proof-only
storage_writable_recovery feature, which arms an induced forced poweroff in
exactly that window (recovery_crash_after_record); pass 1 commits durable
mutations and a Store survivor and then triggers the window (the harness
kill -9s QEMU after the kernel marker), and pass 2 re-mounts and verifies
recovery to a consistent tree with the committed state intact, the interrupted
allocation absent, no torn record, and a usable post-recovery write. The proof
is bounded to that single record-vs-commit window under host-page-cache
durability (the virtio driver negotiates no VIRTIO_BLK_F_FLUSH, and a
kill -9 preserves the host page cache); it proves the superblock-commit
ordering invariant, not a general media crash-consistency guarantee against
host power loss or a lost write-back cache. The co-located CAPOSST1 Store
now has bounded tombstone reclamation through make run-storage-persist; this
does not add a new media power-loss guarantee or reclaim writable-file extents.
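The superblock-commit invariant can be modeled in Rust as follows. This is a toy model, not the CAPOSWF1 format: Disk, NodeRecord, and the method names are hypothetical, the data-sector write is omitted, and a Vec stands in for the node-record sectors.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct NodeRecord {
    pub name: String,
}

/// Toy disk image: a node-record array plus the superblock's node_count.
#[derive(Default)]
pub struct Disk {
    records: Vec<NodeRecord>, // record sectors, written before the commit
    node_count: usize,        // superblock field; bumping it is the commit point
}

impl Disk {
    /// Step 1: write the record into the next free slot. Not live yet.
    /// Any orphan left in that slot by an earlier crash is overwritten.
    pub fn write_record(&mut self, record: NodeRecord) {
        self.records.truncate(self.node_count);
        self.records.push(record);
    }

    /// Step 2: bump node_count so the slot written in step 1 becomes live.
    pub fn commit(&mut self) {
        self.node_count = self.records.len();
    }

    /// Mount view: only slots below node_count are live; orphans are ignored.
    pub fn mount(&self) -> &[NodeRecord] {
        &self.records[..self.node_count]
    }
}
```

A crash between write_record and commit leaves the record on disk but outside the live range, which is the "atomically absent" outcome the recovery proof checks.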
Writable File content paths layer onto the same tree. open with the
CREATE/TRUNCATE/APPEND flags (or a write through the returned File)
claims the same filesystem-wide writer slot, so file writes obey the single
writer policy alongside directory mutations; a plain (flags == 0) open and the
read/stat methods are reads allowed for any holder. write @1 overwrites or
extends at the supplied offset, zero-filling any gap; a handle opened APPEND
lands every write at end-of-file regardless of the offset argument. truncate @3
shrinks (discards the tail) or extends (zero-fills) the file, and close @5
releases only that handle — the file survives in the directory until
Directory.remove, which marks the file node so any outstanding File cap fails
closed. File content is bounded by MAX_FILE_BYTES (64 KiB) and persists to a
bump-allocated disk extent on each mutation; a rewrite that outgrows the current
extent allocates a fresh one and leaks the old (file-extent compaction deferred).
Because each write/truncate already wrote through the block device (the virtio
driver negotiates no VIRTIO_BLK_F_FLUSH, so there is no separate media barrier
to issue), sync @4 succeeds as an honest write-side no-op (a read-only File
still rejects it). Crash consistency rests on the superblock-commit ordering
rather than a media barrier: an interrupted allocation is atomically absent on
remount (proven by make run-storage-writable-recovery, above). A post-write
media-durability flush against a write-back cache (for host power loss, not the
guest-side forced poweroff that proof exercises) remains future hardening, not
claimed here.
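The write and truncate rules above can be sketched in Rust on an in-memory buffer. FileContent and WriteError are hypothetical names, and returning an error when a write would exceed MAX_FILE_BYTES is an assumption of this sketch; the text only states that content is bounded.

```rust
pub const MAX_FILE_BYTES: usize = 64 * 1024;

#[derive(Debug, PartialEq)]
pub enum WriteError {
    TooLarge,
}

#[derive(Default)]
pub struct FileContent {
    bytes: Vec<u8>,
}

impl FileContent {
    /// Overwrite or extend at offset, zero-filling any gap.
    /// With append set, the offset argument is ignored and data lands at end-of-file.
    pub fn write(&mut self, offset: usize, data: &[u8], append: bool) -> Result<usize, WriteError> {
        let start = if append { self.bytes.len() } else { offset };
        let end = start.checked_add(data.len()).ok_or(WriteError::TooLarge)?;
        if end > MAX_FILE_BYTES {
            return Err(WriteError::TooLarge);
        }
        if end > self.bytes.len() {
            self.bytes.resize(end, 0); // extends the file; any gap is zero-filled
        }
        self.bytes[start..end].copy_from_slice(data);
        Ok(data.len())
    }

    /// Shrink (discard the tail) or extend (zero-fill) to length.
    pub fn truncate(&mut self, length: usize) -> Result<(), WriteError> {
        if length > MAX_FILE_BYTES {
            return Err(WriteError::TooLarge);
        }
        self.bytes.resize(length, 0);
        Ok(())
    }

    pub fn bytes(&self) -> &[u8] {
        &self.bytes
    }
}
```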
Syscall Trace: Reading a File from a FAT USB Drive
Four userspace processes: App, FAT service, USB mass storage, xHCI driver.
With promise pipelining (one submission):
Cap’n Proto promise pipelining lets the App chain dependent calls without waiting for intermediate results. The App submits a single pipelined request: “open this file, then read from the result”:
# Single pipelined submission (SQEs with PIPELINE flag):
# call 0: dir.open("report.pdf") → answer_id=200, user_data=100
# call 1: answer 200 result_cap[0].read(offset=0, len=4096)
cap_submit([
{cap=2, method=OPEN, answer=200, user_data=100, params={"report.pdf", flags=0}},
{cap=PIPELINE(answer=200, result_cap=0), method=READ, user_data=101, params={offset:0, length:4096}},
])
→ kernel routes call 0 to FAT service via Endpoint
→ FAT service reads directory entry from BlockDevice
→ FAT service creates FileCapObject, replies with File cap as result cap 0
→ kernel sees pipelined call 1 targeting the File cap from call 0
→ kernel dispatches call 1 to the same FAT service (or direct-invokes
the new File CapObject if it's a local endpoint)
→ FAT service maps offset → cluster chain → LBA
→ FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
→ USB mass storage → xHCI → hardware → back up
← completion: {data: [4096 bytes]}, File cap installed as cap_id=5
One app-to-kernel transition. The kernel resolves the pipeline dependency
internally through the sideband CapTransferResult record at index 0; it does
not inspect the Cap’n Proto result payload. The App never needs a userspace
round trip for the intermediate File cap, though the cap is installed and usable
afterward.
This is a core Cap’n Proto feature: by expressing “call method on the
not-yet-resolved result of another call,” the client avoids a round-trip
for each link in the chain. For deeper chains (e.g., dir.sub("a").sub("b").open("file").read(0, 4096)), the savings compound: one submission instead
of four sequential syscalls.
The capability-ring version should follow the Cap’n Proto/CapTP prior-art shape captured in “Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web” and “Spritely, OCapN, and CapTP”: pipelined targets live in answer/result-cap namespaces, not in caller-selected global ids; result-cap metadata stays outside the Cap’n Proto payload; broken answers propagate failure to dependent calls; and answer slots, queued dependent calls, queued bytes, and remote references are charged to bounded resource ledgers. This is design grounding, not an OCapN or Cap’n Web wire-compatibility target.
Without pipelining (two sequential ring submissions):
Without promise pipelining, the App submits two separate CALL SQEs via the ring, blocking on each completion before submitting the next:
# 1. Open file (App holds Directory cap, cap_id=2)
# App writes CALL SQE: {cap=2, method=OPEN, params={"report.pdf", flags=0}}
cap_enter(min_complete=1, timeout=MAX)
→ kernel routes CALL to FAT service via Endpoint
→ FAT service reads directory entry from BlockDevice
→ FAT service creates FileCapObject for this file
→ FAT service posts RETURN SQE with [FileCapObject] in xfer_caps
→ kernel installs File cap in App's table → cap_id=5
← App reads CQE: result={file: cap_index=0}, new_caps=[5]
# 2. Read 4096 bytes from offset 0
# App writes CALL SQE: {cap=5, method=READ, params={offset:0, length:4096}}
cap_enter(min_complete=1, timeout=MAX)
→ kernel routes CALL to FAT service
→ FAT service maps offset → cluster chain → LBA
→ FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
→ kernel routes to USB mass storage
→ mass storage submits CALL SQE: {cap=usb_cap, method=BULK_TRANSFER, params={scsi_cmd}}
→ kernel routes to xHCI driver
→ xHCI programs TRBs, waits for interrupt
← returns raw sector data
← returns sector data
← FAT service extracts file bytes, posts RETURN SQE with {data: [4096 bytes]}
This works but costs two round-trips where pipelining needs one. The synchronous path is useful for simple cases and bootstrapping; pipelining is the intended steady-state model.
In both cases, the intermediate IPC hops (FAT → USB mass storage → xHCI) are invisible to the App.
Capability-Native Store
The Store Capability
Once the system is running, persistent storage is provided by a userspace service — the store. It’s backed by a block device (virtio-blk), and exposes a content-addressed object store where objects are capnp messages.
interface Store {
    # Store a capnp message, returns its content hash
    put @0 (data :Data) -> (hash :Data);
    # Retrieve by hash
    get @1 (hash :Data) -> (data :Data);
    # Check existence
    has @2 (hash :Data) -> (exists :Bool);
    # Delete (if caller has authority — see note below)
    delete @3 (hash :Data) -> ();
}
Note on delete: In a content-addressed store, deleting a hash can break
references from other namespaces pointing to the same object. delete on the
base Store interface is dangerously broad — a StoreAdmin interface
(separate from Store) may be more appropriate, with delete restricted to a
GC service that can verify no live references exist. Open Question #3 (GC)
should be resolved before implementing delete. The attenuation table below
lists Store (full) as “Read, write, delete any object” — in practice, most
callers should receive a Store attenuated to put/get/has only.
Content-addressed means:
- Deduplication is automatic (same content = same hash)
- Integrity is verifiable (hash the data, compare)
- References between objects are just hashes embedded in capnp messages
- No mutable paths — “updating a file” means storing a new version and updating the reference
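These properties can be demonstrated with a toy in-memory store in Rust. MemStore and content_hash are hypothetical names, and the hash is 64-bit FNV-1a purely as a dependency-free stand-in: it is not collision resistant, and a real store would use a cryptographic hash (the proposal does not name one here).

```rust
use std::collections::HashMap;

pub type Hash = [u8; 8];

/// 64-bit FNV-1a. A non-cryptographic stand-in used only to keep the sketch
/// free of external crates; do not use it for real content addressing.
pub fn content_hash(data: &[u8]) -> Hash {
    let mut h: u64 = 0xcbf29ce484222325;
    for byte in data {
        h ^= *byte as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h.to_be_bytes()
}

#[derive(Default)]
pub struct MemStore {
    objects: HashMap<Hash, Vec<u8>>,
}

impl MemStore {
    /// Store the bytes and return their hash; identical content is stored once.
    pub fn put(&mut self, data: &[u8]) -> Hash {
        let hash = content_hash(data);
        self.objects.entry(hash).or_insert_with(|| data.to_vec());
        hash
    }

    pub fn get(&self, hash: &Hash) -> Option<&[u8]> {
        self.objects.get(hash).map(|v| v.as_slice())
    }

    pub fn has(&self, hash: &Hash) -> bool {
        self.objects.contains_key(hash)
    }

    /// Number of distinct objects held.
    pub fn len(&self) -> usize {
        self.objects.len()
    }
}
```

A reader can verify integrity by rehashing what get returns and comparing it with the hash it asked for; "updating" an object means calling put with new bytes and rebinding whatever referred to the old hash.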
Mutable References: Namespaces
A Namespace capability provides mutable name-to-hash mappings on top of
the immutable store:
interface Namespace {
    # Resolve a name to a store hash
    resolve @0 (name :Text) -> (hash :Data);
    # Bind a name to a hash (if caller has write authority)
    bind @1 (name :Text, hash :Data) -> ();
    # List names (if caller has list authority)
    list @2 () -> (names :List(Text));
    # Get a sub-namespace (attenuated — restricted to a prefix)
    sub @3 (prefix :Text) -> (ns :Namespace);
}
A Namespace cap scoped to "config/" can only see and modify names under
that prefix. This is the analog of a chroot — but structural, not a kernel
hack. The sub() method returns a new Namespace cap via IPC cap transfer.
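Prefix scoping can be sketched in Rust as a view over shared bindings. The Namespace struct below is a hypothetical single-process model (Rc/RefCell instead of IPC, no write or list authority checks, no name validation), meant only to show that a sub-namespace has no way to name anything outside its prefix.

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;
use std::rc::Rc;

pub type Hash = Vec<u8>;

/// A view over shared bindings, restricted to names under `prefix`.
#[derive(Clone)]
pub struct Namespace {
    bindings: Rc<RefCell<BTreeMap<String, Hash>>>,
    prefix: String,
}

impl Namespace {
    /// Root view with an empty prefix.
    pub fn new() -> Self {
        Namespace { bindings: Rc::new(RefCell::new(BTreeMap::new())), prefix: String::new() }
    }

    /// Every name is interpreted relative to this view's prefix.
    fn full_name(&self, name: &str) -> String {
        format!("{}{}", self.prefix, name)
    }

    pub fn bind(&self, name: &str, hash: Hash) {
        self.bindings.borrow_mut().insert(self.full_name(name), hash);
    }

    pub fn resolve(&self, name: &str) -> Option<Hash> {
        self.bindings.borrow().get(&self.full_name(name)).cloned()
    }

    /// Names visible through this view, with the prefix stripped.
    pub fn list(&self) -> Vec<String> {
        self.bindings
            .borrow()
            .keys()
            .filter_map(|k| k.strip_prefix(self.prefix.as_str()).map(str::to_string))
            .collect()
    }

    /// Attenuate: the returned view can only reach names under the longer prefix.
    pub fn sub(&self, prefix: &str) -> Namespace {
        Namespace { bindings: Rc::clone(&self.bindings), prefix: self.full_name(prefix) }
    }
}
```

The scoped view never receives the parent prefix as input, so there is nothing to escape from: a name such as "secrets/key" passed to a "config/" view simply resolves to "config/secrets/key".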
Future: union composition. The research survey recommends
extending Namespace with Plan 9-inspired union semantics — a union(other, mode) method that merges two namespaces with before/after/replace ordering.
This adds composability without a global mount table. See
research survey §6.
IPC and Capability Transfer
Several storage operations return new capabilities: Directory.open()
returns a File, Directory.sub() returns a Directory, Namespace.sub()
returns a Namespace. This requires dynamic capability management — the kernel
must install new capabilities in a process’s CapTable at runtime as part of
IPC.
The Capability Ring
All kernel-userspace interaction goes through a shared-memory ring pair (submission queue + completion queue), inspired by io_uring. SQE opcodes map to capnp-rpc Level 1 message types. The ring is allocated per-process at spawn time and mapped into the process’s address space.
Syscall surface: 2 syscalls. New capabilities, operations, and transfer mechanisms are expressed as new SQE opcodes instead of expanding the syscall ABI.
| # | Syscall | Purpose |
|---|---|---|
| 1 | exit(code) | Terminate current thread; process exits after its last live thread |
| 2 | cap_enter(min_complete, timeout_ns) | Process pending SQEs, then wait until enough CQEs exist or the timeout expires |
Writing SQEs is syscall-free, but ordinary capability CALLs make progress
through cap_enter. Timer polling handles non-CALL ring work, plus CALLs only to
targets that explicitly opt into interrupt-context dispatch. cap_enter
flushes pending SQEs and can block the process until min_complete
completions are available or a finite timeout expires. An indefinite wait uses
timeout_ns = u64::MAX; timeout_ns = 0 keeps the call non-blocking. A future
SQPOLL-style worker can reintroduce a zero-syscall CALL-completion hot path
without running arbitrary capability methods from timer interrupt context.
The ring structs and synchronous CALL dispatch are implemented and working.
See capos-config/src/ring.rs for the shared ring structs and
kernel/src/cap/ring.rs for kernel-side processing.
Ring Layout
One 4 KiB page per process, mapped into both kernel (HHDM) and user space:
┌─────────────────────────┐ offset 0
│ Ring Header │ SQ/CQ head, tail, mask, flags
├─────────────────────────┤ offset 128
│ SQE Array (16 × 64B) │ submission queue entries
├─────────────────────────┤ offset 1152
│ CQE Array (32 × 32B) │ completion queue entries
└─────────────────────────┘
SQ: userspace owns tail (producer), kernel owns head (consumer)
CQ: kernel owns tail (producer), userspace owns head (consumer)
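The layout arithmetic can be checked with a few Rust constants. The sizes and counts come from the diagram above; the constant and function names are hypothetical, and treating head/tail as free-running u32 counters reduced by a power-of-two mask is an assumption borrowed from the io_uring convention (the authoritative structs are in capos-config/src/ring.rs).

```rust
pub const PAGE_SIZE: usize = 4096;
pub const HEADER_SIZE: usize = 128;
pub const SQ_ENTRIES: usize = 16;
pub const SQE_SIZE: usize = 64;
pub const CQ_ENTRIES: usize = 32;
pub const CQE_SIZE: usize = 32;

pub const SQ_OFFSET: usize = HEADER_SIZE;
pub const CQ_OFFSET: usize = SQ_OFFSET + SQ_ENTRIES * SQE_SIZE;
pub const RING_END: usize = CQ_OFFSET + CQ_ENTRIES * CQE_SIZE;

// Compile-time checks: the ring fits in one page, and both queues are
// power-of-two sized so an index can be reduced with a mask.
const _: () = assert!(RING_END <= PAGE_SIZE);
const _: () = assert!(SQ_ENTRIES.is_power_of_two() && CQ_ENTRIES.is_power_of_two());

/// Byte offset of the SQE slot for a free-running index (assumed convention).
pub fn sqe_slot_offset(index: u32) -> usize {
    SQ_OFFSET + (index as usize & (SQ_ENTRIES - 1)) * SQE_SIZE
}

/// Entries between consumer head and producer tail, robust to u32 wraparound.
pub fn pending(head: u32, tail: u32) -> u32 {
    tail.wrapping_sub(head)
}
```

With these numbers the SQE array ends at byte 1152 and the CQE array at byte 2176, leaving the rest of the 4 KiB page unused.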
SQE Opcodes
Six opcodes cover client calls, server dispatch, capability transfer, pipelining, lifecycle, and timeouts:
| Opcode | capnp-rpc analog | Purpose |
|---|---|---|
| CALL | Call | Invoke method on a capability |
| RETURN | Return | Respond to incoming call (server side) |
| RECV | (implicit) | Wait for incoming calls on Endpoint |
| RELEASE | Release | Drop a capability reference |
| FINISH | Finish | Release pipeline answer state |
| TIMEOUT | (none) | Post a CQE after N nanoseconds (io_uring-inspired) |
TIMEOUT is an alternative to the timeout_ns argument on cap_enter:
it works with zero-syscall polling (kernel fires the CQE on a timer tick)
and composes with LINK/DRAIN for deadline-based chains.
SQE flags: PIPELINE (cap_id is a promise reference), LINK (chain to
next SQE), MULTISHOT (keep generating CQEs), DRAIN (barrier).
Promise Pipelining
A CALL SQE can target either a concrete CapId or a PromisedAnswer
reference (via the PIPELINE flag + pipeline_dep/pipeline_field fields).
pipeline_dep names the earlier answer and pipeline_field is a zero-based
CapTransferResult record index in that answer’s sideband result-cap list, not
a Cap’n Proto schema field. The kernel resolves the dependency chain internally:
SQE[0]: CALL dir.open("report.pdf") → answer_id=200, user_data=100
SQE[1]: CALL [PIPELINE: dep=200, result_cap=0].read(0, 4096) → user_data=101
One cap_enter call. The kernel dispatches SQE[0], resolves result cap record
0 from the completion sideband, and dispatches SQE[1] against it without
returning to userspace between steps or parsing the result payload.
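The target resolution step can be sketched in Rust as a lookup in an answer table. Target, Answer, Answers, and ResolveError are hypothetical names for illustration; the kernel's actual bookkeeping is in kernel/src/cap/ring.rs and will differ. The Broken case reflects the stated goal that a failed answer fails its dependents.

```rust
use std::collections::HashMap;

pub type CapId = u32;

/// What a CALL SQE targets: a concrete cap, or result cap N of an earlier answer.
pub enum Target {
    Cap(CapId),
    Pipeline { dep: u32, result_cap: usize },
}

/// Outcome recorded for a completed answer.
pub enum Answer {
    /// Sideband result-cap list (the CapTransferResult records, reduced to ids).
    Caps(Vec<CapId>),
    Broken,
}

#[derive(Debug, PartialEq)]
pub enum ResolveError {
    UnknownAnswer,
    BrokenAnswer,
    NoSuchResultCap,
}

#[derive(Default)]
pub struct Answers {
    slots: HashMap<u32, Answer>,
}

impl Answers {
    /// Record the outcome of the call that was assigned `answer_id`.
    pub fn complete(&mut self, answer_id: u32, answer: Answer) {
        self.slots.insert(answer_id, answer);
    }

    /// Resolve a target to a concrete cap without looking at any message payload.
    pub fn resolve(&self, target: &Target) -> Result<CapId, ResolveError> {
        match target {
            Target::Cap(id) => Ok(*id),
            Target::Pipeline { dep, result_cap } => match self.slots.get(dep) {
                None => Err(ResolveError::UnknownAnswer),
                Some(Answer::Broken) => Err(ResolveError::BrokenAnswer),
                Some(Answer::Caps(caps)) => {
                    caps.get(*result_cap).copied().ok_or(ResolveError::NoSuchResultCap)
                }
            },
        }
    }
}
```

For the example above, completing answer 200 with a single File cap makes the pipelined read resolve to that cap; the index is a position in the sideband list, never a schema field.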
The Endpoint Kernel Object
For cross-process IPC, an Endpoint connects client-side proxy caps to a server’s receive loop:
Client's CapTable Server's CapTable
┌─────────────────┐ ┌──────────────────┐
│ cap 2: Proxy │ │ cap 0: Endpoint │
│ → endpoint ────────── Endpoint ◄──── RECV SQE ──│ │
│ badge: 42 │ (kernel obj) │ │
└─────────────────┘ └──────────────────┘
The server posts a RECV SQE (with MULTISHOT flag). Incoming calls appear
as CQEs with badge, interface_id, method_id, and a kernel-assigned call_id.
The server responds by posting a RETURN SQE referencing the call_id.
interface_id is the transported schema ID for the interface being invoked.
It should equal the generated TYPE_ID for that capnp interface. cap_id is
the authority-bearing table handle; interface_id is only the protocol tag.
The target capability entry owns one public interface; method_id selects a
method inside that interface, while cap_id identifies the object being
invoked. If the same backing state needs another interface, the transport
should mint a separate capability entry for that interface rather than letting
one handle accept multiple unrelated interface_id values.
Direct-Switch IPC
When a client’s CALL targets a cap served by a blocked server (waiting on RECV), the kernel marks that server as the direct IPC handoff target so the next context-switch path runs the callee before unrelated round-robin work. The current implementation still uses the ordinary saved-context restore path; small-message register transfer remains a future fastpath after measurement. See research survey §2.
Capability Transfer via Ring
Capabilities travel as sideband arrays (CapTransferDescriptor) alongside capnp
message bytes:
- CALL params: params buffer contains the capnp message bytes followed by xfer_cap_count transfer descriptors packed at addr + len, which must be aligned to CAP_TRANSFER_DESCRIPTOR_ALIGNMENT.
- RETURN results: server result buffers carry the capnp reply bytes and may carry return transfer descriptors on addr + len; the kernel inserts destination capability records in the caller’s result buffer after the normal result bytes. Count is reported in CQE cap_count and those records are written as CapTransferResult { cap_id, interface_id } values at result_addr + result. The requested result buffer (result_len) must be large enough for both normal reply bytes and all appended cap_count records.
xfer_cap_count > 0 with malformed descriptor metadata (bad mode bits, reserved
bits, _reserved0, or misalignment) fails closed as
CAP_ERR_INVALID_TRANSFER_DESCRIPTOR. Kernels that have not yet enabled transfer
handling should return CAP_ERR_TRANSFER_NOT_SUPPORTED for transfer-bearing SQEs.
The capnp wire format’s WirePointerKind::Other encodes capability indices
in messages. The sideband arrays map these indices to actual CapIds. The
kernel does not parse capnp messages — it transfers a list of caps alongside
the opaque message bytes.
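The layout rules above reduce to two arithmetic checks. The sketch below assumes an 8-byte descriptor alignment and a 16-byte `CapTransferResult` record. Both numbers are placeholders rather than values taken from the capOS ABI, and the error enum is hypothetical.

```rust
// Assumed values for illustration only; the real constants live in the capOS ABI.
const CAP_TRANSFER_DESCRIPTOR_ALIGNMENT: u64 = 8;
const CAP_TRANSFER_RESULT_SIZE: u64 = 16;

// Hypothetical error names for this sketch.
#[derive(Debug, PartialEq)]
enum LayoutError {
    Misaligned,
    Overflow,
    ResultBufferTooSmall,
}

/// Transfer descriptors start directly after the message bytes, at addr + len.
fn descriptor_start(addr: u64, len: u64) -> Result<u64, LayoutError> {
    let start = addr.checked_add(len).ok_or(LayoutError::Overflow)?;
    if start % CAP_TRANSFER_DESCRIPTOR_ALIGNMENT != 0 {
        return Err(LayoutError::Misaligned);
    }
    Ok(start)
}

/// result_len must cover the reply bytes plus every appended result record.
fn check_result_buffer(result_len: u64, reply_bytes: u64, cap_count: u64) -> Result<(), LayoutError> {
    let records = cap_count
        .checked_mul(CAP_TRANSFER_RESULT_SIZE)
        .ok_or(LayoutError::Overflow)?;
    let needed = reply_bytes.checked_add(records).ok_or(LayoutError::Overflow)?;
    if result_len < needed {
        Err(LayoutError::ResultBufferTooSmall)
    } else {
        Ok(())
    }
}

fn main() {
    println!("{:?}", descriptor_start(0x1000, 64));
    println!("{:?}", check_result_buffer(96, 64, 2));
}
```

Checked arithmetic keeps a hostile `len` or `cap_count` from wrapping past the bounds check, which fits the fail-closed behaviour described for malformed descriptors.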
Dynamic Capability Management
Every open(), sub(), or resolve() creates and transfers a new
capability at runtime. The kernel’s CapTable insert() and remove() are
the primitives. Capabilities flow through RETURN SQE sideband arrays (and
through the manifest at boot). No separate cap_grant mechanism needed —
authority flow follows the ring’s IPC graph.
The CapTable generation counter handles stale references: when a File cap is
closed (slot freed, generation bumps), any cached CapId returns
StaleGeneration instead of accidentally hitting a new occupant.
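A generation-checked slot table can be sketched as follows. The `CapId` layout, the slot reuse policy, and the `Empty` error are assumptions for illustration; only `insert()`, `remove()`, and `StaleGeneration` are named in the text above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct CapId {
    slot: u32,
    generation: u32,
}

#[derive(Debug, PartialEq)]
enum CapError {
    StaleGeneration,
    Empty, // hypothetical: slot index out of range or already vacant
}

struct Slot<T> {
    generation: u32,
    value: Option<T>,
}

struct CapTable<T> {
    slots: Vec<Slot<T>>,
}

impl<T> CapTable<T> {
    fn new() -> Self {
        CapTable { slots: Vec::new() }
    }

    /// Reuse the first free slot if there is one, otherwise grow the table.
    fn insert(&mut self, value: T) -> CapId {
        if let Some(i) = self.slots.iter().position(|s| s.value.is_none()) {
            self.slots[i].value = Some(value);
            return CapId { slot: i as u32, generation: self.slots[i].generation };
        }
        self.slots.push(Slot { generation: 0, value: Some(value) });
        CapId { slot: (self.slots.len() - 1) as u32, generation: 0 }
    }

    /// Freeing a slot bumps its generation so old CapIds stop matching.
    fn remove(&mut self, id: CapId) -> Result<T, CapError> {
        let slot = self.slots.get_mut(id.slot as usize).ok_or(CapError::Empty)?;
        if slot.generation != id.generation {
            return Err(CapError::StaleGeneration);
        }
        let value = slot.value.take().ok_or(CapError::Empty)?;
        slot.generation = slot.generation.wrapping_add(1);
        Ok(value)
    }

    fn get(&self, id: CapId) -> Result<&T, CapError> {
        let slot = self.slots.get(id.slot as usize).ok_or(CapError::Empty)?;
        if slot.generation != id.generation {
            return Err(CapError::StaleGeneration);
        }
        slot.value.as_ref().ok_or(CapError::Empty)
    }
}

fn main() {
    let mut table = CapTable::new();
    let old = table.insert("file-a");
    table.remove(old).unwrap();
    let new = table.insert("file-b"); // lands in the same slot
    println!("{:?} {:?}", table.get(old), table.get(new));
}
```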
Shared Memory for Bulk Data
Copying file data through capnp Data fields works for metadata and small
reads, but is impractical for anything above a few KB. A 1 MB read through
a capability CALL copies data four times: device → driver heap → capnp
message → kernel buffer → client buffer.
SharedBuffer Capability
SharedBuffer is the service-facing name this proposal uses for bulk-transfer
buffers. The implemented kernel/user substrate is MemoryObject: a capability
backed by physical pages that can be mapped into multiple address spaces
simultaneously. Zero copies between processes.
interface MemoryObject {
# Size and page count of the backing object.
info @0 () -> (pageCount :UInt32, sizeBytes :UInt64);
# Map a page-aligned object range into the caller's address space.
map @1 (hint :UInt64, offset :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
# Unmap a caller-local borrowed mapping backed by this object.
unmap @2 (addr :UInt64, size :UInt64) -> ();
# Update caller-local page permissions for a borrowed mapping.
protect @3 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}
The kernel creates MemoryObjects through the existing FrameAllocator
capability. Held MemoryObject caps charge the holder’s frame-grant quota; mapped
address-space pages are tracked as borrowed pages and keep the same backing
alive until unmapped or process teardown. A later SharedBuffer alias or
allocator may wrap this ABI for storage/network interfaces, but current code
should use MemoryObject directly.
File I/O with SharedBuffer
File and BlockDevice interfaces support both inline-Data and SharedBuffer modes:
# Small read (< ~4 KB): inline in capnp message
file.read(offset=0, length=256) → {data: [256 bytes]}
# Large read: caller provides SharedBuffer, server fills it
let buf = frame_alloc.allocContiguous(256); # 1 MB MemoryObject / SharedBuffer
file.readBuf(offset=0, buf, length=1048576) → {bytesRead: 1048576}
# Data is now in buf's mapped pages — no copy through kernel
Extended File interface with SharedBuffer support:
interface File {
read @0 (offset :UInt64, length :UInt32) -> (data :Data);
write @1 (offset :UInt64, data :Data) -> (written :UInt32);
readBuf @2 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (bytesRead :UInt32);
writeBuf @3 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (written :UInt32);
stat @4 () -> (size :UInt64, created :UInt64, modified :UInt64);
truncate @5 (length :UInt64) -> ();
sync @6 () -> ();
close @7 () -> ();
}
The readBuf/writeBuf methods accept a SharedBuffer cap, currently a
MemoryObject cap transferred via IPC. The server maps the buffer, performs DMA
or memory copies into it, then returns. The caller reads directly from the
mapped pages.
For BlockDevice, the same pattern applies — the driver maps the SharedBuffer, programs DMA descriptors pointing to its physical pages, and the device writes directly into the shared memory.
When to Use Each Mode
| Scenario | Mechanism | Why |
|---|---|---|
| Reading a 64-byte config value | File.read() inline Data | Copy overhead negligible |
| Reading a 10 MB binary | File.readBuf() SharedBuffer | Avoids 4× copy overhead |
| FAT directory entry (32 bytes) | BlockDevice.readBlocks() inline | Small metadata read |
| Streaming video frames | File.readBuf() + ring of SharedBuffers | Continuous zero-copy |
| Network packet buffers | SharedBuffer ring between NIC driver and net stack | DMA-capable pages |
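A client library could pick the mode from the request size. The following is a rough sketch under two assumptions: the "~4 KB" guidance is taken as an exact 4096-byte threshold, and pages are 4 KiB. The `ReadMode` type and `choose_read_mode` helper are hypothetical names.

```rust
const PAGE_SIZE: usize = 4096; // assumed page size
const INLINE_LIMIT: usize = 4096; // assumed threshold from the "~4 KB" guidance

#[derive(Debug, PartialEq)]
enum ReadMode {
    /// Payload travels inside the capnp reply.
    Inline,
    /// Caller allocates a buffer of this many pages and calls readBuf.
    SharedBuffer { pages: usize },
}

fn choose_read_mode(length: usize) -> ReadMode {
    if length < INLINE_LIMIT {
        ReadMode::Inline
    } else {
        // Round up so a partial trailing page is still covered.
        ReadMode::SharedBuffer { pages: (length + PAGE_SIZE - 1) / PAGE_SIZE }
    }
}

fn main() {
    println!("{:?}", choose_read_mode(64));
    println!("{:?}", choose_read_mode(1 << 20));
}
```

With these assumptions a 1 MB read maps to 256 pages, matching the `allocContiguous(256)` call in the earlier example.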
Attenuation
Storage services mint restricted capabilities using wrapper CapObjects:
| Capability | Authority |
|---|---|
| Directory (full) | Open, list, mkdir, remove, sub |
| Directory (read-only) | Open (returns read-only Files), list, sub only |
| File (full) | Read, write, truncate, sync |
| File (read-only) | Read and stat only |
| File (append-only) | Read, stat, write at end only |
| Store (full) | Read, write, delete any object |
| Store (read-only) | Get and has only |
| Namespace (full) | Resolve, bind, list under prefix |
| Namespace (read-only) | Resolve and list only |
| Blob (single object) | Read one specific hash |
| SharedBuffer (read-only) | Map as read-only (page table: R, no W) |
An application that only needs to read its config gets a read-only
Directory scoped to its config path. It can’t write, can’t see other
apps’ directories, can’t access the raw BlockDevice.
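A wrapper-based attenuation might look like the sketch below. The `File` trait here is a two-method stand-in for the schema interface, and `RamFile`, `ReadOnly`, and `FileError` are hypothetical names. The point is only that the wrapper forwards reads and refuses writes without consulting any rights flag on the inner object.

```rust
#[derive(Debug, PartialEq)]
enum FileError {
    PermissionDenied,
}

/// Cut-down stand-in for the File interface.
trait File {
    fn read(&self, offset: usize, length: usize) -> Result<Vec<u8>, FileError>;
    fn write(&mut self, offset: usize, data: &[u8]) -> Result<usize, FileError>;
}

struct RamFile {
    bytes: Vec<u8>,
}

impl File for RamFile {
    fn read(&self, offset: usize, length: usize) -> Result<Vec<u8>, FileError> {
        // Clamp both ends so short reads past EOF return what exists.
        let start = offset.min(self.bytes.len());
        let end = offset.saturating_add(length).min(self.bytes.len());
        Ok(self.bytes[start..end].to_vec())
    }

    fn write(&mut self, offset: usize, data: &[u8]) -> Result<usize, FileError> {
        let end = offset + data.len();
        if self.bytes.len() < end {
            self.bytes.resize(end, 0);
        }
        self.bytes[offset..end].copy_from_slice(data);
        Ok(data.len())
    }
}

/// Attenuating wrapper: same interface, less authority.
struct ReadOnly<F: File>(F);

impl<F: File> File for ReadOnly<F> {
    fn read(&self, offset: usize, length: usize) -> Result<Vec<u8>, FileError> {
        self.0.read(offset, length)
    }

    fn write(&mut self, _offset: usize, _data: &[u8]) -> Result<usize, FileError> {
        Err(FileError::PermissionDenied)
    }
}

fn main() {
    let mut full = RamFile { bytes: Vec::new() };
    full.write(0, b"cfg").unwrap();
    let mut ro = ReadOnly(full);
    println!("{:?} {:?}", ro.read(0, 3), ro.write(0, b"x"));
}
```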
Naming Without Paths
Traditional OS: process opens /var/lib/myapp/data.db — a global path.
capOS: process receives a Directory or Namespace cap at spawn time,
opens "data.db" within it. The process has no idea where on disk this
lives. It can’t traverse upward. There is no global root.
# Traditional: global path namespace
/
├── etc/
│ └── myapp/
│ └── config.toml
├── var/
│ └── lib/
│ └── myapp/
│ └── data.db
└── sbin/
└── myapp
# capOS: per-process capability set (no global namespace)
Process "myapp" sees:
"config" → Directory(read-only, scoped to myapp's config files)
"data" → Directory(read-write, scoped to myapp's data files)
"state" → Namespace(read-write, scoped to myapp's store objects)
"log" → Console cap
"api" → HttpEndpoint cap
The process doesn’t know or care about the backing storage layout. It just uses the capabilities it was granted.
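The "no upward traversal" property follows from the directory only accepting single name components. A minimal in-memory sketch is shown below. The `Directory` struct, its error type, and the name-validation rule are assumptions for illustration, not the schema's `Directory` interface.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum DirError {
    InvalidName,
    NotFound,
}

#[derive(Default)]
struct Directory {
    files: BTreeMap<String, Vec<u8>>,
    subdirs: BTreeMap<String, Directory>,
}

/// Only plain single components are names: no "..", no ".", no separators.
fn valid_component(name: &str) -> bool {
    !name.is_empty() && name != "." && name != ".." && !name.contains('/')
}

impl Directory {
    fn open(&self, name: &str) -> Result<&[u8], DirError> {
        if !valid_component(name) {
            return Err(DirError::InvalidName);
        }
        self.files.get(name).map(|bytes| bytes.as_slice()).ok_or(DirError::NotFound)
    }

    /// Returns a handle that can only see the child subtree.
    fn sub(&self, name: &str) -> Result<&Directory, DirError> {
        if !valid_component(name) {
            return Err(DirError::InvalidName);
        }
        self.subdirs.get(name).ok_or(DirError::NotFound)
    }
}

fn demo_root() -> Directory {
    let mut myapp = Directory::default();
    myapp.files.insert("data.db".to_string(), b"db-bytes".to_vec());
    let mut root = Directory::default();
    root.files.insert("secret".to_string(), b"other-app".to_vec());
    root.subdirs.insert("myapp".to_string(), myapp);
    root
}

fn main() {
    let root = demo_root();
    let scoped = root.sub("myapp").unwrap();
    println!("{:?}", scoped.open("data.db").map(|b| b.len()));
    println!("{:?}", scoped.open("../secret"));
}
```

The scoped handle holds no reference to its parent, so there is nothing for a ".." lookup to resolve to even if the name check were skipped.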
Configuration
Build-Time Config (Boot Manifest)
The system manifest is authored at build time. The human-writable source
could be any format — TOML, CUE, or even a Makefile target that generates
the capnp binary. What matters is that it compiles to a SystemManifest
capnp message baked into the ISO.
Example source (TOML, compiled to capnp by a build tool):
[services.virtio-net]
binary = "virtio-net"
restart = "always"
caps = [
{ name = "device_mmio", source = { kernel = "device_mmio" } },
{ name = "interrupt", source = { kernel = "interrupt" } },
{ name = "log", source = { kernel = "console" } },
]
exports = ["nic"]
[services.net-stack]
binary = "net-stack"
restart = "always"
caps = [
{ name = "nic", source = { service = { service = "virtio-net", export = "nic" } } },
{ name = "timer", source = { kernel = "timer" } },
{ name = "log", source = { kernel = "console" } },
]
exports = ["net"]
[services.fat-fs]
binary = "fat-fs"
restart = "always"
caps = [
{ name = "blk", source = { service = { service = "usb-storage", export = "block-device" } } },
{ name = "log", source = { kernel = "console" } },
]
exports = ["root-dir"]
[services.my-app]
binary = "my-app"
restart = "on-failure"
caps = [
{ name = "api", source = { service = { service = "http-service", export = "api" } } },
{ name = "docs", source = { service = { service = "fat-fs", export = "root-dir" } } },
{ name = "data", source = { service = { service = "store", export = "namespace" } } },
{ name = "log", source = { kernel = "console" } },
]
A build tool validates this against the capnp schemas (does virtio-net
actually export "nic"? does http-service support endpoint() minting?)
and produces the binary manifest.
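The export check a build tool performs could be sketched like this. The `Service` and `Source` types and the error strings are hypothetical. A real tool would work from the parsed manifest and the capnp schemas rather than hand-built maps.

```rust
use std::collections::BTreeMap;

// Hypothetical in-memory model of a manifest entry.
enum Source {
    Kernel(&'static str),
    Service { service: &'static str, export: &'static str },
}

struct Service {
    caps: Vec<(&'static str, Source)>,
    exports: Vec<&'static str>,
}

/// Report every cap whose source names a missing service or a missing export.
fn validate(manifest: &BTreeMap<&'static str, Service>) -> Vec<String> {
    let mut errors = Vec::new();
    for (name, svc) in manifest {
        for (cap_name, source) in &svc.caps {
            if let Source::Service { service, export } = source {
                match manifest.get(service) {
                    None => errors.push(format!("{}.{}: unknown service {}", name, cap_name, service)),
                    Some(provider) if !provider.exports.contains(export) => errors.push(format!(
                        "{}.{}: {} does not export {}",
                        name, cap_name, service, export
                    )),
                    Some(_) => {}
                }
            }
        }
    }
    errors
}

/// Two-service manifest; `nic_export` is what net-stack asks virtio-net for.
fn example(nic_export: &'static str) -> BTreeMap<&'static str, Service> {
    let mut manifest = BTreeMap::new();
    manifest.insert(
        "virtio-net",
        Service { caps: vec![("log", Source::Kernel("console"))], exports: vec!["nic"] },
    );
    manifest.insert(
        "net-stack",
        Service {
            caps: vec![("nic", Source::Service { service: "virtio-net", export: nic_export })],
            exports: vec!["net"],
        },
    );
    manifest
}

fn main() {
    println!("{:?}", validate(&example("nic")));
    println!("{:?}", validate(&example("nic0")));
}
```

A check of this kind only sees declared exports. Whether a service actually serves what it declares is a runtime property.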
Runtime Config (via Store)
Once the store service is running, configuration can be stored there and updated without rebuilding the ISO. The store is just another capability — a config-management service could watch for changes and signal services to reload.
Connection to Network Transparency
If capabilities are the only abstraction, and capnp is the only wire format, then the transport is irrelevant:
- Local IPC: capnp message copied between address spaces by kernel
- Local store: capnp message written to block device
- Remote IPC: capnp message sent over TCP to another machine
- Remote store: capnp message fetched from a remote store service
A capability reference doesn’t encode where the backing service lives. The kernel (or a proxy) handles routing. This means:
- A Directory cap could be backed by local FAT or a remote 9P server
- A Namespace cap could be backed by local storage or a remote store
- A Fetch cap could route through a local HTTP service or a remote proxy
- A ProcessSpawner cap could spawn locally or on a remote machine
The system manifest could describe services that run on different machines, and the capability graph spans the network. This is the “network transparency” item in the roadmap — it falls out naturally from the model.
Persistence of the Capability Graph
The live capability graph (which process holds which caps) is ephemeral — it exists in kernel memory and is lost on reboot. The system manifest describes the intended graph, and init rebuilds it on each boot.
For true persistence (resume after reboot without re-initializing):
- Each service serializes its state to the store before shutdown
- On next boot, the manifest includes “restore from store hash X” hints
- Services read their saved state from the store and resume
This is application-level persistence, not kernel-level. The kernel doesn’t snapshot the capability graph — services are responsible for their own state. This avoids the complexity of EROS-style transparent persistence while still allowing stateful services.
Managed Cloud Backing
The local Store/Namespace interfaces define capOS persistence semantics. A
cloud backend must be an adapter behind those interfaces, not a new ambient
authority path. Services such as the adventure profile, expedition, and ledger
services should serialize bounded Cap’n Proto records to a store capability; the
caller should not know whether that store is backed by RAM, local disk, or a
managed cloud service.
For cloud-first application data, use a narrow bridge service:
capOS service -> Store/Namespace or app-specific SaveStore cap -> Cloud bridge
-> provider APIs
The bridge owns provider credentials and exposes only typed save/load/append operations. Ordinary clients never receive provider credentials, bucket names, database document paths, or broad write authority.
Recommended GCP mapping for game/profile style state:
- Firestore Native mode for small mutable indexes and profile summaries that need transactional compare-and-set behavior.
- Cloud Storage for larger immutable snapshots, evidence blobs, exports, and content-addressed objects. Object versioning and lifecycle policy should bound accidental overwrite recovery and storage growth.
- Cloud Run for a small HTTPS or capnp-over-HTTP bridge endpoint when capOS cannot yet link provider SDKs directly.
- Secret Manager for bridge-side service credentials and rotation; secrets do not enter ordinary capOS game clients.
Provider-specific records must still carry capOS-level schema version, content hash or release id, profile/tenant id, monotonic version, size limit, and migration policy. Writes that race on the same mutable profile or checkpoint must use an explicit version precondition and fail closed when stale. Append-only ledgers should append new records with previous-record hashes rather than rewriting history. Local QEMU tests should use a fake cloud bridge that enforces the same stale-write, append-only, wrong-profile, and size-bound rules before any real provider integration is accepted.
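The stale-write rule for a fake cloud bridge can be reduced to a compare-and-set on a version number. This is a minimal sketch under assumptions: `ProfileSlot`, `SaveError`, and the 4096-byte size bound are made-up names and values, and a real bridge would also check profile id, schema version, and content hash.

```rust
const MAX_RECORD_BYTES: usize = 4096; // assumed size bound for this sketch

#[derive(Debug, PartialEq)]
enum SaveError {
    StaleVersion { current: u64 },
    TooLarge,
}

struct ProfileSlot {
    version: u64,
    bytes: Vec<u8>,
}

impl ProfileSlot {
    fn new() -> Self {
        ProfileSlot { version: 0, bytes: Vec::new() }
    }

    /// Accept the write only if the caller saw the current version.
    /// On any failure the stored record is left untouched (fail closed).
    fn save(&mut self, expected_version: u64, bytes: &[u8]) -> Result<u64, SaveError> {
        if bytes.len() > MAX_RECORD_BYTES {
            return Err(SaveError::TooLarge);
        }
        if expected_version != self.version {
            return Err(SaveError::StaleVersion { current: self.version });
        }
        self.bytes = bytes.to_vec();
        self.version += 1;
        Ok(self.version)
    }
}

fn main() {
    let mut slot = ProfileSlot::new();
    // Two writers both read version 0; only the first save wins.
    println!("{:?}", slot.save(0, b"writer-a"));
    println!("{:?}", slot.save(0, b"writer-b"));
}
```

The losing writer learns the current version from the error and has to re-read and retry, which is the behaviour the local QEMU tests are meant to pin down before a real provider is wired in.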
User-Owned Browser Transport
Some user data should be portable without giving the capOS service operator a database role over it. For private player backup/sync, a browser can act as the transport to user-owned storage:
capOS save service -> encrypted save capsule -> browser
browser OAuth/Firebase session -> Google Drive appDataFolder or Firebase user doc
This is not the same as the managed cloud bridge above. In the browser-transport
model, the user grants Drive/Firebase access to the web app, the browser writes
opaque encrypted capsules, and capOS never receives the provider tokens. The
encryption key follows the storage domain. Local capOS storage uses local
capOS-host key material, while GCP-backed game-world state uses Cloud KMS
envelope encryption, in which a per-world or per-shard KMS KEK wraps service-owned DEKs.
Google Drive’s appDataFolder is a good fit for app-private backup files
because it is hidden from ordinary Drive views and can use the narrow
drive.appdata scope. Firebase/Firestore can also carry per-user encrypted
capsule documents and provide offline cache/sync behavior, but the backend
cannot validate encrypted game semantics beyond metadata and access rules.
Treat user-owned blobs as backup material, not authority:
- The service validates signatures, profile id, content hash, schema version, monotonic version, previous hash, and size bounds before import.
- Append-only ledgers, reward witness records, market receipts, and multiplayer outcomes remain service-owned or cloud-bridge-owned authoritative records.
- A user may delete, duplicate, or roll back private blobs; restore code must handle that as an expected input, not as trusted history.
- Game-world key capabilities, DEKs, and KMS decrypt/unwrap grants should not be exposed to the browser. For GCP-backed worlds, DEK unwrap and plaintext use are KMS/IAM-backed authority granted to the relevant game-world service. For local capOS storage, local key backup/recovery is a separate local-host policy.
For GCP-backed game-world state, provision one Cloud KMS key ring and symmetric
CryptoKey KEK per world instance or shard. This follows the CloudKmsKeySource
envelope model from the cryptography/key-management and volume-encryption
proposals: Cloud KMS wraps or unwraps DEKs, and the game-world service uses the
unwrapped DEK internally as service authority, modeled as a SymmetricKey
capability. Grant Cloud KMS roles at the CryptoKey level where possible:
roles/cloudkms.cryptoKeyEncrypter for encrypt-only writers that wrap new DEKs,
roles/cloudkms.cryptoKeyDecrypter for restore or migration paths that unwrap
existing DEKs, and roles/cloudkms.cryptoKeyEncrypterDecrypter only for the
narrow game-world service that genuinely needs both operations. Do not model
browser OAuth identities, Drive/Firebase handles, or capOS clients as holders of
DEKs or KMS decrypt/unwrap grants, and do not rely on per-key-version IAM for
this design.
Key rotation and world retirement are service operations, not browser-vault features. Rotation creates new Cloud KMS KEK versions for future DEK wrapping but does not re-encrypt existing capsules, rewrite wrapped DEK blobs, or disable/destroy old versions. Managed re-encryption or rewrapping must unwrap the old DEK while its KEK version remains usable, decrypt and validate the capsule inside the game-world service, then write a new capsule with a new DEK or a DEK rewrapped by the current primary KEK version. Old KEK versions should only be disabled or destroyed after inventory proves no accepted wrapped DEK depends on them. Retiring a world removes IAM decrypt authority first; disabling key versions can make protected capsules inaccessible, while destruction is delayed by the scheduled destruction period and irreversible once complete, so audit retention and recovery must be settled before destruction.
Phases
Phase 1: Boot Manifest (parallel with Stage 4)
- Define SystemManifest schema in schema/
- Build tool (tools/mkmanifest) that compiles system.cue into a capnp-encoded manifest and packs it into the ISO as a Limine module
- Kernel parses the manifest and now creates only the initConfig.init process
- Focused init-executor manifests pass the manifest to the separate init binary as bytes through the read-only BootPackage capability
- The separate init binary is a generic manifest executor for the default system.cue path and focused init-executor smokes; focused shell-led smokes still use capos-shell as initConfig.init
- No persistent storage yet — boot image is the only data source
Phase 2: File I/O Interfaces in Schema (parallel with Stage 6)
Depends on: IPC (Stage 6) for cross-process cap transfer. Endpoint, RECV,
RETURN, capability transfer in CALL params, and capability transfer in RETURN
results are already implemented. The BlockDevice / File / Directory /
DirEntry / Store / Namespace schema has now landed in full. The
File / Directory / Store / Namespace interfaces also have RAM-backed
kernel CapObject implementations (Phase 3 slices 1-3); BlockDevice remains
schema-only. Userspace services that export Directory / File / Store /
Namespace caps over a real backing store have since landed (Phase 3 below),
and the kernel RAM-backed caps are now qemu-only proof/fixture surface rather
than a production persistence service – see
Kernel Storage Cap Backers Are Fixtures.
That history shaped two named downstream adapters:
- POSIX adapter Phase P1.4 (vendored dash port) does not require the userspace service for its v0 smoke: the bootstrap-granted RAM-backed Directory + Namespace kernel caps from Phase 3 slices 1-3 are an adequate read-only in-rodata pseudo-fs backing, so P1.4 is now ready to start on the userspace libcapos-posix file/dir/stdio/env/printf surface and on dash vendoring; see POSIX Adapter Phase P1.4 and docs/backlog/posix-adapter-dash-port.md. P1.3 (pipe + recording ProcessSpawner-driven fork-for-exec) landed without storage caps, so P1.4 is the next surface that consumes this proposal.
- WASI host adapter Phase W.5 (Preview 1 filesystem) similarly consumes the same kernel cap shape and is unblocked from the same cap-surface perspective; remaining W.5 work is on the wasi-host adapter side. See WASI Host Adapter Phase W.5.
Concrete work:
- Add BlockDevice, File, Directory, and DirEntry to schema/capos.capnp, regenerate the checked-in capnp bindings, add the BLOCKDEVICE_INTERFACE_ID/FILE_INTERFACE_ID/DIRECTORY_INTERFACE_ID constants, and add a capos-config host roundtrip test. This was schema-only when it landed; kernel CapObject implementations followed in Phase 3 slices 1-3 (the Store/Namespace interfaces were added in slice 3). SharedBuffer is not a separate interface – bulk transfers reuse the existing MemoryObject capability, and the inline-Data read/write/readBlocks/writeBlocks variants are the v0 surface.
- Demo: two-process file server (in-memory File/Directory service + client) that the POSIX and WASI adapters can resolve preopens against
Phase 3: RAM-backed Store (after Phase 2)
Depends on: IPC (Stage 6) for cross-process store access. Same downstream
blockers as Phase 2 – the POSIX adapter v0 plan resolves /etc / /lib
under a read-only Namespace once this lands.
Concrete work:
- Slice 1: minimal RAM-backed File CapObject (kernel/src/cap/file.rs). FileCap is backed by a single in-kernel Vec<u8> byte buffer and implements the inline-Data surface of the landed File interface – read/write/stat/truncate/sync/close – with per-call payloads bounded at 64 KiB. close() invalidates the cap: the cap-table get_slot path consults validate_live() (which returns Revoked once closed), and an in-call() guard is the defense-in-depth backup, so a post-close call fails closed with an application exception. A new KernelCapSource::file grant source lets a manifest grant the cap; the make run-file-server-smoke QEMU smoke (demos/file-server-smoke/, system-file-server-smoke.cue) drives write/read/stat/close round-trips and asserts the closed-cap rejection. Bulk-buffer / MemoryObject-mapped variants are later slices.
- Slice 2: minimal RAM-backed
Directory CapObject (kernel/src/cap/directory.rs). DirectoryCap is an in-memory namespace (BTreeMap<String, DirectoryEntry>, where each entry is a FileCap or a sub-DirectoryCap) implementing the landed Directory interface – open/list/mkdir/remove/sub. open/mkdir/sub mint a File/Directory result capability through the existing IPC result-cap transfer machinery (no new transfer authority); file read/write goes through the transferred File caps, never through the Directory. remove deletes an entry and revoke()s the backing object so every cap already handed out for it fails closed on its next dispatch, and refuses a non-empty sub-directory; close() invalidates the cap and recursively revokes the subtree. sub() has no attenuation beyond the structural scoping every sub-Directory already has – per-method read-only attenuation is deferred. A new KernelCapSource::directory grant source lets a manifest grant the cap; the make run-directory-server-smoke QEMU smoke (demos/directory-server-smoke/, system-directory-server-smoke.cue) drives open/list/mkdir/remove/sub with cap transfer and asserts the post-remove fail-closed rejection.
- Slice 3:
Store and Namespace interfaces in schema/capos.capnp plus minimal RAM-backed Store/Namespace kernel CapObjects (kernel/src/cap/store.rs, kernel/src/cap/namespace.rs). The schema additions are purely additive (Store/Namespace interfaces and the store @34/namespace @35 KernelCapSource ordinals); the STORE_INTERFACE_ID/NAMESPACE_INTERFACE_ID constants and a capos-config host roundtrip test landed alongside. StoreCap is a content-addressed blob store (BTreeMap<[u8; 32], Vec<u8>> keyed by the SHA-256 content hash from capos_lib::content_hash) implementing put/get/has/delete; put is idempotent for identical content, blob and count bounds keep one Store from ballooning the kernel heap, and delete is kept on the base interface for this focused proof (the StoreAdmin split and a GC-verified delete remain deferred – see the delete note above). NamespaceCap is a name->hash binding map (BTreeMap<String, Vec<u8>> for bindings plus a BTreeMap<String, Arc<NamespaceCap>> of sub children) implementing resolve/bind/list/sub; bind overwrites an existing name (mutable references are the point), sub(prefix) mints a structurally scoped child node and transfers it through the existing IPC result-cap machinery (no new transfer authority, idempotent for a repeated prefix), and the parent->child recursive revoke() reuses the same finite-tree lock-ordering invariant DirectoryCap documents. The bindings are opaque hash bytes – a NamespaceCap does not hold a StoreCap reference or verify the hash names a live blob in this slice. New KernelCapSource::store/KernelCapSource::namespace grant sources let a manifest grant the caps; the make run-store-namespace-smoke QEMU smoke (demos/store-namespace-smoke/, system-store-namespace-smoke.cue) drives Store put/has/get/delete and Namespace bind/resolve/list/sub with cap transfer and asserts two fail-closed rejections (a Store.get of an unknown hash and a Namespace.resolve of an unbound name).
- Implement
Store as a userspace service over an exported Endpoint, moving it out of the kernel data path: a two-process provider->consumer demo (demos/store-service/, system-userspace-store-smoke.cue, make run-userspace-store-smoke) serves put/get/has/delete from an in-RAM BTreeMap<[u8;32], Vec<u8>> – no kernel Store cap in the data path. It mirrors the kernel StoreCap blob-count bound and publishes a narrower 4 KiB service-specific inline blob limit because the endpoint-framed request must fit in the service receive buffer; the smoke proves the largest accepted inline blob and the first rejected over-limit blob. The client uses the stock capos-rt StoreClient over the service endpoint relabelled to STORE_INTERFACE_ID via the manifest expectedInterfaceId. Still RAM, not yet a real store.
- Implement a persistent
Store + Namespace userspace service backed by a granted BlockDevice, moving the durable serve boundary out of the kernel: a three-process demo (demos/storage-persist-service/, system-storage-persist-service.cue, make run-storage-persist-service) serves Store (put/get/has/delete/list) and Namespace (resolve/bind/list/sub) from a single service that owns the on-disk CAPOSUS1 whole-state snapshot over a virtio-blk BlockDevice – no kernel Store/Namespace cap in the data path. The snapshot stores content-addressed blob bytes (keys recomputed and re-verified on load) and name->hash bindings; a superblock names the live snapshot length, its content hash, and a monotonic generation, and every mutation writes the new payload fully into the standby of two alternating A/B payload regions (selected by generation parity) and FLUSHes it before the single-sector superblock write flips the generation, so the previously committed snapshot survives a crash at any write boundary. Namespace.sub returns a scoped Namespace cap by pre-minting a bounded pool of Namespace-typed service-object facets of the service’s own namespace endpoint (each a distinct receiver cookie, minted through a spawned sub-helper) and transferring one through the IPC result-cap path; scoped calls route back to the same endpoint by cookie. The client reaches both interfaces through manifest-granted service caps relabelled to STORE_INTERFACE_ID/NAMESPACE_INTERFACE_ID, and the two-boot make run-storage-persist-service proves the marker and note objects and their bindings survive a reboot (the service reloads them before the second boot writes anything) even after the harness garbages the standby payload region between the boots, simulating a commit interrupted mid payload write (torn-commit recovery proof).
- Serve the result-cap-returning userspace
Directory + File filesystem interfaces from userspace: a three-process demo (demos/storage-fs-service/, system-storage-fs-service.cue, make run-userspace-directory-file-smoke) runs a service (the init process) that owns an in-memory filesystem tree and serves Directory (open/list/mkdir/remove/sub/create/rename) and File (read/write/stat/truncate/sync/close) over a single endpoint, dispatched by the call’s stamped interface id and receiver-cookie badge – no kernel readonly_fs/writable_fs/installable_image cap in the data path. Directory.open (-> File), mkdir/sub (-> Directory) transfer result caps from bounded pools of pre-minted typed service-object facets of the same endpoint (minted through the spawned subhelper, each a distinct cookie). The client reaches the tree through a writable root (a Directory client-endpoint facet) and a read-only root (a Directory service-object facet over the same tree); read-only attenuation is structural – the read-only root and the read-only File handles it returns fail mutation methods closed by routing on the cookie, not a rights flag. The proof drives the positive surface plus fail-closed cases (closed/stale File handle, path traversal via ..//, absent paths, read-only mutation, oversize writes). The existing kernel-backed WASI filesystem smoke (make run-wasi-fs) stays green as the explicitly fixture-labeled kernel Directory/File path. The follow-up cleanup retiring the kernel storage cap backers as production routes has landed – see Kernel Storage Cap Backers Are Fixtures below.
- Backed by RAM (no disk driver yet, data lost on reboot)
- Backed by a real store (persistent userspace service over BlockDevice, survives reboot)
- Services can store and retrieve capnp objects at runtime
- Demonstrate the naming model with a userspace Namespace service
- Namespace.sub() returns new caps via IPC cap transfer
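The A/B commit ordering used by the persistent service (payload into the standby region, flush, then a superblock flip) can be modelled in memory. This sketch is not the CAPOSUS1 format: the `Disk` and `Superblock` types are hypothetical, the regions are plain vectors, and FNV-1a stands in for the real content hash.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Superblock {
    generation: u64,
    len: usize,
    checksum: u64,
}

/// In-memory stand-in for the device: two payload regions plus one superblock.
struct Disk {
    regions: [Vec<u8>; 2],
    superblock: Superblock,
}

/// FNV-1a, used here only as a stand-in for the snapshot content hash.
fn checksum(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

impl Disk {
    fn new() -> Self {
        Disk {
            regions: [Vec::new(), Vec::new()],
            superblock: Superblock { generation: 0, len: 0, checksum: checksum(&[]) },
        }
    }

    /// Step 1: write the new payload into the region the live generation does not use.
    fn write_standby(&mut self, payload: &[u8]) {
        let standby = ((self.superblock.generation + 1) % 2) as usize;
        self.regions[standby] = payload.to_vec();
    }

    /// Step 2 (after a flush on real hardware): one small write flips the generation.
    fn flip_superblock(&mut self, payload: &[u8]) {
        self.superblock = Superblock {
            generation: self.superblock.generation + 1,
            len: payload.len(),
            checksum: checksum(payload),
        };
    }

    fn commit(&mut self, payload: &[u8]) {
        self.write_standby(payload);
        self.flip_superblock(payload);
    }

    /// Load the region selected by generation parity and verify it.
    fn load(&self) -> Option<&[u8]> {
        let live = &self.regions[(self.superblock.generation % 2) as usize];
        let bytes = live.get(..self.superblock.len)?;
        if checksum(bytes) == self.superblock.checksum {
            Some(bytes)
        } else {
            None
        }
    }
}

fn main() {
    let mut disk = Disk::new();
    disk.commit(b"v1");
    disk.write_standby(b"torn-garbage"); // crash before the superblock flip
    println!("{:?}", disk.load() == Some(&b"v1"[..]));
}
```

Because the superblock still points at the old region until the final write, garbage in the standby region is never selected by `load`, which is the torn-commit case the two-boot proof exercises.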
Kernel Storage Cap Backers Are Fixtures
The kernel Store, Namespace, File, Directory, readOnlyFsRoot,
persistentStore, and writableFsRoot grant sources were the proof paths that
landed the typed storage interfaces. Now that the userspace services above own
the production serve boundary – the RAM Store service
(demos/store-service, make run-userspace-store-smoke), the disk-backed
Store + Namespace service (demos/storage-persist-service,
make run-storage-persist-service), and the Directory + File filesystem
service (demos/storage-fs-service, make run-userspace-directory-file-smoke)
– the kernel backers are explicitly proof/fixture surface, not production
storage routes. Production storage is userspace-served; no production manifest
grants kernel-owned storage state ownership (the default system.cue boot
grants none).
The kernel grant sources are gated accordingly:
- The RAM-backed file/directory/store/namespace sources are gated behind the qemu feature in both the bootstrap cap-table builder (kernel/src/cap/mod.rs) and the ProcessSpawner spawn-grant path (kernel/src/cap/process_spawner.rs). The default non-qemu production kernel fails closed on these sources. They remain available only as the in-RAM pseudo-fs backing for the qemu interface proofs (make run-store-namespace-smoke, make run-file-server-smoke, make run-directory-server-smoke, make run-storage-naming) and for the POSIX/WASI/dash adapter smokes (make run-posix-*, make run-wasi-fs).
- The disk-backed virtio read_only_fs_root/persistent_store/writable_fs_root sources (kernel/src/cap/readonly_fs.rs, persistent_store.rs, writable_fs.rs) were already gated behind qemu (with storage_fat_read/cloud_*_over_nvme_proof variants for the FAT and NVMe proof arms) and fail closed in the default production kernel. They back the storage regression proofs make run-storage-fs, make run-storage-persist, and make run-storage-writable (plus the FAT and NVMe proof targets), which stay green as explicitly fixture-labeled kernel paths.
In short: the kernel keeps these backers only as named qemu/cloud-proof fixtures; a default production build has no kernel storage grant route, so the typed storage interfaces are served from userspace.
Phase 4: BlockDevice Drivers and Filesystem (after virtio infrastructure)
- virtio-blk driver (userspace, reuses virtqueue infrastructure from networking smoke test)
- BlockDevice trait implementation
- FAT filesystem service: wraps BlockDevice, exports Directory/File caps
- SharedBuffer integration for bulk reads (depends on Stage 6 MemoryObject)
- Store service uses BlockDevice for persistence (the persistent userspace Store + Namespace service above, make run-storage-persist-service)
- System state survives reboot via the persistent userspace store (make run-storage-persist-service); manifest restore hints remain future work
Phase 5: Network Store (after networking)
- Store service can replicate to or fetch from a remote store
- Capability references transparently span machines
- Directory cap backed by a remote filesystem (9P-style)
- Managed cloud bridges can back selected Store/Namespace or app-specific SaveStore capabilities without changing caller authority. First target: GCP-backed profile/ledger/snapshot storage for the adventure demo, with local fake-cloud tests and no provider credentials in ordinary clients.
- User-owned browser transport can store encrypted save capsules in Google Drive appDataFolder or Firebase user documents. This is for private backup/sync, not authoritative shared state.
Relationship to Other Proposals
- Networking proposal — the NIC driver and net stack are services described in the manifest, not hardcoded. The store could be backed by network storage once networking works. A remote Directory cap (9P over capnp) reuses the same File/Directory interfaces.
- Service architecture proposal — the manifest replaces code-as-config for init. ProcessSpawner, supervision, and cap export work as described there, but driven by manifest data instead of compiled Rust code. IPC Endpoints are the mechanism for service export.
- Capability model — IPC cap transfer (Endpoint + RETURN SQE) is the mechanism that makes open() and resolve() work. SharedBuffer is the bulk data path that makes file I/O practical. Both are tracked in docs/roadmap.md Stage 6.
- POSIX Adapter — Phase P1.4 (vendored dash port) consumes the Namespace+File+Directory cap surface defined here; that surface landed as RAM-backed kernel CapObjects in Phase 3 slices 1-3 and is the v0 backing for the dash smoke’s read-only in-rodata pseudo-fs. P1.3 (recording-shim pipe + fork-for-exec) has already landed without storage caps, so P1.4 is the next adapter consumer. The POSIX path resolver, open/read/write/stat/unlink, /etc and /lib preopen scoping, and the dash port itself all sit on this proposal’s Phase 2/3 schema.
- WASI Host Adapter — Phase W.5 (Preview 1 filesystem: fd_read/fd_write/fd_seek/fd_pread/fd_pwrite/fd_filestat_get/path_open/path_filestat_get/path_unlink_file) consumes the same cap shape and is unblocked from the cap-surface side (Phase 3 slices 1-3 land the RAM-backed Directory/Namespace/File caps). Preopened-dir fds map to Namespace caps from the manifest; path_open resolves through that namespace’s Store/File capability. Phases W.2/W.3/W.4 (stdout, argv-grant, random_get) shipped without storage caps, so W.5 is the next adapter consumer alongside POSIX P1.4.
- Userspace Binaries Parts 4 and 5 — the POSIX adapter (Part 4) and the WASI host adapter (Part 5) both describe their filesystem stories as translations onto this proposal’s Namespace/Directory/File/Store surface. Part 4 sketches the Namespace-rooted POSIX fd table and the Namespace + Store -> file I/O translation; Part 5 maps each preopened-dir fd to a Namespace cap.
- Adventure game proposal — profile, expedition, ledger, and content persistence use application-level save records through Store/Namespace or an app-specific cloud bridge. The game should not persist by snapshotting a live process or exposing provider credentials to clients.
- Cryptography/key-management and volume-encryption proposals — the Cloud KMS path uses envelope encryption. KMS wraps DEKs under KEKs; capOS services use local SymmetricKey authority for plaintext operations.
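The preopen scoping mentioned for the POSIX and WASI adapters can be pictured as a small path resolver that maps an absolute path onto a granted Namespace cap and refuses anything that climbs out of it. The sketch below is only an illustration: NamespaceCap, Preopens, and resolve are hypothetical names, the cap is modelled as a plain integer handle, and symlinks are not considered.

```rust
/// Hypothetical handle standing in for a Namespace capability.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct NamespaceCap(pub u32);

/// Table of preopened directory prefixes and the caps that back them.
pub struct Preopens {
    entries: Vec<(String, NamespaceCap)>,
}

impl Preopens {
    pub fn new() -> Self {
        Preopens { entries: Vec::new() }
    }

    /// Register a preopen such as "/etc"; a trailing slash is ignored.
    pub fn add(&mut self, prefix: &str, cap: NamespaceCap) {
        self.entries
            .push((prefix.trim_end_matches('/').to_string(), cap));
    }

    /// Map an absolute POSIX path to (namespace cap, components relative to it).
    /// Returns None if no preopen covers the path or if ".." would escape it.
    pub fn resolve(&self, path: &str) -> Option<(NamespaceCap, Vec<String>)> {
        // The longest matching prefix wins; a match must end on a '/' boundary.
        let (prefix, cap) = self
            .entries
            .iter()
            .filter(|(p, _)| path == p.as_str() || path.starts_with(&format!("{}/", p)))
            .max_by_key(|(p, _)| p.len())?;

        let mut parts: Vec<String> = Vec::new();
        for comp in path[prefix.len()..].split('/') {
            match comp {
                "" | "." => {}
                // Popping from an empty stack means the path leaves the preopen.
                ".." => {
                    parts.pop()?;
                }
                name => parts.push(name.to_string()),
            }
        }
        Some((*cap, parts))
    }
}
```

With /etc and /lib registered, "/etc/passwd" resolves to the /etc cap plus ["passwd"], while "/etc/../lib/x" is rejected rather than silently redirected to /lib, which keeps each lookup inside the authority of a single granted cap.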
Open Questions
- Manifest validation. How much can the build tool verify statically? Cap export names depend on runtime behavior of services. Should services declare their exports in their own metadata (like a package manifest)?
- Schema evolution. When a service’s capnp interface changes, stored objects referencing the old schema need migration. Cap’n Proto has backwards-compatible schema evolution, but breaking changes need a story.
- Garbage collection. Content-addressed store accumulates unreferenced objects. Who GCs? A separate service with Store read + delete authority? Reference counting in the namespace layer?
- Large objects. Storing multi-megabyte binaries as single capnp Data fields is wasteful (capnp allocates contiguously). SharedBuffer partially addresses this for I/O, but the Store’s put/get interface still takes Data. Options: chunked storage (Merkle tree of hashes), a streaming Blob interface, or SharedBuffer-aware Store methods.
- Trust model for the manifest. The boot manifest has full authority to define the system. Who signs it? How do you prevent a tampered ISO from granting excessive caps? Secure boot integration?
- File locking and concurrent access. Multiple processes opening the same file through the same filesystem service need coordination. Options: mandatory locking in the filesystem service (rejects conflicting opens), advisory locking via a separate Lock capability, or single-writer enforcement at the Directory level (open with exclusive flag).
- RETURN+RECV atomicity. When a server posts a RETURN SQE followed by a RECV SQE, there must be no window where a client call can arrive but the server isn’t listening. SQE LINK chaining (RETURN → RECV) should provide this atomicity: the kernel processes both SQEs as a unit.
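For the garbage-collection question above, one candidate is a separate collector service that marks everything reachable from the namespace layer's bindings and deletes the rest. The sketch below shows that mark-and-sweep pass over a toy in-memory store; Key, Object, find_garbage, and collect are names invented for this example, and the explicit refs list is an assumption about how stored objects would expose their outgoing references.

```rust
use std::collections::{HashMap, HashSet};

/// Stand-in for a content hash.
pub type Key = u64;

/// Toy stored object: payload plus the keys of the objects it references.
pub struct Object {
    pub refs: Vec<Key>,
    pub data: Vec<u8>,
}

/// Mark every object reachable from the roots and return the keys that were not reached.
pub fn find_garbage(objects: &HashMap<Key, Object>, roots: &[Key]) -> HashSet<Key> {
    let mut marked: HashSet<Key> = HashSet::new();
    let mut stack: Vec<Key> = roots.to_vec();
    while let Some(k) = stack.pop() {
        // insert returns false if the key was already marked.
        if !marked.insert(k) {
            continue;
        }
        if let Some(obj) = objects.get(&k) {
            stack.extend(obj.refs.iter().copied());
        }
    }
    objects
        .keys()
        .copied()
        .filter(|k| !marked.contains(k))
        .collect()
}

/// Sweep phase: delete unreachable objects and report how many were removed.
/// In the service framing, this is the step that needs delete authority.
pub fn collect(objects: &mut HashMap<Key, Object>, roots: &[Key]) -> usize {
    let dead = find_garbage(objects, roots);
    for k in &dead {
        objects.remove(k);
    }
    dead.len()
}
```

Here roots would be the set of keys currently bound in live namespaces. The sketch ignores the harder part of the question, which is how the collector learns the roots and how it avoids racing with a put that is about to be bound.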