Real-Filesystem Decision: Role-Split, Not One Format

Decision

capOS does not adopt a single general-purpose on-disk filesystem. It adopts a role-split in which each storage role uses the format that fits it, behind the same capability interfaces:

(A) capOS-managed data and state stays capnp-native. Evolve the existing CAPOSWF1 writable-filesystem and CAPOSST1 persistent-store fixed layouts (kernel/src/cap/writable_fs.rs, kernel/src/cap/persistent_store.rs); do not replace them with a general-purpose format. These already have a crash-consistency proof in tree (make run-storage-writable-recovery), so a format swap would discard a tested durability story for no consumer benefit.
(B) Host-populated and interop images gain READ-ONLY FAT32. Add a read-only FAT32 Directory/File backer over the existing BlockDevice, using the fatfs no_std crate. FAT32 is the one standard interop format with a maintained no_std read crate and zero licensing risk (the FAT long-name patents have expired; fatfs is MIT). It is already structurally part of the boot path – the EFI System Partition Limine reads is FAT32 (docs/backlog/hardware-boot-storage.md).
(C) Host tooling consolidates onto one capnp image tool. Retire the per-format tools/mkstorage-*.py byte-offset scripts (each hand-encodes a fixed layout at literal offsets) in favor of one schema-driven image tool, so the on-disk layout has a single typed source of truth instead of N parallel offset hazards.

Why the Capability Layer Is Unchanged

The Directory, File, and Store interfaces in schema/capos.capnp are the contract; the on-disk format lives below them as another CapObject backer, so adding FAT32 adds no schema surface and no new caller-visible behavior. The interfaces already model every operation a format backer must answer:

Directory: open @0, list @1, mkdir @2, remove @3, sub @4, create @5, rename @6 (schema/capos.capnp:1824).
File: read @0, write @1, stat @2, truncate @3, sync @4, close @5 (schema/capos.capnp:1793).
Store: put @0, get @1, has @2, delete @3 (schema/capos.capnp:1857).

The kernel backers (readonly_fs.rs, writable_fs.rs, persistent_store.rs, and the RAM file/directory/store/namespace caps) are proof/fixture surface, not production storage routes: they are gated behind the qemu feature (with storage_fat_read / cloud_*_over_nvme_proof variants) and fail closed in the default production kernel. Production storage is userspace-served by the demos/storage-fs-service, demos/storage-persist-service, and demos/store-service services; see Kernel Storage Cap Backers Are Fixtures. The role-split in this document still governs which on-disk format sits beneath the cap interfaces in those proofs and in any future userspace format backer.

A read-only backer answers the read/list/open/stat methods and fails closed on every mutation, exactly as readonly_fs.rs does today (kernel/src/cap/readonly_fs.rs:618 rejects mkdir/remove/sub/create/rename). Attenuation is structural, not a rights bitmask: a read-only File is a wrapper that rejects write/truncate/sync, per the schema comment at schema/capos.capnp:1798.
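The structural attenuation described above can be sketched as a wrapper type. This is a minimal illustration, not the kernel's code: the File trait, the FsError enum, MemFile, and ReadOnlyFile are hypothetical stand-ins for the capnp-generated interface, and the method signatures are simplified assumptions.

```rust
/// Hypothetical error type; `ReadOnly` models the fail-closed mutation arm.
#[derive(Debug, PartialEq, Eq)]
pub enum FsError {
    ReadOnly,
    OutOfRange,
}

/// Simplified stand-in for the File interface (assumed signatures).
pub trait File {
    fn read(&self, offset: usize, buf: &mut [u8]) -> Result<usize, FsError>;
    fn write(&mut self, offset: usize, data: &[u8]) -> Result<usize, FsError>;
    fn truncate(&mut self, len: usize) -> Result<(), FsError>;
    fn sync(&mut self) -> Result<(), FsError>;
}

/// Trivial in-memory file, used only to exercise the wrapper.
pub struct MemFile {
    pub data: Vec<u8>,
}

impl File for MemFile {
    fn read(&self, offset: usize, buf: &mut [u8]) -> Result<usize, FsError> {
        if offset > self.data.len() {
            return Err(FsError::OutOfRange);
        }
        let n = buf.len().min(self.data.len() - offset);
        buf[..n].copy_from_slice(&self.data[offset..offset + n]);
        Ok(n)
    }

    fn write(&mut self, offset: usize, data: &[u8]) -> Result<usize, FsError> {
        if offset > self.data.len() {
            return Err(FsError::OutOfRange);
        }
        let end = offset + data.len();
        if end > self.data.len() {
            self.data.resize(end, 0);
        }
        self.data[offset..end].copy_from_slice(data);
        Ok(data.len())
    }

    fn truncate(&mut self, len: usize) -> Result<(), FsError> {
        self.data.resize(len, 0);
        Ok(())
    }

    fn sync(&mut self) -> Result<(), FsError> {
        Ok(())
    }
}

/// Attenuation by wrapping: there is no rights bitmask to check, the
/// wrapper simply has no path that reaches the inner mutation methods.
pub struct ReadOnlyFile<F: File> {
    inner: F,
}

impl<F: File> ReadOnlyFile<F> {
    pub fn new(inner: F) -> Self {
        Self { inner }
    }
}

impl<F: File> File for ReadOnlyFile<F> {
    fn read(&self, offset: usize, buf: &mut [u8]) -> Result<usize, FsError> {
        self.inner.read(offset, buf)
    }

    fn write(&mut self, _offset: usize, _data: &[u8]) -> Result<usize, FsError> {
        Err(FsError::ReadOnly)
    }

    fn truncate(&mut self, _len: usize) -> Result<(), FsError> {
        Err(FsError::ReadOnly)
    }

    fn sync(&mut self) -> Result<(), FsError> {
        Err(FsError::ReadOnly)
    }
}
```

Because the wrapper owns the inner file and exposes no accessor for it, a holder of the wrapped value cannot recover write access, which is the property a rights bitmask would otherwise have to enforce at every call site.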

Known caveat (partially lifted): stat/info timestamps were originally stubbed to zero in every filesystem backer. The Slice 4 timestamp increments lift this for two backers. The CAPOSWF1 writable filesystem now persists real created/modified timestamps in the node record, carries the corresponding ClockProvenance label from the same WallClock source, and returns the timestamp values from File.stat (proof make run-storage-writable). The FAT32 read-only backer surfaces valid FAT directory-entry created/modified values through the same File.stat fields (see Slice 4 in the Phased Plan). The read-only CAPOSRO1 and persistent_store CAPOSST1 backers still expose zero/unknown timestamp state; those remain named Slice 4 follow-ups.
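For reference, FAT directory-entry timestamps are packed 16-bit date and time fields, which is where the two-second granularity and the lack of a timezone come from. The sketch below decodes one date/time pair using the standard FAT bit layout; the FatDateTime type and the decode_fat_datetime function are hypothetical names for illustration and are not taken from fat_fs.rs or the vendored fatfs crate.

```rust
/// Decoded FAT timestamp. FAT stores local time with no timezone,
/// so none is represented here.
#[derive(Debug, PartialEq, Eq)]
pub struct FatDateTime {
    pub year: u16,
    pub month: u8,
    pub day: u8,
    pub hour: u8,
    pub minute: u8,
    pub second: u8,
}

/// Decode a FAT date/time pair.
///
/// Date: bits 15-9 = years since 1980, bits 8-5 = month, bits 4-0 = day.
/// Time: bits 15-11 = hours, bits 10-5 = minutes, bits 4-0 = seconds / 2.
/// Returns None for out-of-range fields (for example an all-zero date),
/// so a caller can report "unknown" instead of a bogus timestamp.
pub fn decode_fat_datetime(date: u16, time: u16) -> Option<FatDateTime> {
    let year = 1980 + (date >> 9);
    let month = ((date >> 5) & 0x0F) as u8;
    let day = (date & 0x1F) as u8;
    let hour = (time >> 11) as u8;
    let minute = ((time >> 5) & 0x3F) as u8;
    // Seconds are stored halved, hence the two-second resolution.
    let second = ((time & 0x1F) * 2) as u8;

    let valid = (1..=12).contains(&month)
        && (1..=31).contains(&day)
        && hour <= 23
        && minute <= 59
        && second <= 59;
    if !valid {
        return None;
    }
    Some(FatDateTime { year, month, day, hour, minute, second })
}
```

Rejecting invalid fields rather than clamping them keeps a zeroed directory entry from being reported as a real date.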

Why Not ext4 / exFAT / littlefs / FAT-Write

ext4-read: deferred under an explicit trigger. capOS reads no real third-party filesystem today and does not need to for boot: Limine reads the FAT32 ESP, the kernel image is include_bytes! or read from ISO 9660 (kernel/src/iso/), and the cloud boot disk is a capOS-authored GPT + FAT-ESP, never a provider ext4 root. That collapses the usual “must read the provider’s ext4 root” argument. ext4-read is deferred behind a single explicit trigger: capOS must read a disk it did not format. Until that exists, ext4’s large read-only parser surface buys nothing.
ext4-write: rejected. It would be the first writable real-disk format and has no crash-consistency story in tree; landing it without a recovery proof regresses the durability bar CAPOSWF1 already meets.
exFAT: rejected. Patent surface, no role advantage over FAT32 for the host-interop slot.
littlefs / SimpleFS: rejected. FFI plus vendoring cost with no winning role – managed state is already served by the capnp-native layouts, and host-interop wants a format the host actually writes (FAT32).
FAT-write: rejected for now. No crash-consistency story; it would be the first writable format landing without a recovery proof. FAT32 stays read-only in this decision.

Decision Matrix

Axes: host-interop fit; no_std read/write implementation cost; crash-consistency story; capability/capnp fit; cloud-disk-read need today; licensing; available crates.

| Format | Host-interop | no_std read / write cost | Crash-consistency | capnp fit | Cloud-disk-read need | Licensing | Crates |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FAT32 (read-only) | High (host writes it; ESP already FAT32) | Read: low (fatfs) / write: out of scope | n/a (read-only) | Backer below `Directory`/`File` | n/a (capOS authors its disks) | Clean (FAT patents expired; fatfs MIT) | `fatfs` no_std |
| exFAT | Medium | High / High | n/a | Same | n/a | Patent surface | None no_std mature |
| ext4-read | Low (no consumer today) | High (large parser) / n/a | n/a (read-only) | Same | None today (trigger only) | Clean | None mature no_std |
| ext4-write | Low | Very high / very high | None in tree | Same | None | Clean | None mature no_std |
| littlefs / SimpleFS | Low | Medium (FFI+vendor) / medium | Has its own story | Same | None | Clean | FFI/vendor |
| capnp-native (`CAPOSWF1`/`CAPOSST1`) | None (capOS-only) | Already in tree | Proven (`run-storage-writable-recovery`) | Native | n/a | Clean | In tree |

Phased Plan

Slice 0 (this doc). Record the role-split decision and the matrix.
Slice 1 (landed 2026-06-02 20:59 UTC). Vendored fatfs (with VENDORED_FROM.md, vendor/fatfs-no_std/) and added a read-only FAT32 Directory/File backer over virtio-blk: kernel/src/cap/fat_fs.rs, a BlockStorage adapter over the virtio-blk BlockDevice driving the vendored fatfs read path. Host image built with real mkfs.fat + mcopy (2 files, one multi-cluster). Smoke make run-storage-fat-read reads the multi-cluster file back through Directory.open -> File.read and asserts the bytes plus the fail-closed mutations. Grant-source realization deviation: the task text proposed a new fat_fs_root KernelCapSource, but KernelCapSource is a schema/capos.capnp enum (and capos-config decode) outside the task’s write_scope. The backer is instead selected under a new storage_fat_read kernel feature on the existing read_only_fs_root source – mirroring how that source already selects its Virtio vs NVMe backend – so it needs no new KernelCapSource and no schema change, keeping the conflict surface disjoint from the in-flight NVMe graduation (which edits readonly_fs/writable_fs/persistent_store). Provenance map: FAT32 (read-only backer). Task record: cloud-prod-fat32-readonly-over-virtio-blockdevice-local-proof.
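The adapter idea in Slice 1 can be illustrated as follows: a FAT parser wants byte-granular seek and read, while a block device only answers whole-block reads. This is a minimal sketch under assumptions. The BlockDevice trait shape, the 512-byte block size, IoError, and RamDisk are hypothetical, and the BlockStorage here is not the kernel's adapter or the vendored fatfs I/O trait.

```rust
pub const BLOCK_SIZE: usize = 512; // assumed block size for the sketch

#[derive(Debug, PartialEq, Eq)]
pub enum IoError {
    OutOfRange,
}

/// Hypothetical whole-block read interface.
pub trait BlockDevice {
    fn block_count(&self) -> u64;
    fn read_block(&self, lba: u64, buf: &mut [u8; BLOCK_SIZE]) -> Result<(), IoError>;
}

/// Byte-stream view over a block device: tracks a byte position and
/// splits each read into per-block device reads.
pub struct BlockStorage<D: BlockDevice> {
    dev: D,
    pos: u64,
}

impl<D: BlockDevice> BlockStorage<D> {
    pub fn new(dev: D) -> Self {
        Self { dev, pos: 0 }
    }

    pub fn seek(&mut self, pos: u64) {
        self.pos = pos;
    }

    /// Read up to buf.len() bytes from the current position.
    /// Returns fewer bytes (possibly zero) at the end of the device.
    pub fn read(&mut self, buf: &mut [u8]) -> Result<usize, IoError> {
        let total = self.dev.block_count() * BLOCK_SIZE as u64;
        let mut done = 0;
        let mut block = [0u8; BLOCK_SIZE];
        while done < buf.len() && self.pos < total {
            let lba = self.pos / BLOCK_SIZE as u64;
            let off = (self.pos % BLOCK_SIZE as u64) as usize;
            self.dev.read_block(lba, &mut block)?;
            let n = (BLOCK_SIZE - off).min(buf.len() - done);
            buf[done..done + n].copy_from_slice(&block[off..off + n]);
            done += n;
            self.pos += n as u64;
        }
        Ok(done)
    }
}

/// In-memory device used only to exercise the adapter.
pub struct RamDisk(pub Vec<u8>);

impl BlockDevice for RamDisk {
    fn block_count(&self) -> u64 {
        (self.0.len() / BLOCK_SIZE) as u64
    }

    fn read_block(&self, lba: u64, buf: &mut [u8; BLOCK_SIZE]) -> Result<(), IoError> {
        let start = (lba as usize)
            .checked_mul(BLOCK_SIZE)
            .ok_or(IoError::OutOfRange)?;
        let end = start.checked_add(BLOCK_SIZE).ok_or(IoError::OutOfRange)?;
        let src = self.0.get(start..end).ok_or(IoError::OutOfRange)?;
        buf.copy_from_slice(src);
        Ok(())
    }
}
```

The adapter has no write path at all, which matches the read-only role of the FAT32 backer: mutation cannot reach the device through it.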
Slice 2 (landed 2026-06-03 01:44 UTC). FAT32 read over the NVMe BlockDevice arm. Its prerequisite – the NVMe read-arm graduation (cloud-prod-nvme-storage-graduate-readarm-local-proof) – had landed, so the slice stacks on an always-built read arm rather than a per-proof feature: it added an Nvme BlockSource variant to fat_fs.rs (deferred mount via FatMount, mirroring readonly_fs’s NVMe arm) and proves a host-authored mkfs.fat image (the pre-populated NVMe medium content, no manager seed) read back over the graduated NVMe read arm behind the unchanged Directory/File cap contract. Selected by a new non-qemu cloud_fat_read_over_nvme_proof feature on the existing read_only_fs_root source (no new KernelCapSource, no schema change); its cap-waiter Interrupt route + provider-fat-read-over-nvme marker come from kernel/src/cap/fat_read_over_nvme_proof.rs. Because the FAT cluster-chain walk issues many single reads per boot, the proof raises the I/O queue depth to 64. Proof: make run-cloud-provider-fat-read-over-nvme. Task record: cloud-prod-fat32-readonly-over-nvme-blockdevice-local-proof.
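One way to see why a cluster-chain walk produces many small reads: each hop needs the FAT entry for the current cluster before the next cluster is known, so the lookups are serialized. The sketch below follows a FAT32 chain using the standard entry encoding (low 28 bits significant, values at or above 0x0FFFFFF8 mark end of chain). The walk_chain function, the ChainError type, and the read_fat_entry callback are hypothetical; in a real backer each callback invocation may become a device read.

```rust
const FAT32_MASK: u32 = 0x0FFF_FFFF; // top 4 bits of an entry are reserved
const FAT32_EOC_MIN: u32 = 0x0FFF_FFF8; // end-of-chain marker range starts here
const FAT32_BAD: u32 = 0x0FFF_FFF7; // bad-cluster marker

#[derive(Debug, PartialEq, Eq)]
pub enum ChainError {
    Corrupt,
    TooLong,
}

/// Follow a FAT32 cluster chain starting at `first`.
///
/// `read_fat_entry` returns the raw FAT entry for a cluster, or None if
/// the cluster index is outside the table. `max_clusters` bounds the walk
/// so a cyclic chain on a damaged image cannot loop forever.
pub fn walk_chain(
    first: u32,
    max_clusters: usize,
    mut read_fat_entry: impl FnMut(u32) -> Option<u32>,
) -> Result<Vec<u32>, ChainError> {
    let mut chain = Vec::new();
    let mut cur = first & FAT32_MASK;
    loop {
        // Clusters 0 and 1 are reserved; data clusters start at 2.
        if cur < 2 || cur == FAT32_BAD {
            return Err(ChainError::Corrupt);
        }
        if chain.len() == max_clusters {
            return Err(ChainError::TooLong);
        }
        chain.push(cur);
        let next = read_fat_entry(cur).ok_or(ChainError::Corrupt)? & FAT32_MASK;
        if next >= FAT32_EOC_MIN {
            return Ok(chain);
        }
        cur = next;
    }
}
```

A file spanning N clusters costs N FAT lookups in this form, in addition to the data reads themselves.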
Slice 3 (first increment landed 2026-06-03 03:36 UTC; second increment landed 2026-06-03 04:08 UTC; third increment landed 2026-06-03 05:47 UTC; fourth increment landed 2026-06-03 08:25 UTC; seeded installable writable increment landed 2026-06-06 13:38 UTC at ac0c5e2d; final fixture retirement; CAPOSST1 + empty/seeded co-located CAPOSWF1 + CAPOSRO1 + NVMe-writable CAPOSWF1). The host capnp image tool retired the hand-encoded capnp-layout Python fixtures one layout at a time. The first increment ported the CAPOSST1 persistent-Store image producer from the retired byte-offset script tools/mkstore-image.py to a typed Rust host tool (tools/mkstore-image/, a standalone host crate built on the host target via cargo test-mkstore-image, like tools/mkmanifest/). Later increments added --writable, --readonly-fs, --writable-nvme, and seeded --writable modes for the empty co-located CAPOSST1+CAPOSWF1 image, the CAPOSRO1 read-only filesystem image, the fixed-size (NVME_NAMESPACE_BLOCKS = 32768-block / 16 MiB) NVMe-writable CAPOSWF1 namespace image, and the installable-system seeded writable variants. The kernel CAPOSST1/CAPOSWF1/CAPOSRO1 layouts (including NVME_NAMESPACE_BLOCKS), the Store/Directory/File contracts, and the disk bytes the kernel reads are all unchanged: the earlier migration proved byte identity against the retired Python outputs, and cargo test-mkstore-image now pins the maintained Rust outputs with golden byte checks. The re-pointed reboot/recovery/read-only proofs stay green reading the tool-produced image. The host-authored FAT image path (tools/mkstorage-fat-read-image.py) stays on real mkfs.fat/mcopy tooling; it is not a hand-rolled capnp byte-offset layout, so it is not a target for the typed capnp image tool. The Python capnp-layout builders have been retired; the Rust tool is the maintained capnp-native fixture path.
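The difference between a byte-offset script and a typed producer pinned by golden bytes can be sketched in a few lines. Everything about the layout below is an assumption made for illustration: the 16-byte header, the field names, and the use of the format name as an 8-byte magic are hypothetical and do not describe the real CAPOSST1 layout or the tools/mkstore-image code.

```rust
/// Hypothetical magic: the format name treated as 8 ASCII bytes.
pub const MAGIC: [u8; 8] = *b"CAPOSST1";

/// Hypothetical header. The typed struct is the single source of truth
/// for field order and width; no caller writes at a literal offset.
#[derive(Debug, PartialEq, Eq)]
pub struct ImageHeader {
    pub version: u32,
    pub entry_capacity: u32,
}

impl ImageHeader {
    pub const LEN: usize = 16;

    /// Encode as magic, then version and entry_capacity in little-endian.
    pub fn encode(&self) -> [u8; Self::LEN] {
        let mut out = [0u8; Self::LEN];
        out[0..8].copy_from_slice(&MAGIC);
        out[8..12].copy_from_slice(&self.version.to_le_bytes());
        out[12..16].copy_from_slice(&self.entry_capacity.to_le_bytes());
        out
    }

    /// Decode, rejecting short input or a wrong magic.
    pub fn decode(bytes: &[u8]) -> Option<Self> {
        if bytes.len() < Self::LEN || bytes[..8] != MAGIC {
            return None;
        }
        let mut v = [0u8; 4];
        v.copy_from_slice(&bytes[8..12]);
        let mut c = [0u8; 4];
        c.copy_from_slice(&bytes[12..16]);
        Some(Self {
            version: u32::from_le_bytes(v),
            entry_capacity: u32::from_le_bytes(c),
        })
    }
}

/// Golden bytes for version 1 with 64 entries. A golden check compares
/// the encoder's output to this constant, so any accidental layout
/// change fails the host test before an image reaches the kernel.
pub const GOLDEN_V1_64: [u8; 16] = [
    b'C', b'A', b'P', b'O', b'S', b'S', b'T', b'1', // magic
    1, 0, 0, 0, // version = 1
    0x40, 0, 0, 0, // entry_capacity = 64
];
```

A golden check of this kind pins output bytes rather than behavior, which is the property that matters when the kernel-side reader must stay unchanged.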
Slice 4 (decomposed; FAT and capnp-native increments landed in part). capnp-native enhancements: real stat timestamps and store compaction on the managed layouts. The first bounded increment landed – the CAPOSWF1 writable filesystem now persists created/modified timestamps in the node record’s reserved trailing bytes (no field moved, record stays 128 bytes, format version unchanged) and returns them from File.stat, sourced from the WallClock timebase, with the on-disk layout and the forced-poweroff recovery proof held byte-stable (cloud-prod-fs-capnp-native-stat-timestamps-local-proof, proofs make run-storage-writable / make run-storage-writable-recovery). The provenance increment threads the same WallClock source into the writable backer and uses the node-record provenance bytes to carry the ClockProvenance label alongside created/modified; File.stat remains schema-stable and the local proof records the stored labels through the storage smoke log. The FAT increment now surfaces valid FAT directory-entry created/modified values from the host-authored read-only image through the same schema-stable File.stat fields over both virtio-blk and NVMe. The proof logs distinguish metadata_provenance=fat-directory-entry from CAPOSWF1’s WallClock provenance and keep FAT’s timezone-free/two-second-modified-time limits explicit. The second bounded increment landed CAPOSST1 persistent-Store compaction: when a new put would exhaust the entry table or data cursor and tombstones exist, the kernel rewrites live entries through a shadow generation before recommitting the canonical front generation; make run-storage-persist proves pre-compaction write, delete/tombstone, compaction-triggered write, reboot, post-reboot reads, and tombstone absence (storage-caposst1-store-compaction-local-proof). Remaining follow-ups: timestamps and timestamp provenance on the other managed/read-only layouts (CAPOSST1 Store, CAPOSRO1).
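The compaction trigger described for CAPOSST1 (a put that would exhaust the table, with tombstones present) can be modeled in memory. This is a deliberately simplified sketch: the Store, Slot, and StoreError types are hypothetical, entries live in a Vec rather than on disk, and the "shadow generation" is reduced to building a new vector and swapping it in, with a counter standing in for the generation commit.

```rust
#[derive(Clone, Debug)]
enum Slot {
    Live { key: String, value: Vec<u8> },
    Tombstone,
}

#[derive(Debug, PartialEq, Eq)]
pub enum StoreError {
    Full,
}

/// Append-only entry table with a fixed capacity (hypothetical model).
pub struct Store {
    capacity: usize,
    slots: Vec<Slot>,
    /// Incremented each time a compacted table replaces the old one.
    pub generation: u64,
}

impl Store {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, slots: Vec::new(), generation: 0 }
    }

    pub fn get(&self, key: &str) -> Option<&[u8]> {
        self.slots.iter().find_map(|s| match s {
            Slot::Live { key: k, value } if k.as_str() == key => Some(value.as_slice()),
            _ => None,
        })
    }

    /// Deleting leaves a tombstone; the slot is not reusable until compaction.
    pub fn delete(&mut self, key: &str) -> bool {
        for slot in self.slots.iter_mut() {
            if matches!(&*slot, Slot::Live { key: k, .. } if k.as_str() == key) {
                *slot = Slot::Tombstone;
                return true;
            }
        }
        false
    }

    pub fn tombstones(&self) -> usize {
        self.slots.iter().filter(|s| matches!(s, Slot::Tombstone)).count()
    }

    /// Overwrite is modeled as delete plus append. Compaction runs only
    /// when the table is full and at least one tombstone can be reclaimed.
    pub fn put(&mut self, key: &str, value: &[u8]) -> Result<(), StoreError> {
        self.delete(key);
        if self.slots.len() == self.capacity {
            if self.tombstones() == 0 {
                return Err(StoreError::Full);
            }
            self.compact();
        }
        self.slots.push(Slot::Live { key: key.to_string(), value: value.to_vec() });
        Ok(())
    }

    /// Copy live entries into a shadow table, then swap it in.
    fn compact(&mut self) {
        let shadow: Vec<Slot> = self
            .slots
            .iter()
            .filter(|s| matches!(s, Slot::Live { .. }))
            .cloned()
            .collect();
        self.slots = shadow;
        self.generation += 1;
    }
}
```

Building the replacement table completely before swapping mirrors the shadow-then-commit ordering: a reader sees either the old table or the compacted one, never a partially rewritten table. The on-disk version additionally has to make that swap crash-safe, which this model does not attempt.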
Slice 5 (deferred). ext4-read, only once the explicit trigger (“must read a disk capOS did not format”) materializes.

Relationship to the NVMe Graduation

The NVMe BlockDevice graduation and real-FS work are stacked, not competing:

The graduation sits below BlockDevice – it moves the NVMe read/write/flush arms into always-built production behind fail-closed runtime probes (cloud-prod-nvme-storage-graduate-readarm-local-proof).
Real-FS sits above BlockDevice – it adds new CapObject backers (fat_fs.rs) that read through whatever BlockDevice provides.

Slice 1 deliberately reads over virtio-blk and adds a new file, so its conflict surface is disjoint from the graduation’s edits to the existing storage modules. Slice 2 is the join point, sequenced after the graduation landed: it consumes the always-built NVMe read arm (it does not modify it) by adding the Nvme BlockSource arm to the same fat_fs.rs.

Design Grounding

kernel/src/cap/readonly_fs.rs – the read-only Directory/File over BlockSource pattern Slice 1 mirrors, including the fail-closed mutation arm.
kernel/src/cap/writable_fs.rs, kernel/src/cap/persistent_store.rs – the capnp-native managed layouts (CAPOSWF1/CAPOSST1) the decision evolves rather than replaces.
schema/capos.capnp – the Directory/File/Store contract the format backers serve.
docs/backlog/hardware-boot-storage.md – the storage track and the FAT32 ESP/GPT boot-disk facts that collapse the ext4 argument.
