Real-Filesystem Decision: Role-Split, Not One Format
Decision
capOS does not adopt a single general-purpose on-disk filesystem. It adopts a role-split in which each storage role uses the format that fits it, behind the same capability interfaces:
- (A) capOS-managed data and state stays capnp-native. Evolve the existing CAPOSWF1 writable-filesystem and CAPOSST1 persistent-store fixed layouts (kernel/src/cap/writable_fs.rs, kernel/src/cap/persistent_store.rs); do not replace them with a general-purpose format. These already have a crash-consistency proof in tree (make run-storage-writable-recovery), so a format swap would discard a tested durability story for no consumer benefit.
- (B) Host-populated and interop images gain READ-ONLY FAT32. Add a read-only FAT32 Directory/File backer over the existing BlockDevice, using the fatfs no_std crate. FAT32 is the one standard interop format with a maintained no_std read crate and zero licensing risk (the FAT long-name patents have expired; fatfs is MIT). It is already structurally part of the boot path – the EFI System Partition Limine reads is FAT32 (docs/backlog/hardware-boot-storage.md).
- (C) Host tooling consolidates onto one capnp image tool. Retire the per-format tools/mkstorage-*.py byte-offset scripts (each hand-encodes a fixed layout at literal offsets) in favor of one schema-driven image tool, so the on-disk layout has a single typed source of truth instead of N parallel offset hazards.
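
The "single typed source of truth" in (C) can be illustrated with a minimal sketch. The `ImageHeader` type, its fields, and the `EXAMPLE1` magic are hypothetical and are not the real CAPOSST1/CAPOSWF1 layout; the point is only that field order in one typed definition drives both encode and decode, so no script carries its own literal offsets.

```rust
/// Hypothetical fixed-layout header. NOT the real CAPOSST1/CAPOSWF1 format.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ImageHeader {
    pub magic: [u8; 8],
    pub version: u32,
    pub block_count: u32,
}

/// Consume `n` bytes from the front of `input`, or fail if too short.
fn take<'a>(input: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    let current: &'a [u8] = *input;
    if current.len() < n {
        return None;
    }
    let (head, tail) = current.split_at(n);
    *input = tail;
    Some(head)
}

fn take_u32_le(input: &mut &[u8]) -> Option<u32> {
    let b = take(input, 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

impl ImageHeader {
    /// Encoded size follows from the field list, not from an offset table.
    pub const ENCODED_LEN: usize = 8 + 4 + 4;

    /// Fields are appended in declaration order; offsets are implicit.
    pub fn encode(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(Self::ENCODED_LEN);
        out.extend_from_slice(&self.magic);
        out.extend_from_slice(&self.version.to_le_bytes());
        out.extend_from_slice(&self.block_count.to_le_bytes());
        out
    }

    /// Decoding walks the same field order with a cursor.
    pub fn decode(bytes: &[u8]) -> Option<Self> {
        let mut cursor = bytes;
        let mut magic = [0u8; 8];
        magic.copy_from_slice(take(&mut cursor, 8)?);
        let version = take_u32_le(&mut cursor)?;
        let block_count = take_u32_le(&mut cursor)?;
        Some(ImageHeader { magic, version, block_count })
    }
}
```

A golden byte check in the style the plan describes for cargo test-mkstore-image would then compare `encode()` output against a pinned byte array, and a round trip through `decode` would catch a field reordering on either side.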
Why the Capability Layer Is Unchanged
The Directory, File, and Store interfaces in schema/capos.capnp are the
contract; the on-disk format lives below them as another CapObject backer, so
adding FAT32 adds no schema surface and no new caller-visible behavior. The
interfaces already model every operation a format backer must answer:
- Directory: open @0, list @1, mkdir @2, remove @3, sub @4, create @5, rename @6 (schema/capos.capnp:1824).
- File: read @0, write @1, stat @2, truncate @3, sync @4, close @5 (schema/capos.capnp:1793).
- Store: put @0, get @1, has @2, delete @3 (schema/capos.capnp:1857).

The kernel backers (readonly_fs.rs, writable_fs.rs, persistent_store.rs,
and the RAM file/directory/store/namespace caps) are proof/fixture
surface, not production storage routes – they are gated behind the qemu
feature (with storage_fat_read / cloud_*_over_nvme_proof variants) and fail
closed in the default production kernel. Production storage is userspace-served
by the demos/storage-fs-service, demos/storage-persist-service, and
demos/store-service services; see
Kernel Storage Cap Backers Are Fixtures.
The role-split in this decision still governs which on-disk format sits beneath
the cap interfaces in those proofs and in any future userspace format backer.

A read-only backer answers the read/list/open/stat methods and fails closed on
every mutation, exactly as readonly_fs.rs does today
(kernel/src/cap/readonly_fs.rs:618 rejects mkdir/remove/sub/create/
rename). Attenuation is structural, not a rights bitmask: a read-only File is
a wrapper that rejects write/truncate/sync, per the schema comment at
schema/capos.capnp:1798.
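
Structural attenuation can be sketched as a wrapper type. This is a minimal illustration, not the code in readonly_fs.rs: the `File` trait here is a hypothetical Rust stand-in for a subset of the schema's File methods, and `MemFile`, `ReadOnly`, and `FsError` are invented names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum FsError {
    ReadOnly,
    OutOfRange,
}

/// Hypothetical stand-in for a subset of the File interface.
pub trait File {
    fn read(&self, offset: usize, len: usize) -> Result<Vec<u8>, FsError>;
    fn write(&mut self, offset: usize, data: &[u8]) -> Result<usize, FsError>;
    fn truncate(&mut self, len: usize) -> Result<(), FsError>;
    fn sync(&mut self) -> Result<(), FsError>;
}

/// Simple in-memory file used as the writable backer in this sketch.
pub struct MemFile {
    data: Vec<u8>,
}

impl MemFile {
    pub fn new(data: Vec<u8>) -> Self {
        MemFile { data }
    }
}

impl File for MemFile {
    fn read(&self, offset: usize, len: usize) -> Result<Vec<u8>, FsError> {
        if offset > self.data.len() {
            return Err(FsError::OutOfRange);
        }
        let end = self.data.len().min(offset.saturating_add(len));
        Ok(self.data[offset..end].to_vec())
    }

    fn write(&mut self, offset: usize, data: &[u8]) -> Result<usize, FsError> {
        let end = offset.checked_add(data.len()).ok_or(FsError::OutOfRange)?;
        if end > self.data.len() {
            self.data.resize(end, 0);
        }
        self.data[offset..end].copy_from_slice(data);
        Ok(data.len())
    }

    fn truncate(&mut self, len: usize) -> Result<(), FsError> {
        self.data.truncate(len);
        Ok(())
    }

    fn sync(&mut self) -> Result<(), FsError> {
        Ok(())
    }
}

/// Read-only attenuation: there is no rights field to check or forget.
/// The wrapper forwards reads and rejects every mutation by construction.
pub struct ReadOnly<F: File>(pub F);

impl<F: File> File for ReadOnly<F> {
    fn read(&self, offset: usize, len: usize) -> Result<Vec<u8>, FsError> {
        self.0.read(offset, len)
    }
    fn write(&mut self, _offset: usize, _data: &[u8]) -> Result<usize, FsError> {
        Err(FsError::ReadOnly)
    }
    fn truncate(&mut self, _len: usize) -> Result<(), FsError> {
        Err(FsError::ReadOnly)
    }
    fn sync(&mut self) -> Result<(), FsError> {
        Err(FsError::ReadOnly)
    }
}
```

Because the holder of a `ReadOnly<F>` never receives the inner `F`, there is no code path that reaches the mutating methods, which is the property the fail-closed mutation arm provides.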
Known caveat (partially lifted): stat/info timestamps were originally
stubbed to zero in every filesystem backer. The Slice 4 timestamp increments
lift this for two backers. The CAPOSWF1 writable filesystem now persists real
created/modified timestamps in the node record, carries the corresponding
ClockProvenance label from the same WallClock source, and returns the
timestamp values from File.stat (proof make run-storage-writable). The
read-only FAT32 backer surfaces valid FAT directory-entry created/modified
values through the same File.stat fields, as recorded under Slice 4 in the
Phased Plan. The read-only CAPOSRO1 and persistent_store CAPOSST1 backers
still expose zero/unknown timestamp state; those remain named Slice 4
follow-ups.
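
For context on what a FAT directory-entry timestamp can carry: FAT packs each timestamp into a 16-bit date and a 16-bit time with no timezone, and the time field stores seconds divided by two. The sketch below decodes that packed form. `FatDateTime` and `decode_fat_datetime` are hypothetical names for illustration, not the vendored fatfs API or the kernel's File.stat mapping.

```rust
/// Decoded FAT timestamp. Local time with no timezone information.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FatDateTime {
    pub year: u16,
    pub month: u8,
    pub day: u8,
    pub hour: u8,
    pub minute: u8,
    /// Always even: the on-disk field stores seconds / 2.
    pub second: u8,
}

/// Decode the packed 16-bit FAT date and time fields.
///
/// date: bits 15-9 years since 1980, bits 8-5 month, bits 4-0 day
/// time: bits 15-11 hours, bits 10-5 minutes, bits 4-0 seconds / 2
///
/// Returns None for out-of-range fields (for example an all-zero date).
/// Days-per-month are not validated in this sketch.
pub fn decode_fat_datetime(date: u16, time: u16) -> Option<FatDateTime> {
    let year = 1980 + (date >> 9);
    let month = ((date >> 5) & 0x0F) as u8;
    let day = (date & 0x1F) as u8;
    let hour = (time >> 11) as u8;
    let minute = ((time >> 5) & 0x3F) as u8;
    let second = ((time & 0x1F) * 2) as u8;

    let valid = (1..=12).contains(&month)
        && (1..=31).contains(&day)
        && hour < 24
        && minute < 60
        && second < 60;

    if valid {
        Some(FatDateTime { year, month, day, hour, minute, second })
    } else {
        None
    }
}
```

A backer that maps these values into stat fields has to pick a convention for the missing timezone and cannot report odd seconds, which is why the provenance of a FAT-sourced timestamp is worth labelling separately from a WallClock-sourced one.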
Why Not ext4 / exFAT / littlefs / FAT-Write
- ext4-read: deferred under an explicit trigger. capOS reads no real third-party filesystem today and does not need to for boot: Limine reads the FAT32 ESP, the kernel image is included via include_bytes! or read from ISO 9660 (kernel/src/iso/), and the cloud boot disk is a capOS-authored GPT + FAT-ESP, never a provider ext4 root. That collapses the usual “must read the provider’s ext4 root” argument. The single trigger is that capOS must read a disk it did not format. Until that exists, ext4’s large read-only parser surface buys nothing.
- ext4-write: rejected. It would be the first writable real-disk format and has no crash-consistency story in tree; landing it without a recovery proof regresses the durability bar CAPOSWF1 already meets.
- exFAT: rejected. Patent surface, no role advantage over FAT32 for the host-interop slot.
- littlefs / SimpleFS: rejected. FFI plus vendoring cost with no winning role – managed state is already served by the capnp-native layouts, and host-interop wants a format the host actually writes (FAT32).
- FAT-write: rejected for now. No crash-consistency story; it would be the first writable format landing without a recovery proof. FAT32 stays read-only in this decision.
Decision Matrix
Axes: host-interop fit; no_std read/write implementation cost; crash-consistency story; capability/capnp fit; cloud-disk-read need today; licensing; available crates.
| Format | Host-interop | no_std read / write cost | Crash-consistency | capnp fit | Cloud-disk-read need | Licensing | Crates |
|---|---|---|---|---|---|---|---|
| FAT32 (read-only) | High (host writes it; ESP already FAT32) | Read: low (fatfs) / write: out of scope | n/a (read-only) | Backer below Directory/File | n/a (capOS authors its disks) | Clean (FAT patents expired; fatfs MIT) | fatfs no_std |
| exFAT | Medium | High / High | n/a | Same | n/a | Patent surface | None no_std mature |
| ext4-read | Low (no consumer today) | High (large parser) / — | n/a (read-only) | Same | None today (trigger only) | Clean | None mature no_std |
| ext4-write | Low | Very high / very high | None in tree | Same | None | Clean | None mature no_std |
| littlefs / SimpleFS | Low | Medium (FFI+vendor) / medium | Has its own story | Same | None | Clean | FFI/vendor |
| capnp-native (CAPOSWF1/CAPOSST1) | None (capOS-only) | Already in tree | Proven (run-storage-writable-recovery) | Native | n/a | Clean | In tree |
Phased Plan
- Slice 0 (this doc). Record the role-split decision and the matrix.
- Slice 1 (landed 2026-06-02 20:59 UTC). Vendored fatfs (with VENDORED_FROM.md, vendor/fatfs-no_std/) and added a read-only FAT32 Directory/File backer over virtio-blk: kernel/src/cap/fat_fs.rs, a BlockStorage adapter over the virtio-blk BlockDevice driving the vendored fatfs read path. Host image built with real mkfs.fat + mcopy (2 files, one multi-cluster). Smoke make run-storage-fat-read reads the multi-cluster file back through Directory.open -> File.read and asserts the bytes plus the fail-closed mutations. Grant-source realization deviation: the task text proposed a new fat_fs_root KernelCapSource, but KernelCapSource is a schema/capos.capnp enum (and capos-config decode) outside the task’s write_scope. The backer is instead selected under a new storage_fat_read kernel feature on the existing read_only_fs_root source – mirroring how that source already selects its Virtio vs NVMe backend – so it needs no new KernelCapSource and no schema change, keeping the conflict surface disjoint from the in-flight NVMe graduation (which edits readonly_fs/writable_fs/persistent_store). Provenance map: FAT32 (read-only backer). Task record: cloud-prod-fat32-readonly-over-virtio-blockdevice-local-proof.
- Slice 2 (landed 2026-06-03 01:44 UTC). FAT32 read over the NVMe
BlockDevice arm. Its prerequisite – the NVMe read-arm graduation (cloud-prod-nvme-storage-graduate-readarm-local-proof) – had landed, so the slice stacks on an always-built read arm rather than a per-proof feature: it added an Nvme BlockSource variant to fat_fs.rs (deferred mount via FatMount, mirroring readonly_fs’s NVMe arm) and proves a host-authored mkfs.fat image (the pre-populated NVMe medium content, no manager seed) read back over the graduated NVMe read arm behind the unchanged Directory/File cap contract. Selected by a new non-qemu cloud_fat_read_over_nvme_proof feature on the existing read_only_fs_root source (no new KernelCapSource, no schema change); its cap-waiter Interrupt route + provider-fat-read-over-nvme marker come from kernel/src/cap/fat_read_over_nvme_proof.rs. Because the FAT cluster-chain walk issues many single reads per boot, the proof raises the I/O queue depth to 64. Proof: make run-cloud-provider-fat-read-over-nvme. Task record: cloud-prod-fat32-readonly-over-nvme-blockdevice-local-proof.
- Slice 3 (first increment landed 2026-06-03 03:36 UTC; second increment
landed 2026-06-03 04:08 UTC; third increment landed 2026-06-03 05:47 UTC;
fourth increment landed 2026-06-03 08:25 UTC; seeded installable writable
increment landed 2026-06-06 13:38 UTC at
ac0c5e2d; final fixture retirement; CAPOSST1 + empty/seeded co-located CAPOSWF1 + CAPOSRO1 + NVMe-writable CAPOSWF1). The host capnp image tool retired the hand-encoded capnp-layout Python fixtures one layout at a time. The first increment ported the CAPOSST1 persistent-Store image producer from the retired byte-offset script tools/mkstore-image.py to a typed Rust host tool (tools/mkstore-image/, a standalone host crate built on the host target via cargo test-mkstore-image, like tools/mkmanifest/). Later increments added --writable, --readonly-fs, --writable-nvme, and seeded --writable modes for the empty co-located CAPOSST1 + CAPOSWF1 image, CAPOSRO1 read-only filesystem image, fixed-size (NVME_NAMESPACE_BLOCKS = 32768-block / 16 MiB) NVMe-writable CAPOSWF1 namespace image, and installable-system seeded writable variants. The kernel CAPOSST1/CAPOSWF1/CAPOSRO1 layouts (including NVME_NAMESPACE_BLOCKS), the Store/Directory/File contracts, and the disk bytes the kernel reads are all unchanged: the earlier migration proved byte identity against the retired Python outputs, and cargo test-mkstore-image now pins the maintained Rust outputs with golden byte checks. The re-pointed reboot/recovery/read-only proofs stay green reading the tool-produced image. The host-authored FAT image path (tools/mkstorage-fat-read-image.py) stays on real mkfs.fat/mcopy tooling: it is not a hand-rolled capnp byte-offset layout, so it is not a target for the typed capnp image tool. The Python capnp-layout builders have been retired; the Rust tool is the maintained capnp-native fixture path.
- Slice 4 (decomposed; FAT and capnp-native increments landed in part). capnp-native
enhancements: real stat timestamps and store compaction on the managed layouts. The first bounded increment landed – the CAPOSWF1 writable filesystem now persists created/modified timestamps in the node record’s reserved trailing bytes (no field moved, record stays 128 bytes, format version unchanged) and returns them from File.stat, sourced from the WallClock timebase, with the on-disk layout and the forced-poweroff recovery proof held byte-stable (cloud-prod-fs-capnp-native-stat-timestamps-local-proof, proofs make run-storage-writable / make run-storage-writable-recovery). The provenance increment threads the same WallClock source into the writable backer and uses the node-record provenance bytes to carry the ClockProvenance label alongside created/modified; File.stat remains schema-stable and the local proof records the stored labels through the storage smoke log. The FAT increment now surfaces valid FAT directory-entry created/modified values from the host-authored read-only image through the same schema-stable File.stat fields over both virtio-blk and NVMe. The proof logs distinguish metadata_provenance=fat-directory-entry from CAPOSWF1’s WallClock provenance and keep FAT’s timezone-free/two-second-modified-time limits explicit. The second bounded increment landed CAPOSST1 persistent-Store compaction: when a new put would exhaust the entry table or data cursor and tombstones exist, the kernel rewrites live entries through a shadow generation before recommitting the canonical front generation; make run-storage-persist proves pre-compaction write, delete/tombstone, compaction-triggered write, reboot, post-reboot reads, and tombstone absence (storage-caposst1-store-compaction-local-proof). Remaining follow-ups: timestamps and timestamp provenance on the other managed/read-only layouts (CAPOSST1 Store, CAPOSRO1).
- Slice 5 (deferred). ext4-read, only once the explicit trigger (“must read a disk capOS did not format”) materializes.
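
The compaction trigger described under Slice 4 (a put that would exhaust the table while tombstones exist) can be modelled in memory. This is a hedged sketch of the control flow only: the `Store`, `Slot`, and `StoreError` types are hypothetical, the data cursor is not modelled, and nothing here reflects the CAPOSST1 on-disk layout or its crash-consistency ordering.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
struct Slot {
    key: String,
    /// None marks a tombstone: the slot is dead but still occupies a table entry.
    value: Option<Vec<u8>>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum StoreError {
    Full,
}

/// In-memory model of a fixed-capacity entry table with tombstones.
pub struct Store {
    slots: Vec<Slot>,
    capacity: usize,
    generation: u64,
}

impl Store {
    pub fn new(capacity: usize) -> Self {
        Store { slots: Vec::new(), capacity, generation: 0 }
    }

    pub fn generation(&self) -> u64 {
        self.generation
    }

    pub fn tombstone_count(&self) -> usize {
        self.slots.iter().filter(|s| s.value.is_none()).count()
    }

    pub fn get(&self, key: &str) -> Option<&[u8]> {
        self.slots
            .iter()
            .find(|s| s.key == key && s.value.is_some())
            .and_then(|s| s.value.as_deref())
    }

    /// Mark the live slot for `key` as a tombstone. Returns whether one existed.
    pub fn delete(&mut self, key: &str) -> bool {
        match self.slots.iter_mut().find(|s| s.key == key && s.value.is_some()) {
            Some(slot) => {
                slot.value = None;
                true
            }
            None => false,
        }
    }

    /// Copy live entries into a shadow table, then make it the front table.
    fn compact(&mut self) {
        let shadow: Vec<Slot> =
            self.slots.iter().filter(|s| s.value.is_some()).cloned().collect();
        self.slots = shadow;
        self.generation += 1;
    }

    pub fn put(&mut self, key: &str, value: &[u8]) -> Result<(), StoreError> {
        // A superseded value becomes a tombstone, so it is reclaimable below.
        self.delete(key);
        // Compact only when the table is exhausted and there is something to reclaim.
        if self.slots.len() == self.capacity && self.tombstone_count() > 0 {
            self.compact();
        }
        if self.slots.len() == self.capacity {
            return Err(StoreError::Full);
        }
        self.slots.push(Slot { key: key.to_string(), value: Some(value.to_vec()) });
        Ok(())
    }
}
```

The sequence the proof exercises maps onto this model as: fill the table, delete one key, then put a new key. The final put succeeds only because compaction reclaimed the tombstoned slot first.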
Relationship to the NVMe Graduation
The NVMe BlockDevice graduation and real-FS work are stacked, not competing:
- The graduation sits below BlockDevice – it moves the NVMe read/write/flush arms into always-built production behind fail-closed runtime probes (cloud-prod-nvme-storage-graduate-readarm-local-proof).
- Real-FS sits above BlockDevice – it adds new CapObject backers (fat_fs.rs) that read through whatever BlockDevice provides.

Slice 1 deliberately reads over virtio-blk and adds a new file, so its
conflict surface is disjoint from the graduation’s edits to the existing storage
modules. Slice 2 is the join point, sequenced after the graduation landed: it
consumes the always-built NVMe read arm (it does not modify it) by adding the
Nvme BlockSource arm to the same fat_fs.rs.
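
The layering can be sketched with a backer that is indifferent to which arm serves its reads. `BlockDevice` and `BlockSource` are names from this document, but the trait signature, the `MemDisk` stand-in used for both arms, and `read_chain` are assumptions made for illustration. The sketch also simplifies FAT geometry to one 512-byte block per cluster with cluster n stored at LBA n. It shows why a cluster-chain walk turns into many single-block reads, the behavior Slice 2 cites when raising the I/O queue depth.

```rust
use std::cell::Cell;

pub const BLOCK_SIZE: usize = 512;
/// FAT32 entries at or above this value mark end of chain.
const FAT32_EOC_MIN: u32 = 0x0FFF_FFF8;
/// Only the low 28 bits of a FAT32 entry are significant.
const FAT32_ENTRY_MASK: u32 = 0x0FFF_FFFF;

#[derive(Debug, PartialEq, Eq)]
pub enum ReadError {
    OutOfRange,
    BrokenChain,
}

/// Assumed shape of the block interface; the real trait may differ.
pub trait BlockDevice {
    fn read_block(&self, lba: u64, buf: &mut [u8; BLOCK_SIZE]) -> Result<(), ReadError>;
}

/// In-memory disk that counts reads. Stands in for both transports here.
pub struct MemDisk {
    blocks: Vec<[u8; BLOCK_SIZE]>,
    reads: Cell<usize>,
}

impl MemDisk {
    /// Block i is filled with the byte value i (truncated to u8).
    pub fn with_pattern(count: usize) -> Self {
        let blocks = (0..count).map(|i| [i as u8; BLOCK_SIZE]).collect();
        MemDisk { blocks, reads: Cell::new(0) }
    }
}

impl BlockDevice for MemDisk {
    fn read_block(&self, lba: u64, buf: &mut [u8; BLOCK_SIZE]) -> Result<(), ReadError> {
        let block = self.blocks.get(lba as usize).ok_or(ReadError::OutOfRange)?;
        self.reads.set(self.reads.get() + 1);
        *buf = *block;
        Ok(())
    }
}

/// Two arms, one contract: the backer above never inspects which one it has.
pub enum BlockSource {
    Virtio(MemDisk),
    Nvme(MemDisk),
}

impl BlockSource {
    fn device(&self) -> &dyn BlockDevice {
        match self {
            BlockSource::Virtio(d) => d,
            BlockSource::Nvme(d) => d,
        }
    }

    pub fn reads(&self) -> usize {
        match self {
            BlockSource::Virtio(d) | BlockSource::Nvme(d) => d.reads.get(),
        }
    }
}

/// Follow a FAT32 cluster chain, issuing one block read per cluster.
pub fn read_chain(
    source: &BlockSource,
    fat: &[u32],
    first_cluster: u32,
) -> Result<Vec<u8>, ReadError> {
    let dev = source.device();
    let mut out = Vec::new();
    let mut cluster = first_cluster & FAT32_ENTRY_MASK;
    let mut steps = 0usize;
    while cluster < FAT32_EOC_MIN {
        // Clusters 0 and 1 are reserved; a chain longer than the FAT is a cycle.
        steps += 1;
        if cluster < 2 || steps > fat.len() {
            return Err(ReadError::BrokenChain);
        }
        let mut buf = [0u8; BLOCK_SIZE];
        dev.read_block(u64::from(cluster), &mut buf)?;
        out.extend_from_slice(&buf);
        cluster = *fat.get(cluster as usize).ok_or(ReadError::BrokenChain)? & FAT32_ENTRY_MASK;
    }
    Ok(out)
}
```

Swapping `BlockSource::Virtio` for `BlockSource::Nvme` changes nothing in `read_chain`, which mirrors how Slice 2 adds an arm to fat_fs.rs without touching the read arm it consumes.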
Design Grounding
- kernel/src/cap/readonly_fs.rs – the read-only Directory/File over BlockSource pattern Slice 1 mirrors, including the fail-closed mutation arm.
- kernel/src/cap/writable_fs.rs, kernel/src/cap/persistent_store.rs – the capnp-native managed layouts (CAPOSWF1/CAPOSST1) the decision evolves rather than replaces.
- schema/capos.capnp – the Directory/File/Store contract the format backers serve.
- docs/backlog/hardware-boot-storage.md – the storage track and the FAT32 ESP/GPT boot-disk facts that collapse the ext4 argument.