# Real-Filesystem Decision: Role-Split, Not One Format

## Decision

capOS does **not** adopt a single general-purpose on-disk filesystem. It adopts a
**role-split** in which each storage role uses the format that fits it, behind the
same capability interfaces:

- **(A) capOS-managed data and state stays capnp-native.** Evolve the existing
  `CAPOSWF1` writable-filesystem and `CAPOSST1` persistent-store fixed layouts
  (`kernel/src/cap/writable_fs.rs`, `kernel/src/cap/persistent_store.rs`); do
  **not** replace them with a general-purpose format. These already have a
  crash-consistency proof in tree (`make run-storage-writable-recovery`), so a
  format swap would discard a tested durability story for no consumer benefit.
- **(B) Host-populated and interop images gain READ-ONLY FAT32.** Add a
  read-only FAT32 `Directory`/`File` backer over the existing `BlockDevice`,
  using the `fatfs` no_std crate. FAT32 is the one standard interop format with a
  maintained no_std read crate and zero licensing risk (the FAT long-name
  patents have expired; `fatfs` is MIT). It is already structurally part of the
  boot path -- the EFI System Partition Limine reads is FAT32
  (`docs/backlog/hardware-boot-storage.md`).
- **(C) Host tooling consolidates onto one capnp image tool.** Retire the
  per-format `tools/mkstorage-*.py` byte-offset scripts (each hand-encodes a
  fixed layout at literal offsets) in favor of one schema-driven image tool, so
  the on-disk layout has a single typed source of truth instead of N parallel
  offset hazards.
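
The "single typed source of truth" idea in (C) can be illustrated with a minimal sketch. The `ImageHeader` struct, its field names, its 16-byte size, and the little-endian encoding are all assumptions made for illustration; they are not the real `CAPOSST1` layout, and the real tool is schema-driven rather than a hand-written struct. The point is only that the encoder and decoder share one definition, so a golden-byte check can pin the output.

```rust
/// Hypothetical fixed header. Field names, sizes, and endianness are
/// illustrative assumptions, not the real capOS on-disk layout.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct ImageHeader {
    pub magic: [u8; 8],
    pub version: u32,
    pub entry_count: u32,
}

impl ImageHeader {
    /// Encoded size in bytes: 8 (magic) + 4 (version) + 4 (entry_count).
    pub const SIZE: usize = 16;

    /// Serialize the header; every offset lives in this one function.
    pub fn encode(&self) -> [u8; Self::SIZE] {
        let mut out = [0u8; Self::SIZE];
        out[0..8].copy_from_slice(&self.magic);
        out[8..12].copy_from_slice(&self.version.to_le_bytes());
        out[12..16].copy_from_slice(&self.entry_count.to_le_bytes());
        out
    }

    /// Parse a header back; returns `None` if the buffer is too short.
    pub fn decode(bytes: &[u8]) -> Option<Self> {
        if bytes.len() < Self::SIZE {
            return None;
        }
        let mut magic = [0u8; 8];
        magic.copy_from_slice(&bytes[0..8]);
        let version = u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]);
        let entry_count = u32::from_le_bytes([bytes[12], bytes[13], bytes[14], bytes[15]]);
        Some(Self { magic, version, entry_count })
    }
}
```

A host test can then encode a header, compare it against stored golden bytes, and decode it again, which is the kind of check a per-format byte-offset script cannot share with the reader.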

## Why the Capability Layer Is Unchanged

The `Directory`, `File`, and `Store` interfaces in `schema/capos.capnp` are the
contract; the on-disk format lives *below* them as another `CapObject` backer, so
adding FAT32 adds no schema surface and no new caller-visible behavior. The
interfaces already model every operation a format backer must answer:

- `Directory`: `open @0`, `list @1`, `mkdir @2`, `remove @3`, `sub @4`,
  `create @5`, `rename @6` (`schema/capos.capnp:1824`).
- `File`: `read @0`, `write @1`, `stat @2`, `truncate @3`, `sync @4`,
  `close @5` (`schema/capos.capnp:1793`).
- `Store`: `put @0`, `get @1`, `has @2`, `delete @3` (`schema/capos.capnp:1857`).

The in-kernel backers (`readonly_fs.rs`, `writable_fs.rs`, `persistent_store.rs`,
and the RAM `file`/`directory`/`store`/`namespace` caps) are **proof/fixture
surface, not production storage routes** -- they are gated behind the `qemu`
feature (with `storage_fat_read` / `cloud_*_over_nvme_proof` variants) and fail
closed in the default production kernel. Production storage is userspace-served
by the `demos/storage-fs-service`, `demos/storage-persist-service`, and
`demos/store-service` services; see
[Kernel Storage Cap Backers Are Fixtures](storage-and-naming-proposal.md#kernel-storage-cap-backers-are-fixtures).
The role-split above still governs which on-disk format sits beneath the cap
interfaces in those proofs and in any future userspace format backer.

A read-only backer answers the read/list/open/stat methods and **fails closed on
every mutation**, exactly as `readonly_fs.rs` does today
(`kernel/src/cap/readonly_fs.rs:618` rejects `mkdir`/`remove`/`sub`/`create`/
`rename`). Attenuation is structural, not a rights bitmask: a read-only `File` is
a wrapper that rejects `write`/`truncate`/`sync`, per the schema comment at
`schema/capos.capnp:1798`.
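
Structural attenuation can be sketched as a wrapper type. The `File` trait, `FsError`, `RamFile`, and `ReadOnlyFile` below are hypothetical stand-ins written for this example; they are not the generated capnp interfaces or the kernel's `CapObject` types, and `stat`/`close` are omitted for brevity. The sketch only shows the shape: reads are forwarded, every mutation returns an error, and no rights bitmask is consulted.

```rust
/// Hypothetical error type; the real capOS error codes are not modelled.
#[derive(Debug, PartialEq, Eq)]
pub enum FsError {
    ReadOnly,
    OutOfRange,
}

/// Hypothetical stand-in for a subset of the schema's `File` methods.
pub trait File {
    fn read(&self, offset: usize, len: usize) -> Result<Vec<u8>, FsError>;
    fn write(&mut self, offset: usize, data: &[u8]) -> Result<usize, FsError>;
    fn truncate(&mut self, len: usize) -> Result<(), FsError>;
    fn sync(&mut self) -> Result<(), FsError>;
}

/// Plain in-memory file that supports every method.
pub struct RamFile {
    bytes: Vec<u8>,
}

impl RamFile {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self { bytes }
    }
}

impl File for RamFile {
    fn read(&self, offset: usize, len: usize) -> Result<Vec<u8>, FsError> {
        if offset > self.bytes.len() {
            return Err(FsError::OutOfRange);
        }
        // Short reads at end of file are allowed.
        let end = offset.saturating_add(len).min(self.bytes.len());
        Ok(self.bytes[offset..end].to_vec())
    }
    fn write(&mut self, offset: usize, data: &[u8]) -> Result<usize, FsError> {
        let end = offset.checked_add(data.len()).ok_or(FsError::OutOfRange)?;
        if end > self.bytes.len() {
            self.bytes.resize(end, 0);
        }
        self.bytes[offset..end].copy_from_slice(data);
        Ok(data.len())
    }
    fn truncate(&mut self, len: usize) -> Result<(), FsError> {
        self.bytes.resize(len, 0);
        Ok(())
    }
    fn sync(&mut self) -> Result<(), FsError> {
        Ok(())
    }
}

/// Attenuating wrapper: forwards reads, rejects every mutation.
pub struct ReadOnlyFile<F: File> {
    inner: F,
}

impl<F: File> ReadOnlyFile<F> {
    pub fn new(inner: F) -> Self {
        Self { inner }
    }
}

impl<F: File> File for ReadOnlyFile<F> {
    fn read(&self, offset: usize, len: usize) -> Result<Vec<u8>, FsError> {
        self.inner.read(offset, len)
    }
    fn write(&mut self, _offset: usize, _data: &[u8]) -> Result<usize, FsError> {
        Err(FsError::ReadOnly)
    }
    fn truncate(&mut self, _len: usize) -> Result<(), FsError> {
        Err(FsError::ReadOnly)
    }
    fn sync(&mut self) -> Result<(), FsError> {
        Err(FsError::ReadOnly)
    }
}
```

Because the wrapper owns the inner file and exposes no way to unwrap it, a holder of `ReadOnlyFile` cannot reach the mutating methods at all, which is what makes the attenuation structural.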

Known caveat (partially lifted): `stat`/`info` timestamps were originally
**stubbed to zero** in every filesystem backer. The Slice 4 timestamp increments
lift this for two backers. The `CAPOSWF1` writable filesystem now persists real
`created`/`modified` timestamps in the node record, carries the corresponding
`ClockProvenance` label from the same `WallClock` source, and returns the
timestamp values from `File.stat` (proof `make run-storage-writable`). The
read-only FAT32 backer now surfaces valid FAT directory-entry
`created`/`modified` values through the same `File.stat` fields. The read-only
`CAPOSRO1` and `persistent_store` `CAPOSST1` backers still expose zero/unknown
timestamp state; those remain named Slice 4 follow-ups.
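
For context on what a FAT directory-entry timestamp carries, the following is a minimal decoding sketch. It follows the standard FAT packed date/time encoding (years counted from 1980, seconds stored in two-second units, no timezone); the `FatTimestamp` struct and `decode_fat_timestamp` function are hypothetical names for this example and are not the vendored `fatfs` API or the kernel's code.

```rust
/// Calendar fields decoded from a FAT directory entry. FAT stores local
/// time with no timezone, so none is recorded here either.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct FatTimestamp {
    pub year: u16,
    pub month: u8,
    pub day: u8,
    pub hour: u8,
    pub minute: u8,
    pub second: u8,
}

/// Decode the packed 16-bit date and time words.
///
/// Date: bits 15-9 = years since 1980, bits 8-5 = month, bits 4-0 = day.
/// Time: bits 15-11 = hour, bits 10-5 = minute, bits 4-0 = seconds / 2.
/// Returns `None` for out-of-range fields (for example an all-zero date),
/// so callers can treat the timestamp as unknown.
pub fn decode_fat_timestamp(date: u16, time: u16) -> Option<FatTimestamp> {
    let year = 1980 + (date >> 9);
    let month = ((date >> 5) & 0x0F) as u8;
    let day = (date & 0x1F) as u8;
    let hour = (time >> 11) as u8;
    let minute = ((time >> 5) & 0x3F) as u8;
    // Two-second granularity: odd seconds cannot be represented.
    let second = ((time & 0x1F) * 2) as u8;

    let valid = (1..=12).contains(&month)
        && (1..=31).contains(&day)
        && hour < 24
        && minute < 60
        && second < 60;
    if valid {
        Some(FatTimestamp { year, month, day, hour, minute, second })
    } else {
        None
    }
}
```

The validity check matters for a backer that must distinguish a real timestamp from an unset one: a zeroed directory entry decodes to month 0 and day 0, which this sketch reports as `None` rather than as a date in 1980.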

## Why Not ext4 / exFAT / littlefs / FAT-Write

- **ext4-read: deferred under an explicit trigger.** capOS reads no real
  third-party filesystem today and does not need to for boot: Limine reads the
  FAT32 ESP, the kernel image is `include_bytes!` or read from ISO 9660
  (`kernel/src/iso/`), and the cloud boot disk is a capOS-authored GPT + FAT-ESP,
  **never** a provider ext4 root. That collapses the usual "must read the
  provider's ext4 root" argument. ext4-read is deferred behind a single explicit
  trigger: *capOS must read a disk it did not format.* Until that exists, ext4's
  large read-only parser surface buys nothing.
- **ext4-write: rejected.** It would be the first writable real-disk format and
  has no crash-consistency story in tree; landing it without a recovery proof
  regresses the durability bar `CAPOSWF1` already meets.
- **exFAT: rejected.** Patent surface, no role advantage over FAT32 for the
  host-interop slot.
- **littlefs / SimpleFS: rejected.** FFI plus vendoring cost with no winning
  role -- managed state is already served by the capnp-native layouts, and
  host-interop wants a format the host actually writes (FAT32).
- **FAT-write: rejected for now.** No crash-consistency story; it would be the
  first writable format landing without a recovery proof. FAT32 stays
  **read-only** in this decision.

## Decision Matrix

Axes: host-interop fit; no_std read/write implementation cost; crash-consistency
story; capability/capnp fit; cloud-disk-read need today; licensing; available
crates.

| Format | Host-interop | no_std read / write cost | Crash-consistency | capnp fit | Cloud-disk-read need | Licensing | Crates |
|---|---|---|---|---|---|---|---|
| FAT32 (read-only) | High (host writes it; ESP already FAT32) | Read: low (fatfs) / write: out of scope | n/a (read-only) | Backer below `Directory`/`File` | n/a (capOS authors its disks) | Clean (FAT patents expired; fatfs MIT) | `fatfs` no_std |
| exFAT | Medium | High / High | n/a | Same | n/a | Patent surface | None mature no_std |
| ext4-read | Low (no consumer today) | High (large parser) / — | n/a (read-only) | Same | None today (trigger only) | Clean | None mature no_std |
| ext4-write | Low | Very high / very high | None in tree | Same | None | Clean | None mature no_std |
| littlefs / SimpleFS | Low | Medium (FFI+vendor) / medium | Has its own story | Same | None | Clean | FFI/vendor |
| capnp-native (`CAPOSWF1`/`CAPOSST1`) | None (capOS-only) | Already in tree | Proven (`run-storage-writable-recovery`) | Native | n/a | Clean | In tree |

## Phased Plan

- **Slice 0 (this doc).** Record the role-split decision and the matrix.
- **Slice 1 (landed 2026-06-02 20:59 UTC).** Vendored `fatfs` into
  `vendor/fatfs-no_std/` (with `VENDORED_FROM.md`) and added a read-only FAT32 `Directory`/`File` backer
  over virtio-blk: `kernel/src/cap/fat_fs.rs`, a `BlockStorage` adapter over the
  virtio-blk `BlockDevice` driving the vendored `fatfs` read path. Host image
  built with real `mkfs.fat` + `mcopy` (2 files, one multi-cluster). Smoke
  `make run-storage-fat-read` reads the multi-cluster file back through
  `Directory.open` -> `File.read` and asserts the bytes plus the fail-closed
  mutations. **Grant-source realization deviation:** the task text proposed a new
  `fat_fs_root` `KernelCapSource`, but `KernelCapSource` is a `schema/capos.capnp`
  enum (and `capos-config` decode) outside the task's `write_scope`. The backer
  is instead selected under a new `storage_fat_read` kernel feature on the
  existing `read_only_fs_root` source -- mirroring how that source already
  selects its `Virtio` vs NVMe backend -- so it needs no new `KernelCapSource`
  and no schema change, keeping the conflict surface disjoint from the in-flight
  NVMe graduation (which edits `readonly_fs`/`writable_fs`/`persistent_store`).
  Provenance map: [`docs/devices/fat32.md`](../devices/fat32.md). Task record:
  [`cloud-prod-fat32-readonly-over-virtio-blockdevice-local-proof`](../tasks/done/2026-06-02/cloud-prod-fat32-readonly-over-virtio-blockdevice-local-proof.md).
- **Slice 2 (landed 2026-06-03 01:44 UTC).** FAT32 read over the NVMe `BlockDevice` arm. Its
  prerequisite -- the NVMe read-arm graduation
  ([`cloud-prod-nvme-storage-graduate-readarm-local-proof`](../tasks/done/2026-06-02/cloud-prod-nvme-storage-graduate-readarm-local-proof.md))
  -- had landed, so the slice stacks on an always-built read arm rather than a
  per-proof feature: it added an `Nvme` `BlockSource` variant to `fat_fs.rs`
  (deferred mount via `FatMount`, mirroring `readonly_fs`'s NVMe arm) and proves a
  **host-authored** `mkfs.fat` image (the pre-populated NVMe medium content, no
  manager seed) read back over the graduated NVMe read arm behind the unchanged
  `Directory`/`File` cap contract. Selected by a new non-`qemu`
  `cloud_fat_read_over_nvme_proof` feature on the existing `read_only_fs_root`
  source (no new `KernelCapSource`, no schema change); its cap-waiter `Interrupt`
  route and `provider-fat-read-over-nvme` marker come from
  `kernel/src/cap/fat_read_over_nvme_proof.rs`. Because the FAT cluster-chain walk
  issues many single reads per boot, the proof raises the I/O queue depth to 64.
  Proof: `make run-cloud-provider-fat-read-over-nvme`. Task record:
  [`cloud-prod-fat32-readonly-over-nvme-blockdevice-local-proof`](../tasks/done/2026-06-03/cloud-prod-fat32-readonly-over-nvme-blockdevice-local-proof.md).
- **Slice 3 (landed in increments: first 2026-06-03 03:36 UTC; second
  2026-06-03 04:08 UTC; third 2026-06-03 05:47 UTC; fourth 2026-06-03 08:25 UTC;
  seeded installable writable 2026-06-06 13:38 UTC at `ac0c5e2d`; final fixture
  retirement covers `CAPOSST1`, empty/seeded co-located `CAPOSWF1`, `CAPOSRO1`,
  and NVMe-writable `CAPOSWF1`).** The host capnp image tool retired the
  hand-encoded capnp-layout Python fixtures one layout at a time. The first
  increment ported the `CAPOSST1` persistent-`Store` image producer from the
  retired byte-offset script `tools/mkstore-image.py` to a typed Rust host tool
  (`tools/mkstore-image/`, a standalone host crate built on the host target via
  `cargo test-mkstore-image`, like `tools/mkmanifest/`). Later increments added
  `--writable`, `--readonly-fs`, `--writable-nvme`, and seeded `--writable`
  modes for the empty co-located `CAPOSST1`+`CAPOSWF1` image, `CAPOSRO1`
  read-only filesystem image, fixed-size (`NVME_NAMESPACE_BLOCKS` =
  32768-block / 16 MiB) NVMe-writable `CAPOSWF1` namespace image, and
  installable-system seeded writable variants. The kernel
  `CAPOSST1`/`CAPOSWF1`/`CAPOSRO1` layouts (including `NVME_NAMESPACE_BLOCKS`),
  the `Store`/`Directory`/`File` contracts, and the disk bytes the kernel reads
  are all unchanged: the earlier migration proved byte identity against the
  retired Python outputs, and `cargo test-mkstore-image` now pins the maintained
  Rust outputs with golden byte checks. The re-pointed reboot/recovery/read-only
  proofs stay green reading the tool-produced image. The host-authored FAT image
  path (`tools/mkstorage-fat-read-image.py`) stays on real `mkfs.fat`/`mcopy`
  tooling -- it is not a hand-rolled capnp byte-offset layout, so it is **not** a
  target for the typed capnp image tool. The Python capnp-layout builders have
  been retired; the Rust tool is the maintained capnp-native fixture path.
- **Slice 4 (decomposed; FAT and capnp-native increments landed in part).** capnp-native
  enhancements: real `stat` timestamps and store compaction on the managed
  layouts. The first bounded increment landed -- the `CAPOSWF1` writable
  filesystem now persists `created`/`modified` timestamps in the node record's
  reserved trailing bytes (no field moved, record stays 128 bytes, format
  version unchanged) and returns them from `File.stat`, sourced from the
  `WallClock` timebase, with the on-disk layout and the forced-poweroff
  recovery proof held byte-stable
  ([`cloud-prod-fs-capnp-native-stat-timestamps-local-proof`](../tasks/done/2026-06-03/cloud-prod-fs-capnp-native-stat-timestamps-local-proof.md),
  proofs `make run-storage-writable` / `make run-storage-writable-recovery`).
  The provenance increment threads the same `WallClock` source into the
  writable backer and uses the node-record provenance bytes to carry the
  `ClockProvenance` label alongside `created`/`modified`; `File.stat` remains
  schema-stable and the local proof records the stored labels through the
  storage smoke log.
  The FAT increment now surfaces valid FAT directory-entry `created`/`modified`
  values from the host-authored read-only image through the same schema-stable
  `File.stat` fields over both virtio-blk and NVMe. The proof logs distinguish
  `metadata_provenance=fat-directory-entry` from `CAPOSWF1`'s `WallClock`
  provenance and keep FAT's timezone-free/two-second-modified-time limits
  explicit.
  The second bounded increment landed `CAPOSST1` persistent-`Store` compaction:
  when a new `put` would exhaust the entry table or data cursor and tombstones
  exist, the kernel rewrites live entries through a shadow generation before
  recommitting the canonical front generation; `make run-storage-persist` proves
  pre-compaction write, delete/tombstone, compaction-triggered write, reboot,
  post-reboot reads, and tombstone absence
  ([`storage-caposst1-store-compaction-local-proof`](../tasks/done/2026-06-06/storage-caposst1-store-compaction-local-proof.md)).
  Remaining follow-ups: timestamps and timestamp provenance on the other
  managed/read-only layouts (`CAPOSST1` `Store`, `CAPOSRO1`).
- **Slice 5 (deferred).** ext4-read, only once the explicit trigger ("must read
  a disk capOS did not format") materializes.
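
The compaction trigger described under Slice 4 can be sketched in memory. Everything in this block is an assumption made for illustration: the `Store` struct, its capacity fields, the `generation` counter, and the error names are hypothetical, and the real `CAPOSST1` backer works on disk blocks through a shadow generation rather than on `Vec`s. The sketch keeps only the control flow: a `put` that would exhaust the entry table or the data cursor first rewrites live entries into a shadow copy if tombstones exist, then commits the shadow and retries.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum StoreError {
    Full,
    Exists,
}

struct Entry {
    key: String,
    offset: usize,
    len: usize,
    live: bool, // false = tombstone left behind by `delete`
}

/// Hypothetical append-only store with a bounded entry table and data area.
pub struct Store {
    entries: Vec<Entry>,
    data: Vec<u8>,
    max_entries: usize,
    max_data: usize,
    /// Bumped once per committed compaction.
    pub generation: u64,
}

impl Store {
    pub fn new(max_entries: usize, max_data: usize) -> Self {
        Self { entries: Vec::new(), data: Vec::new(), max_entries, max_data, generation: 0 }
    }

    pub fn get(&self, key: &str) -> Option<&[u8]> {
        self.entries
            .iter()
            .find(|e| e.live && e.key == key)
            .map(|e| &self.data[e.offset..e.offset + e.len])
    }

    /// Mark the entry as a tombstone; its bytes stay until compaction.
    pub fn delete(&mut self, key: &str) -> bool {
        match self.entries.iter_mut().find(|e| e.live && e.key == key) {
            Some(e) => {
                e.live = false;
                true
            }
            None => false,
        }
    }

    pub fn tombstones(&self) -> usize {
        self.entries.iter().filter(|e| !e.live).count()
    }

    fn has_room(&self, len: usize) -> bool {
        self.entries.len() < self.max_entries && self.data.len() + len <= self.max_data
    }

    /// Copy live entries into a shadow table and data area, then swap.
    fn compact(&mut self) {
        let mut shadow_entries = Vec::new();
        let mut shadow_data = Vec::new();
        for e in self.entries.iter().filter(|e| e.live) {
            let offset = shadow_data.len();
            shadow_data.extend_from_slice(&self.data[e.offset..e.offset + e.len]);
            shadow_entries.push(Entry { key: e.key.clone(), offset, len: e.len, live: true });
        }
        // "Commit": the shadow becomes the front generation in one step.
        self.entries = shadow_entries;
        self.data = shadow_data;
        self.generation += 1;
    }

    /// Overwrite is not modelled: callers delete an existing key first.
    pub fn put(&mut self, key: &str, value: &[u8]) -> Result<(), StoreError> {
        if self.get(key).is_some() {
            return Err(StoreError::Exists);
        }
        if !self.has_room(value.len()) && self.tombstones() > 0 {
            self.compact();
        }
        if !self.has_room(value.len()) {
            return Err(StoreError::Full);
        }
        let offset = self.data.len();
        self.data.extend_from_slice(value);
        self.entries.push(Entry { key: key.to_string(), offset, len: value.len(), live: true });
        Ok(())
    }
}
```

Compaction runs lazily, only when a `put` would otherwise fail, so a store with no tombstones still reports `Full` instead of rewriting itself for no gain.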

## Relationship to the NVMe Graduation

The NVMe `BlockDevice` graduation and real-FS work are **stacked, not competing**:

- The graduation sits **below** `BlockDevice` -- it moves the NVMe
  read/write/flush arms into always-built production behind fail-closed runtime
  probes (`cloud-prod-nvme-storage-graduate-readarm-local-proof`).
- Real-FS sits **above** `BlockDevice` -- it adds new `CapObject` backers
  (`fat_fs.rs`) that read through whatever `BlockDevice` provides.

Slice 1 deliberately reads over **virtio-blk** and adds a **new file**, so its
conflict surface is disjoint from the graduation's edits to the existing storage
modules. Slice 2 is the join point, sequenced after the graduation landed: it
**consumes** the always-built NVMe read arm (it does not modify it) by adding the
`Nvme` `BlockSource` arm to the same `fat_fs.rs`.
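
The stacking can be sketched with a format reader that is generic over the block layer. The `BlockDevice` trait shape, `RamDisk`, `BlockError`, and `read_chain` below are hypothetical and exist only for this example; the real `BlockDevice` cap and `BlockSource` enum in `fat_fs.rs` are not shown in this document. The reader walks a list of block numbers (standing in for a FAT cluster chain) and issues one single-block read per link, whichever device sits underneath.

```rust
pub const BLOCK_SIZE: usize = 512;

#[derive(Debug, PartialEq, Eq)]
pub enum BlockError {
    OutOfRange,
}

/// Hypothetical block-layer contract; the transport lives below this line.
pub trait BlockDevice {
    fn read_block(&mut self, lba: u64, buf: &mut [u8; BLOCK_SIZE]) -> Result<(), BlockError>;
}

/// RAM-backed device standing in for either virtio-blk or NVMe.
pub struct RamDisk {
    blocks: Vec<[u8; BLOCK_SIZE]>,
    /// Number of single-block reads served so far.
    pub reads: usize,
}

impl RamDisk {
    pub fn new(block_count: usize) -> Self {
        Self { blocks: vec![[0u8; BLOCK_SIZE]; block_count], reads: 0 }
    }

    /// Host-side helper to pre-populate a block with one byte value.
    pub fn fill(&mut self, lba: usize, byte: u8) {
        self.blocks[lba] = [byte; BLOCK_SIZE];
    }
}

impl BlockDevice for RamDisk {
    fn read_block(&mut self, lba: u64, buf: &mut [u8; BLOCK_SIZE]) -> Result<(), BlockError> {
        let index = usize::try_from(lba).map_err(|_| BlockError::OutOfRange)?;
        let block = self.blocks.get(index).ok_or(BlockError::OutOfRange)?;
        buf.copy_from_slice(block);
        self.reads += 1;
        Ok(())
    }
}

/// Format-layer reader above the block layer: follows `chain` in order and
/// returns at most `len` bytes. It never touches the transport directly.
pub fn read_chain<D: BlockDevice>(
    dev: &mut D,
    chain: &[u64],
    len: usize,
) -> Result<Vec<u8>, BlockError> {
    let mut out = Vec::with_capacity(len);
    let mut buf = [0u8; BLOCK_SIZE];
    for &lba in chain {
        if out.len() >= len {
            break;
        }
        dev.read_block(lba, &mut buf)?;
        let take = (len - out.len()).min(BLOCK_SIZE);
        out.extend_from_slice(&buf[..take]);
    }
    Ok(out)
}
```

Adding a second transport in this model means adding another `BlockDevice` implementor; `read_chain` does not change.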

## Design Grounding

- `kernel/src/cap/readonly_fs.rs` -- the read-only `Directory`/`File` over
  `BlockSource` pattern Slice 1 mirrors, including the fail-closed mutation arm.
- `kernel/src/cap/writable_fs.rs`, `kernel/src/cap/persistent_store.rs` -- the
  capnp-native managed layouts (`CAPOSWF1`/`CAPOSST1`) the decision evolves
  rather than replaces.
- `schema/capos.capnp` -- the `Directory`/`File`/`Store` contract the format
  backers serve.
- `docs/backlog/hardware-boot-storage.md` -- the storage track and the FAT32
  ESP/GPT boot-disk facts that collapse the ext4 argument.
