# Proposal: POSIX fork/execve Full-fd-table Inheritance

Make the capOS POSIX `fork`+`execve` recording shim inherit the parent's **full
live fd table by default**, honoring close-on-exec, so unmodified POSIX software
(the dash port is the headline case) gets working stdin/stdout/stderr and an
inherited cwd in its children without the application explicitly `dup2`-ing every
descriptor. This reverses the v0 explicit-grant-only default, which is the
inverse of real POSIX semantics, while keeping the capability model's
no-ambient-authority guarantee.

## Why this is needed

capOS has no real `fork` (no address-space copy, no shared open-file
descriptions). `fork`+`execve` is emulated by a *recording shim*
(`libcapos-posix/src/process.rs`): `fork()` opens a recording window and returns
0; `dup2`/`close` between `fork` and `execve` are recorded as deferred fd
actions; `execve()` replays them against a virtual child fd-view and forwards the
resulting fds as `CapGrant`s through `ProcessSpawner.spawn`. The child
reconstructs its fd table from the named `stdio_<N>` grants
(`libcapos-posix/src/fd.rs` `inherit_stdio_grants`).
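
The recording window can be pictured as a small state machine. The sketch below is illustrative only: `Recorder`, its fields, and the `bool`/`Option` return conventions are hypothetical, not the contents of `process.rs`, and it uses `std` for brevity.

```rust
/// A deferred fd action recorded between `fork()` and `execve()`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum FdAction {
    Dup2 { src: i32, dst: i32 },
    Close { fd: i32 },
}

/// Hypothetical model of the recording window; not the real shim state.
#[derive(Default)]
struct Recorder {
    window_open: bool,
    actions: Vec<FdAction>,
}

impl Recorder {
    /// `fork()`: open the window and return 0 so the caller runs its child branch.
    fn fork(&mut self) -> i32 {
        self.window_open = true;
        self.actions.clear();
        0
    }

    /// `dup2()`: returns true if the call was deferred into the window.
    fn dup2(&mut self, src: i32, dst: i32) -> bool {
        if self.window_open {
            self.actions.push(FdAction::Dup2 { src, dst });
        }
        self.window_open
    }

    /// `close()`: returns true if the call was deferred into the window.
    fn close(&mut self, fd: i32) -> bool {
        if self.window_open {
            self.actions.push(FdAction::Close { fd });
        }
        self.window_open
    }

    /// `execve()`: close the window and hand the recorded actions to the replay
    /// step. Returns `None` when no `fork()` preceded the call.
    fn execve(&mut self) -> Option<Vec<FdAction>> {
        if !self.window_open {
            return None;
        }
        self.window_open = false;
        Some(std::mem::take(&mut self.actions))
    }
}
```

In this model a foreground command that calls `fork()` and then `execve()` with nothing in between yields an empty action list, which is the case the rest of this proposal is about.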

The v0 contract is **explicit-grant-only**: in `spawn_path_with_actions`, only
fd slots a recorded `dup2`/`close` *touched* become grants; untouched live slots
are deliberately not inherited (the `touched` array gate). This is the inverse of
POSIX, where a child inherits the parent's **entire** fd table across
`fork`+`execve` -- every descriptor not marked `O_CLOEXEC`/`FD_CLOEXEC` -- sharing
the underlying open file descriptions.

The consequence is decisive for arbitrary POSIX software. Vanilla dash compiled
`JOBS=0` does **not** `dup2` stdio for a foreground external command -- only the
`FORK_BG` path in `vendor/dash/src/jobs.c` (`forkchild`) manipulates fds. So
`dash -> ls-shim` replays an empty action list and hands the child an **empty
CapSet**: no stdout to print to, no `Directory` to list. This is not a dash bug;
it is correct POSIX behavior (the child is expected to inherit dash's stdio). The
v0 shim's inverted default breaks every POSIX program that relies on inheritance,
which is essentially all of them.

The project directive is explicit: do **not** solve this with per-app dash
patches ([posix-p1-4-dash-shell-smoke](../tasks/done/2026-05-27/posix-p1-4-dash-shell-smoke.md)).
A fd-inheritance fix that must be re-applied to every POSIX program is not POSIX
compatibility. The correct fix is to make the recording shim inherit the full fd
table by default, like real POSIX, reconciled with the capability model.

## Current state vs target

| Aspect | Realized (done 2026-05-27) | Notes |
|---|---|---|
| Inheritance default | full-table: every open slot forwards unless `FD_CLOEXEC` or a non-forwardable backing | `spawn_path_with_actions` walks every open parent slot; recorded `dup2`/`close` are edits on the baseline |
| `FD_CLOEXEC` | enforced: an implicitly-inherited CLOEXEC slot is dropped at `execve` forward time; `open(O_CLOEXEC)` sets the byte | an explicit recorded `dup2` keeps its child slot (POSIX `dup2` clears close-on-exec) |
| Terminal stdout | non-destructive: the recording shim forwards `TerminalSession` via `SpawnGrantMode::Raw` (`process.rs` `Terminal` arm) over the `Copy`/`SameSession` bootstrap cap | parent keeps its terminal across the spawn (proof `make run-posix-fd-inherit-default`); kernel mint proven by `make run-posix-terminal-forward` |
| Writable File/Directory | `NonTransferable` -> kernel rejects grant -> whole-spawn `ENOEXEC` | documented divergence (single-writer policy). v0 POSIX `open` mints only `Copy`/`SameSession` RAM/read-only caps, so none enters the fd table; a future writable-fs `open` path needs a pre-spawn transferability check to skip it non-fatally (follow-up) |
| Directory fd (`open("/")`) | `open(dir, O_RDONLY)` installs a `FdBacking::Directory` fd; a forwardable dir fd is also available via `dirfd(opendir())` (inherits by default under full-table) | landed via `posix-open-directory-fd` (§5); a non-`O_RDONLY` directory open stays `EISDIR` |

## Target design

### 1. Full-fd-table inheritance default

`execve()` should forward the parent's **entire live fd table** to the child,
not only touched slots. The recording shim already builds a virtual `child_view`
seeded from every open parent slot (`spawn_path_with_actions`); the change is to
remove the `touched`-only gate so the forward list is built from every
`child_view[slot] == Some(parent_slot)` entry, then apply the recorded
`dup2`/`close` actions as edits on top of that baseline. The replay order is:

1. Seed `child_view[k] = Some(k)` for every open parent slot `k` (already done).
2. Apply recorded `Dup2(src, dst)` / `Close(fd)` actions in order (already done).
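3. **New:** skip any implicitly inherited slot whose parent fd carries
   `FD_CLOEXEC` (see §2); a child slot created by a recorded `dup2` is kept,
   because POSIX `dup2` clears close-on-exec on the new descriptor.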
4. Build a forward for every remaining `child_view[child_slot] == Some(parent)`
   entry -- not only `touched` ones.

This makes the `dash -> child` case work: dash's open stdio fds (0/1/2) flow to
the child by default, exactly as POSIX requires, with no `dup2` from dash.
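
The replay order can be sketched as follows. This is a minimal model under stated assumptions, not the shim's code: `MAX_FDS`, `ParentSlot`, and `build_forwards` are hypothetical names, the table size is arbitrary, and the sketch uses `std::vec::Vec`. It treats a slot created by a recorded `dup2` as exempt from the close-on-exec drop, following the note in the state table above.

```rust
/// Hypothetical table size; the real limit is not stated in this proposal.
const MAX_FDS: usize = 8;

/// A recorded fd action between `fork()` and `execve()`.
#[derive(Clone, Copy, Debug)]
enum FdAction {
    Dup2 { src: usize, dst: usize },
    Close { fd: usize },
}

/// Illustrative stand-in for one parent fd slot.
#[derive(Clone, Copy, Default)]
struct ParentSlot {
    open: bool,
    cloexec: bool,
}

/// Returns `(child_slot, parent_slot)` pairs to forward, following steps 1-4.
fn build_forwards(parent: &[ParentSlot; MAX_FDS], actions: &[FdAction]) -> Vec<(usize, usize)> {
    // Step 1: seed the child view from every open parent slot.
    let mut child_view: [Option<usize>; MAX_FDS] = [None; MAX_FDS];
    // Child slots still holding their implicit (not dup2-created) mapping.
    let mut implicit = [false; MAX_FDS];
    for k in 0..MAX_FDS {
        if parent[k].open {
            child_view[k] = Some(k);
            implicit[k] = true;
        }
    }

    // Step 2: apply the recorded actions in order, as edits on the baseline.
    for action in actions {
        match *action {
            FdAction::Dup2 { src, dst } => {
                if src < MAX_FDS && dst < MAX_FDS && src != dst {
                    if let Some(p) = child_view[src] {
                        child_view[dst] = Some(p);
                        implicit[dst] = false; // explicit dup2 target
                    }
                }
            }
            FdAction::Close { fd } => {
                if fd < MAX_FDS {
                    child_view[fd] = None;
                    implicit[fd] = false;
                }
            }
        }
    }

    // Steps 3 and 4: drop implicit CLOEXEC slots, forward everything else.
    let mut forwards = Vec::new();
    for child_slot in 0..MAX_FDS {
        if let Some(parent_slot) = child_view[child_slot] {
            if implicit[child_slot] && parent[parent_slot].cloexec {
                continue;
            }
            forwards.push((child_slot, parent_slot));
        }
    }
    forwards
}
```

With an empty action list and fds 0/1/2 open, the result is `[(0, 0), (1, 1), (2, 2)]`, which is the dash foreground-command case.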

A subtlety the v0 forward list already half-handles: the
one-parent-slot-per-forward rule. Under full inheritance multiple child slots can
legitimately map to the same parent fd (e.g. after a recorded `dup2(1, 2)`, child
slots 1 and 2 both refer to the parent's fd 1). For non-destructive (Copy/Raw)
backings this is fine -- the parent keeps its cap and each child slot gets an
independent Copy. For
destructive (Move) backings (`Pipe`), the existing unique-owner / one-forward
rule must hold: a single Move'd Pipe end cannot legitimately appear under two
child slot names. The forward builder must therefore Copy-share where the backing
permits and reject only the genuine Move-aliasing case, rather than the v0 blanket
"one parent slot per forward for every backing type" rule. This is the main
behavioral subtlety to get right and test.
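
A minimal sketch of that aliasing rule, assuming a forward list of `(child_slot, parent_slot)` pairs. `Backing`, `Transfer`, `AliasError`, and `plan_grants` are hypothetical names for illustration; the real `FdBacking` variants and grant modes live in `fd.rs` and the schema and may differ.

```rust
use std::collections::HashSet;

/// Illustrative subset of fd backings.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Backing {
    Console,
    Terminal,
    Directory,
    Pipe,
}

/// How a backing is handed to the child in this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Transfer {
    NonDestructive, // Copy/Raw: the parent keeps its cap
    Move,           // destructive: exactly one owner
}

/// Error for a Move'd parent slot that appears under two child slots.
#[derive(Debug, PartialEq, Eq)]
struct AliasError {
    parent_slot: usize,
}

fn transfer_for(backing: Backing) -> Transfer {
    match backing {
        Backing::Pipe => Transfer::Move,
        Backing::Console | Backing::Terminal | Backing::Directory => Transfer::NonDestructive,
    }
}

/// Copy-share where the backing permits; reject only genuine Move aliasing.
fn plan_grants(
    forwards: &[(usize, usize)],
    backing_of: impl Fn(usize) -> Backing,
) -> Result<Vec<(usize, usize, Transfer)>, AliasError> {
    let mut moved = HashSet::new();
    let mut plan = Vec::with_capacity(forwards.len());
    for &(child_slot, parent_slot) in forwards {
        let transfer = transfer_for(backing_of(parent_slot));
        // `insert` returns false if this parent slot was already Move'd once.
        if transfer == Transfer::Move && !moved.insert(parent_slot) {
            return Err(AliasError { parent_slot });
        }
        plan.push((child_slot, parent_slot, transfer));
    }
    Ok(plan)
}
```

Two child slots that share a terminal parent fd both receive a non-destructive grant, while the same pattern over a `Pipe` end returns `AliasError`.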

### 2. close-on-exec enforcement

`FD_CLOEXEC` is currently stored per-fd (`fd.rs` `FD_FLAGS`) but never acted on,
because the v0 explicit-grant model has no full-table walk to enforce it against.
Under full inheritance there is now a walk: at `execve` forward-build time, a
parent slot whose `FD_FLAGS` byte has `FD_CLOEXEC` set is **not** forwarded
(equivalent to the recorded-`Close` path for that child slot). This needs a small
read API on the fd module (e.g. `fd::is_cloexec(slot)`); the `FD_FLAGS` array
already exists. `O_CLOEXEC` passed to `open()` must set the same byte at open
time so the two surfaces agree. This is the POSIX-correctness half: inherit-all
without CLOEXEC enforcement would leak descriptors a correct program expects
closed (e.g. a listening socket dash opened for itself).
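
One possible shape for the read API and the open-time write, as a hedged sketch. `FdFlags`, `on_open`, and `set_fd_flags` are hypothetical; the numeric values of `FD_CLOEXEC` and `O_CLOEXEC` are Linux's and are assumed here only for illustration, and the real `FD_FLAGS` storage in `fd.rs` may be laid out differently.

```rust
/// Hypothetical table size for this sketch.
const MAX_FDS: usize = 16;
/// Close-on-exec bit in the per-fd flag byte (1 on Linux; assumed here).
const FD_CLOEXEC: u8 = 1;
/// `open()` flag requesting close-on-exec (Linux's value; assumed here).
const O_CLOEXEC: i32 = 0o2000000;

/// Illustrative stand-in for the `FD_FLAGS` array.
struct FdFlags {
    bytes: [u8; MAX_FDS],
}

impl FdFlags {
    fn new() -> Self {
        FdFlags { bytes: [0; MAX_FDS] }
    }

    /// Called when `open()` installs `slot`: mirror `O_CLOEXEC` into the byte
    /// so the `open` and `fcntl` surfaces agree.
    fn on_open(&mut self, slot: usize, open_flags: i32) {
        if slot < MAX_FDS {
            self.bytes[slot] = if (open_flags & O_CLOEXEC) != 0 { FD_CLOEXEC } else { 0 };
        }
    }

    /// `fcntl(fd, F_SETFD, flags)` equivalent; only the CLOEXEC bit is kept.
    fn set_fd_flags(&mut self, slot: usize, flags: u8) {
        if slot < MAX_FDS {
            self.bytes[slot] = flags & FD_CLOEXEC;
        }
    }

    /// Read API used at forward-build time; out-of-range slots report false.
    fn is_cloexec(&self, slot: usize) -> bool {
        slot < MAX_FDS && (self.bytes[slot] & FD_CLOEXEC) != 0
    }
}
```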

### 3. The TerminalSession-stdout problem (core decision)

Real POSIX dup-inherits the controlling terminal to **all** children
non-destructively: a shell keeps its tty while every child writes to the same
tty. The kernel precursor for this is now landed: the bootstrap `TerminalSession`
cap is minted `Copy`/`SameSession` (`boot_cap_hold`, `kernel/src/cap/mod.rs`) and
forwards non-destructively via `SpawnGrantMode::Raw`, proven by
`make run-posix-terminal-forward` (a parent forwards its terminal to a child and
both write distinct lines; the parent's post-spawn write proves it kept its cap).
The POSIX-side gap is closed as well: the recording shim's `Terminal` arm
(`process.rs`) formerly forwarded a Terminal fd via destructive `Move`, which
stripped the parent's terminal whenever fd 1 was a `TerminalSession`; it now
forwards via Raw under `posix-recording-shim-full-fd-inherit` (proof
`make run-posix-fd-inherit-default`).

**Decision (kernel mint landed): mint `TerminalSession` `Copy`/`SameSession`,
matching `Console`, so it forwards via `SpawnGrantMode::Raw` non-destructively.**
This is safe because
`TerminalSessionCap` (`kernel/src/cap/terminal_session.rs`) is a **stateless unit
struct** -- it carries no per-session ownership state; `write`/`writeLine`
dispatch onto the shared kernel terminal, and `readLine` resolves caller context
at call time (`call_with_context`). The `Move`/`ServiceRegrantOnly` choice was a
policy default, not a state-ownership requirement. Minting it `Copy`/`SameSession`
lets the parent keep its terminal cap while each child receives an independent
Copy to the same shared terminal -- which is exactly the POSIX
all-children-share-the-tty semantics, realized through the capability model
rather than against it.

Security/scope: `Copy`/`SameSession` keeps the cap from escaping the session (the
same scope `Console` already uses); a child gains no authority the parent did not
already hold (a write/read view of the same terminal it was already attached to).
`requires_live_caller_session` stays true, so the child's `readLine` still
resolves against the child's own live session context. This must be confirmed in
the kernel slice's security review, including that a forwarded terminal cap
cannot outlive the session improperly and that line-discipline interleaving of
two writers (parent + child) is acceptable for the research surface (it is: the
shared kernel terminal already serializes writes; cooked-mode interleaving at
sub-line granularity is a known, documented research-surface limitation, not a
capability leak).

Alternative considered and rejected: a separate narrower `TerminalWrite`
write-only cap (interface-is-the-permission). This is cleaner long-term but
introduces a new interface, a new bootstrap source, a new `FdBacking` variant,
and child-side adoption -- disproportionate for v0 when the existing
`TerminalSession` write surface is already the right shape and can be shared by a
mint-mode change alone. Recorded as future work if a write-only child terminal
view is later wanted.

### 4. Writable File/Directory single-writer tension

Real POSIX shares writable fds across fork (parent and child write to the same
open description). capOS's disk-backed *writable* filesystem enforces a
fail-closed single-writer policy: writable `File`/`Directory` caps are minted
`NonTransferable` (`writable_fs::transfer_result_cap`), so the kernel rejects the
spawn grant and `execve` surfaces `ENOEXEC`.

**Decision: keep writable File/Directory `NonTransferable`; document the
divergence.** Under full inheritance this means a child does **not** inherit a
parent's writable disk fd -- `execve` must treat a `NonTransferable` backing as a
non-fatal *skip* (drop that one fd from the child, like CLOEXEC) rather than a
fatal `ENOEXEC` for the whole spawn. The v0 path made it fatal because the fd was
explicitly `dup2`'d (the app asked for it); under full inheritance the fd is
inherited implicitly, so failing the entire spawn because one incidental writable
fd cannot transfer would break unrelated programs. The honest divergence: capOS
shares the *read* path of the filesystem across fork (read-only caps are
`Copy`/`SameSession`) but not the *write* path, because the single-writer policy
is a deliberate capOS guarantee that has no POSIX analog. RAM scratch
`Directory`/`File` (the `kernel:directory`/`kernel:file` sources) are
`Copy`/`SameSession` and *do* inherit, matching the common shell-scratch case.

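A future revocation-aware writable share (refcounted or session-scoped) is
possible but out of scope; recorded as a follow-up. v0's target stance is:
writable disk fds are not inheritable, skipped non-fatally, documented. As
landed, a `NonTransferable` writable backing still fails the whole spawn with
`ENOEXEC`; this is latent because the v0 POSIX `open` surface mints no such cap,
and the non-fatal skip waits on the pre-spawn transferability check that a future
writable-fs `open` path needs.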
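
The target skip-versus-fail policy can be sketched as a pre-spawn filter. Everything here is hypothetical (`Candidate`, `Origin`, `select_grants`, the `SpawnError::Enoexec` spelling), and the choice to keep an *explicitly* `dup2`'d `NonTransferable` fd fatal is an assumption read from the v0 rationale above, not a confirmed behavior.

```rust
/// Transfer policy of the cap behind an fd (illustrative subset).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Transfer {
    CopySameSession,
    NonTransferable,
}

/// Whether the child slot comes from the inherited baseline or a recorded `dup2`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Origin {
    Inherited,
    Explicit,
}

#[derive(Debug, PartialEq, Eq)]
enum SpawnError {
    Enoexec,
}

#[derive(Clone, Copy, Debug)]
struct Candidate {
    child_slot: usize,
    transfer: Transfer,
    origin: Origin,
}

/// Returns the child slots to grant. An implicitly inherited NonTransferable
/// fd is dropped (like CLOEXEC); an explicitly requested one fails the spawn
/// (assumption, see lead-in).
fn select_grants(candidates: &[Candidate]) -> Result<Vec<usize>, SpawnError> {
    let mut grants = Vec::new();
    for c in candidates {
        match (c.transfer, c.origin) {
            (Transfer::CopySameSession, _) => grants.push(c.child_slot),
            (Transfer::NonTransferable, Origin::Inherited) => continue,
            (Transfer::NonTransferable, Origin::Explicit) => return Err(SpawnError::Enoexec),
        }
    }
    Ok(grants)
}
```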

### 5. cwd Directory representation and inheritance

A shell's children should be able to list/open the cwd without the app doing
anything special. A forwardable directory fd is obtainable both via
`dirfd(opendir())` and, since `posix-open-directory-fd`, via
`open(dir, O_RDONLY)` (`libcapos-posix/src/file.rs`). Two parts:

- **cwd as an inheritable Directory fd.** Under full inheritance, if the shell
  holds an open `FdBacking::Directory` fd for its cwd, it forwards to the child
  by default (read-only RAM/`readonly_fs` dirs are `Copy`/`SameSession`). The
  child's libc cwd resolution can then target the inherited dir fd. This is the
  primary mechanism and needs no new surface beyond full inheritance.
- **`open(dir, O_RDONLY)` -> Directory fd (landed, `posix-open-directory-fd`).**
  `open` on a directory now installs a `FdBacking::Directory` fd instead of
  failing: `read` returns `EISDIR`, `write` returns `EBADF`, `lseek` returns
  `EISDIR`, and `fdopendir` consumes it. A non-`O_RDONLY` directory open stays
  `EISDIR`; a missing path keeps its original error (`ENOENT`). This covers the
  `N</dir` redirection path (dash redir uses `sh_open` -> `open`) without the
  bespoke `dirfd(opendir())` dance. Proof `make run-posix-open-dir-fd`. It was
  decoupled from the headline path, which never depended on it.
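
The `open`-on-directory contract described in the second bullet can be summarized in a small model. This is a sketch, not `file.rs`: `PathKind`, `DirFdOp`, `open_backing`, and `dir_fd_errno` are hypothetical names, the two-variant `FdBacking` is a simplification, and the flag values are Linux's, assumed for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Errno {
    Enoent,
    Eisdir,
    Ebadf,
}

// Access-mode flag values as on Linux; assumed for illustration.
const O_RDONLY: i32 = 0;
const O_WRONLY: i32 = 1;
const O_ACCMODE: i32 = 3;

/// What the path resolves to (hypothetical helper type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PathKind {
    Missing,
    File,
    Directory,
}

/// Simplified two-variant view of the fd backing.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum FdBacking {
    File,
    Directory,
}

/// `open()` outcome: a read-only directory open installs a Directory fd.
fn open_backing(kind: PathKind, flags: i32) -> Result<FdBacking, Errno> {
    match kind {
        PathKind::Missing => Err(Errno::Enoent),
        PathKind::File => Ok(FdBacking::File),
        PathKind::Directory if (flags & O_ACCMODE) == O_RDONLY => Ok(FdBacking::Directory),
        PathKind::Directory => Err(Errno::Eisdir),
    }
}

/// Byte-oriented operations attempted on a Directory fd.
#[derive(Clone, Copy, Debug)]
enum DirFdOp {
    Read,
    Write,
    Lseek,
}

/// Error returned for each operation on a Directory fd.
fn dir_fd_errno(op: DirFdOp) -> Errno {
    match op {
        DirFdOp::Read | DirFdOp::Lseek => Errno::Eisdir,
        DirFdOp::Write => Errno::Ebadf,
    }
}
```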

### 6. Backward compatibility and re-verification

Changing the default from explicit-grant-only to full-inherit interacts with the
just-landed explicit-grant contract and existing smokes. What must be re-verified
when the behavior slice lands:

- `make run-posix-pipe-smoke` -- relies on explicit pipe-end Move grants. Under
  full inheritance the parent's *other* open fds (e.g. its terminal stdio) would
  now also forward. The pipe child must still see EOF when the parent closes the
  write end, and the parent must not lose its own terminal (fixed by §3). The
  recorded `close(write_end)` still drops that child slot. **Re-verify.**
- `make run-posix-spawn-smoke` -- `posix_spawn` with explicit file actions. The
  file-actions path must still honor explicit `dup2`/`close`; full inheritance is
  the *baseline* the actions edit on top of. **Re-verify.**
- `make run-posix-execve-inherit-smoke` -- the bespoke parent that explicitly
  `dup2`s a Directory/Console. Under full inheritance the explicit `dup2`s become
  redundant (the fds would inherit anyway) but must remain correct. **Re-verify.**
- `make run-posix-stdio-smoke` / `run-posix-stdio-terminal-smoke` -- stdio
  backing selection. **Re-verify.**

The capability-purity argument is unchanged: full-inherit is **not** ambient
authority. The child inherits **exactly** the capabilities in the parent's fd
table (the same caps under the same slots), nothing more. There is no global
namespace, no inherited credential, no kernel-side fd knowledge -- the kernel
still only sees an explicit `List(CapGrant)` from `ProcessSpawner.spawn`. The
shim now *computes* that list from the full table instead of the touched subset;
the kernel's transfer-mode enforcement (`process_spawner.rs`) still gates every
grant. A child can receive only caps the parent already holds and that are
transferable; `NonTransferable` writable caps are skipped, not smuggled.

## Implementation path (decomposed)

The work splits into a kernel cap-mode slice and a libcapos-posix behavior slice,
plus one optional narrow slice; the first two gate the dash shell smoke. See the
task records:

- `posix-terminal-session-forwardable` (behavior, kernel, **done 2026-05-27**) --
  mint `TerminalSession` `Copy`/`SameSession` so it forwards non-destructively via
  `SpawnGrantMode::Raw`. Precursor for the terminal-stdout half of §3. Proven by
  `make run-posix-terminal-forward`.
- `posix-recording-shim-full-fd-inherit` (behavior, libcapos-posix, **done
  2026-05-27**) -- full-table inheritance default (§1), `FD_CLOEXEC` enforcement
  (§2), non-fatal skip of non-forwardable backings (Udp / already-moved / shared
  Pipe) when implicitly inherited (§4), and Copy-share of multi-aliased
  non-destructive backings (§1 subtlety). The recording-shim `Terminal` arm now
  forwards Raw (non-destructive). Proven by `make run-posix-fd-inherit-default`.
  A `NonTransferable` writable backing stays a documented whole-spawn `ENOEXEC`
  boundary; the v0 POSIX `open` surface mints no such cap, so the §4 non-fatal
  skip is realized for the backings that can actually arise.
- `posix-open-directory-fd` (behavior, libcapos-posix, **done**) -- `open(dir,
  O_RDONLY)` -> `FdBacking::Directory` (§5); non-`O_RDONLY` stays `EISDIR`,
  missing path keeps `ENOENT`. Proof `make run-posix-open-dir-fd`. Was off the
  headline critical path.

`posix-p1-4-dash-shell-smoke` (`docs/tasks/`) depends on the first two; with both
landed it can run with no per-app dash patch (only the generic, already-landed
Variant A fork-exec patch set and the slash-bearing `/ls-shim` invocation to skip
dash's PATH `stat`, which is a documented dash-config choice, not a capOS
workaround).

## Per-app patch stance

The directive forbids per-app dash patches that would have to be repeated for
every POSIX program. This design needs **none**: full inheritance is a generic
capOS-side fix in the shim. The only acceptable vendored-dash touch is a *generic
POSIX-correctness* item (the existing Variant A fork-exec coupling under
`vendor/dash/patches/`, owned by `posix-p1-4-dash-vendor`), not a per-app
inheritance workaround. The `EV_EXIT` in-place-exec residual
([posix-p1-4-dash-shell-smoke](../tasks/done/2026-05-27/posix-p1-4-dash-shell-smoke.md))
is the one remaining dash-specific item; it is a recording-shim "exec without
prior fork" limitation, handled in the shell-smoke slice (disable the
optimization or a bounded generic patch), not by this proposal.

## Design grounding

- `libcapos-posix/src/process.rs` (`spawn_path_with_actions`, `fork`, `execve`,
  the recording-shim contract), `libcapos-posix/src/fd.rs` (`FdBacking`,
  `FD_FLAGS`/`FD_CLOEXEC`, `inherit_stdio_grants`),
  `libcapos-posix/src/terminal.rs`, `libcapos-posix/src/directory.rs`,
  `libcapos-posix/src/file.rs`.
- `kernel/src/cap/mod.rs` `boot_cap_hold` (Console and TerminalSession both
  `Copy`/`SameSession` since 2026-05-27),
  `kernel/src/cap/terminal_session.rs` (`TerminalSessionCap` stateless unit
  struct), `kernel/src/cap/process_spawner.rs`
  (`validate_spawn_transfer_scope`, transfer-mode enforcement).
- `schema/capos.capnp` `ProcessSpawner.spawn(... grants :List(CapGrant))`,
  `CapGrant`, `CapGrantMode`.
- `docs/proposals/posix-adapter-proposal.md` (recording-shim Variant A,
  fd-backing-cap inheritance), `docs/capability-model.md`
  (interface-is-the-permission, transfer modes/scopes).
- `docs/tasks/done/2026-05-27/posix-execve-capability-inheritance.md` and
  `docs/tasks/done/2026-05-26/spawn-grant-forwardable-readonly-directory.md`
  (the landed explicit-grant inheritance this proposal generalizes),
  [posix-p1-4-dash-shell-smoke](../tasks/done/2026-05-27/posix-p1-4-dash-shell-smoke.md)
  (the premise conflict this resolves).
