# Proposal: Capability-Based Service Architecture

How capOS processes receive authority, compose into services, and expose
layered capabilities — without a service manager daemon.


## Problem

Traditional OSes grant processes ambient authority (file system, network, IPC
namespaces) and then restrict it via sandboxing (seccomp, namespaces, AppArmor).
Service managers like systemd handle dependencies, lifecycle, and resource
limits through a central daemon with a massive configuration surface.

capOS inverts this: processes start with zero authority and receive only the
capabilities they need. The capability graph implicitly encodes service
dependencies, resource limits, and access control. No central daemon required.

## Process Startup Model

A process receives its entire authority as a set of named capabilities at
spawn time. There is no ambient authority to fall back on — if a capability
wasn't granted, the operation is impossible.

The child process sees its granted capabilities by name. It cannot discover or
request capabilities it wasn't given.
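
The lookup rule can be sketched in a few lines. This is a minimal illustration, not the capOS runtime API: `CapHandle`, `CapError`, and the shape of `CapSet` below are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical opaque handle; a real system would wrap a kernel cap slot.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CapHandle(pub u32);

#[derive(Debug, PartialEq, Eq)]
pub enum CapError {
    /// The name was never granted. There is no fallback lookup.
    NotGranted(String),
}

/// Illustrative stand-in for the set of named caps fixed at spawn time.
pub struct CapSet {
    granted: HashMap<String, CapHandle>,
}

impl CapSet {
    /// Built once by the spawner; the child gets no method to add entries.
    pub fn from_grants(grants: &[(&str, CapHandle)]) -> Self {
        let granted = grants
            .iter()
            .map(|(name, handle)| (name.to_string(), *handle))
            .collect();
        CapSet { granted }
    }

    /// Lookup by name. A missing name is an error, never a request for
    /// more authority.
    pub fn get(&self, name: &str) -> Result<CapHandle, CapError> {
        self.granted
            .get(name)
            .copied()
            .ok_or_else(|| CapError::NotGranted(name.to_string()))
    }
}
```

A child built this way has no API for asking for more: looking up a name that was not granted simply fails.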

## Capability Layering

Each process consumes lower-level capabilities and exports higher-level ones.
Authority narrows at every layer:

```
Kernel
  │
  ├─ Nic cap (raw frame send/receive for one device)
  ├─ Timer cap (monotonic clock)
  ├─ DeviceMmio cap (one device's BAR regions)
  └─ Interrupt cap (one IRQ line)
       │
       v
NIC Driver Process
  │
  └─ Nic cap ──> Network Stack Process
                   │
                   ├─ TcpSocket cap (one connection)
                   ├─ UdpSocket cap (one socket)
                   └─ NetworkManager cap (create sockets)
                        │
                        v
                   HTTP Service Process
                     │
                     ├─ Fetch cap (any URL)
                     │    │
                     │    v
                     │  Trusted Process (holds Fetch, mints scoped caps)
                     │
                     └─ HttpEndpoint cap (one origin)
                          │
                          v
                     Application Process
```

The application at the bottom holds an `HttpEndpoint` cap scoped to a single
origin. It cannot make raw TCP connections, send arbitrary packets, or touch
any device. The capability *is* the security policy.

## HTTP Capabilities

Two levels of HTTP capability: `Fetch` (general) and `HttpEndpoint` (scoped).
`HttpEndpoint` is implemented by a process that holds a `Fetch` cap and
restricts it.

### Fetch

Unrestricted HTTP access — equivalent to the browser Fetch API. The holder
can make requests to any URL. This is the base capability that HTTP service
processes use internally.

```capnp
interface Fetch {
    # General-purpose HTTP request to any URL.
    request @0 (url :Text, method :Text, headers :List(Header), body :Data)
        -> (status :UInt16, headers :List(Header), body :Data);
}

struct Header {
    name @0 :Text;
    value @1 :Text;
}
```

`Fetch` is powerful — granting it is roughly equivalent to granting arbitrary
outbound network access. It should only be held by service processes that need
to make requests on behalf of others, not by application code directly.

### HttpEndpoint

A restricted view of `Fetch`, scoped to a single origin. The holder can only
make requests within the bounds encoded in the capability.

```capnp
interface HttpEndpoint {
    # Request scoped to this endpoint's origin.
    # Path is relative (e.g., "/v1/users").
    request @0 (method :Text, path :Text, headers :List(Header), body :Data)
        -> (status :UInt16, headers :List(Header), body :Data);
}
```

Note: `request()` mirrors `Fetch.request()`, except that it takes an
origin-relative `path` instead of an absolute `url`. The origin is implicit:
it is bound into the capability at mint time.

### Attenuation

A process holding `Fetch` mints `HttpEndpoint` caps by narrowing authority.
The core restriction is always origin — `Fetch` can reach any URL,
`HttpEndpoint` is locked to one host. Additional constraints (path prefixes,
method restrictions, rate limits) are possible but are userspace policy
details, not OS-level concerns.

This is the standard object-capability attenuation pattern: a wrapper holds
the broader capability and exposes less authority. Among `HttpEndpoint` caps
the interface stays the same, so application code is identical whether it
holds a broad or narrow `HttpEndpoint`.
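
A minimal sketch of origin attenuation, assuming simplified signatures (status code only, no headers or body). The `Fetch` trait, `HttpEndpoint::mint`, `EndpointError`, and `StubFetch` are illustrative names for this example, not the generated Cap'n Proto bindings.

```rust
/// Simplified stand-in for the `Fetch` interface: any absolute URL.
pub trait Fetch {
    fn request(&self, url: &str, method: &str) -> u16;
}

/// Wrapper that holds a broad `Fetch` and exposes only one origin.
pub struct HttpEndpoint<F: Fetch> {
    inner: F,
    origin: String, // fixed at mint time, e.g. "https://api.example.com"
}

#[derive(Debug, PartialEq, Eq)]
pub enum EndpointError {
    /// Path must be origin-relative and start with exactly one '/'.
    InvalidPath,
}

impl<F: Fetch> HttpEndpoint<F> {
    pub fn mint(inner: F, origin: &str) -> Self {
        HttpEndpoint {
            inner,
            origin: origin.trim_end_matches('/').to_string(),
        }
    }

    pub fn request(&self, method: &str, path: &str) -> Result<u16, EndpointError> {
        // Reject inputs that could re-target the request: absolute URLs
        // and protocol-relative paths such as "//other.host/".
        if !path.starts_with('/') || path.starts_with("//") {
            return Err(EndpointError::InvalidPath);
        }
        let url = format!("{}{}", self.origin, path);
        Ok(self.inner.request(&url, method))
    }
}

/// Stub backend for the sketch: answers 200 for one known URL, 404 otherwise.
pub struct StubFetch;

impl Fetch for StubFetch {
    fn request(&self, url: &str, _method: &str) -> u16 {
        if url == "https://api.example.com/v1/users" { 200 } else { 404 }
    }
}
```

The holder of the wrapper never sees the inner `Fetch`, so the only URLs it can reach are those under the bound origin. A production implementation would parse and normalize the path rather than rely on prefix checks alone.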

## Boot and Initialization Sequence

The kernel doesn't know about services. It boots, creates a handful of
kernel-provided caps, and spawns exactly one process: `init`. Everything else
is init's responsibility.

### Current State vs Target State

The implementation has crossed the single-init startup milestone and the 15.4
schema split. `SystemManifest` now carries `schemaVersion`, `binaries`,
`initConfig`, and `kernelParams`. The Cap'n Proto schema no longer exposes
`ServiceEntry`, `ServiceCapSource`, `CapRef`, exports, or restart policy as
kernel-consumed fields. Those service-graph concepts remain as Rust parsing
types inside `capos-config` because the focused init executor still interprets
`initConfig.services`.

Each process now also carries an immutable session context produced at spawn
time by `kernel/src/session_context.rs`; default inheritance comes from the
parent's session context, and a broker can select a child session through the
`AuthorityBroker`/`UserSession` path. This invocation context is the basis for
session-scoped audit attribution and identity-policy enforcement; see
[user-identity-and-policy-proposal.md](user-identity-and-policy-proposal.md)
and `make run-session-context` for the one-session-per-process proof.

Current manifests put the first process description at `initConfig.init`.
The default `system.cue` manifest now boots the separate `init` binary with
`BootPackage` and `ProcessSpawner`; that init process reads
`initConfig.services` and starts the shell, remote-session CapSet gateway,
chat server, and resident demo services. Focused shell-led manifests such as
`system-smoke.cue` and `system-shell.cue` still boot `capos-shell` as the lone
init process for narrow login/shell proofs. Focused init-executor manifests
such as `system-spawn.cue`, `system-chat.cue`, and `system-adventure.cue` boot
the same `init` binary, again with `BootPackage` and `ProcessSpawner`, and
resolve the remaining service graph from `initConfig.services` through
`ProcessSpawner`. Other focused single-service or harness manifests still boot
a demo/service binary as the init process for narrow proofs. The kernel
validates only the kernel-owned boot boundary: schema version, binaries,
`kernelParams`, `initConfig.init.binary`, and kernel-sourced
`initConfig.init.caps`.

### Current Bootstrap Ownership Inventory

As of 2026-05-13, the repo is in the schema-split init-owned startup state:

- `schema/capos.capnp` defines `SystemManifest` as `schemaVersion`,
  `binaries`, `initConfig`, and `kernelParams`. Service graph fields are not
  Cap'n Proto schema fields.
- `capos-config/src/manifest.rs` still defines `ServiceEntry`, `CapRef`,
  `CapSource::Kernel`, `CapSource::Service`, and `RestartPolicy` as internal
  Rust types for parsing `initConfig.services`.
- `tools/mkmanifest` still embeds every declared binary into the manifest and
  validates the full init-owned graph before writing `manifest.bin`.
- `capos-config/src/validation.rs` separates kernel bootstrap validation from
  init graph validation. Kernel bootstrap validation covers binary names,
  `initConfig.init.binary`, init kernel cap sources, and `kernelParams`.
  Full graph validation covers `initConfig.services` for mkmanifest and init's
  metadata-only `ManifestBootstrapPlan` path.
- `kernel/src/main.rs::run_init` reads the Limine manifest module, validates
  the kernel-owned bootstrap contract, configures serial policy from
  `kernelParams`, and loads only `initConfig.init.binary`.
- `kernel/src/cap/mod.rs::create_boot_service_caps` builds only
  `initConfig.init.caps`. Those caps are kernel-sourced by type, so the kernel
  has no `CapSource::Service` branch.
- The init cap bundle is currently described by `initConfig.init.caps`. In the
  default `system.cue` manifest this grants the separate `init` binary the
  bootstrap caps it needs to read BootPackage and spawn the service graph. In
  focused shell-led manifests such as `system-smoke.cue`, this still grants
  `capos-shell` terminal, credential, session, audit, and broker capabilities
  directly. In focused single-service or harness manifests,
  `initConfig.init.caps` grants only the capabilities the harness itself needs.
- `BootPackage` exposes the full serialized manifest bytes to init.
  That path is live for default and focused init-executor manifests. Focused
  shell-led manifests do not grant `BootPackage` to `capos-shell`.
- `ProcessSpawner` owns the embedded binary set. It receives the boot manifest
  bytes so delegated `ProcessSpawner` grants can preserve that same boot
  package context; child `BootPackage` caps are not minted from
  `SpawnGrantSource::Kernel`. `ProcessSpawner.createPipe(bufferBytes)` mints a
  bounded SPSC kernel `Pipe` capability used by the POSIX adapter Phase P1.3
  recording-shim fork-for-exec path; see
  [posix-adapter-proposal.md](posix-adapter-proposal.md) §Phase P1.3 and
  [userspace-binaries-proposal.md](userspace-binaries-proposal.md) Part 4.
- `ProcessSpawner.spawn` resolves `SpawnGrantSource::Kernel` for the bounded
  manager-issued DDF authority surfaces (`DeviceMmio`, `DMAPool`,
  `Interrupt`, `HardwareAuditLog`) through the matching grant-source records
  in `kernel/src/cap/devicemmio_grant_source.rs`,
  `kernel/src/cap/dmapool_grant_source.rs`, and their interrupt/audit peers.
  Each grant attaches a fresh manager-owned record, validates owner/quiesce/
  scrub state for DMA-side caps, and returns a child-local handle without
  sharing the parent's owner object. See
  [device-driver-foundation.md Task 5](../backlog/hardware-boot-storage.md#task-5-userspace-dmapool-devicemmio-and-interrupt-authority-cap-surface)
  for the bounded-authority scope and the focused `make run-devicemmio-grant`,
  `make run-dmapool-grant`, `make run-interrupt-grant`, and
  `make run-hardware-audit` smokes.
- `init/src/main.rs` is the focused BootPackage executor. When that binary is
  the init process, it reads the BootPackage manifest, builds a
  `ManifestBootstrapPlan`, validates it again, discovers its own kernel grants
  from `initConfig.init.caps` plus the CapSet, preflights the
  `initConfig.services` graph, resolves kernel and service cap sources,
  records exports, spawns children through `ProcessSpawner`, and waits on
  their `ProcessHandle`s.
- `system.cue`, `system-smoke.cue`, `system-spawn.cue`, `system-chat.cue`,
  `system-adventure.cue`, and the other focused manifests now express their
  first-process bundle under `initConfig.init` and any child topology under
  `initConfig.services`.

The practical cleanup boundary is therefore not "move service startup to init";
that already happened. The narrower goal has also landed: the kernel no longer
understands the service graph as a bootstrap authority structure. What remains
is to stop letting focused harnesses choose arbitrary init binaries and direct
kernel cap bundles, then move to one fixed generic-init ABI.

### Narrowed Transitional Contract

The current schema is `schemaVersion`, `binaries`, `initConfig`, and
`kernelParams`. The narrowed kernel contract is:

- The kernel validates `schemaVersion`, parses `kernelParams` for
  kernel-consumed boot policy, and configures serial policy.
- The kernel resolves only `initConfig.init.binary` against `binaries` and loads
  only that ELF.
- The kernel may interpret `initConfig.init.caps` only as the bootstrap cap bundle
  for the single first process. Those caps must be kernel-sourced; a
  service-sourced cap in `initConfig.init.caps` is invalid because no non-init
  service exists at kernel handoff time.
- `initConfig.services[*]`, their `caps`, `exports`, `restart`, and any
  `CapSource::Service` references are init-owned configuration while the
  transitional Rust parser exists. `mkmanifest` and init continue validating
  them for smoke coverage, but kernel bootstrap does not run the multi-service
  graph validator or a service export resolver.
- Focused harness manifests that intentionally boot a demo/service binary as
  init stay valid during this slice. Their harness-specific caps are still
  described by `initConfig.init.caps` until those smokes are migrated behind a
  generic init-owned executor config.

Kernel bootstrap implements this contract with a first-service cap-table
builder. That builder covers only implemented kernel sources used by current
`initConfig.init.caps` lists.
That current first-service surface is wider than the eventual generic-init
minimum: the default init-owned path needs Console, TerminalSession,
CredentialStore, SessionManager, AuditLog, AuthorityBroker, BootPackage,
ProcessSpawner, listener, launcher, and chat endpoint authority so it can
launch the current service graph; focused shell-led paths still need
TerminalSession, CredentialStore, SessionManager, AuditLog, and
AuthorityBroker directly; focused harnesses need their own direct kernel caps.
Cross-service export lookup, service-source attenuation, and non-init
cap-resolution policy stay in `init/src/main.rs` for the focused
BootPackage-executor manifests.
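
The kernel-side half of this contract can be sketched as a small validator. This is an illustration under stated assumptions: `InitSpec`, `BootstrapError`, and `validate_kernel_bootstrap` are hypothetical names, the `CapSource` shape is simplified, and the embedded-`init` special case described under "Init Binary Embedding" is ignored.

```rust
/// Simplified cap source: the text names `CapSource::Kernel` and
/// `CapSource::Service`; the payload fields here are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CapSource {
    Kernel(String),
    Service { service: String, export: String },
}

/// Hypothetical view of `initConfig.init`.
pub struct InitSpec {
    pub binary: String,
    pub caps: Vec<(String, CapSource)>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BootstrapError {
    UnknownInitBinary(String),
    /// No non-init service exists at kernel handoff, so this cannot resolve.
    ServiceSourcedInitCap(String),
}

/// Checks only the kernel-owned boundary; `initConfig.services` is not
/// inspected here at all.
pub fn validate_kernel_bootstrap(
    binaries: &[&str],
    init: &InitSpec,
) -> Result<(), BootstrapError> {
    if !binaries.contains(&init.binary.as_str()) {
        return Err(BootstrapError::UnknownInitBinary(init.binary.clone()));
    }
    for (name, source) in &init.caps {
        if let CapSource::Service { .. } = source {
            return Err(BootstrapError::ServiceSourcedInitCap(name.clone()));
        }
    }
    Ok(())
}
```

The point of the split is visible in the signature: the function takes only the binary names and the init spec, so nothing about the service graph can influence kernel bootstrap.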

### Target Boot Package Contract

After the harness migration, `SystemManifest` should keep the same outer shape
but `initConfig.init` should stop being a per-manifest kernel bootstrap bundle.
At that point:

- `ServiceEntry`, `CapRef`, `CapSource::Service`, service exports, and restart
  policy remain ordinary data inside `initConfig`, interpreted and validated by
  init or a supervisor service.
- Kernel validation is limited to the schema version, kernel parameters,
  boot-package integrity/measurement policy, and enough binary metadata to load
  the one init image.
- The first process is the generic init/supervisor, not a demo harness or
  shell. Shell-led and focused single-service proofs should become init-owned
  configurations rather than alternate kernel bootstrap contracts.
- The fixed direct kernel bundle for that generic init starts with `Console`,
  `BootPackage`, and `ProcessSpawner` in the currently implemented system.
  This is the target generic-init minimum, not the full transitional
  `initConfig.init.caps` surface.
  The architecture-level target also includes `Timer`, `DeviceManager`,
  `FrameAllocator`, and per-process `VirtualMemory` once those authorities are
  ready to be part of init's stable bootstrap ABI. Until then,
  FrameAllocator, VirtualMemory, and Endpoint grants for child processes remain
  minted through `ProcessSpawner` spawn grants.

The target model removes the kernel-side service graph entirely. The manifest
stops being a kernel authority graph and becomes a **boot package** delivered
to init:

- List of embedded binaries (init needs them before any storage service
  exists; they can't be fetched from a filesystem that hasn't started).
- Init's config blob (CUE-encoded tree; what to spawn, with what
  attenuations, with what restart policy).
- Kernel boot parameters (memory limits, feature flags) consumed by the
  kernel itself, not forwarded to init.

The kernel spawns exactly one userspace process (init) with a fixed cap
bundle:

- `Console` — kernel serial wrapper (may be replaced later by a userspace
  log service, with init retaining a direct console cap for emergency use).
- `ProcessSpawner` — only init and its delegated supervisors hold this.
- `FrameAllocator` — physical frame authority for init's own allocations.
- `VirtualMemory` — per-process address-space authority for init.
- `DeviceManager` — enumerate/claim devices; init delegates device-specific
  slices to drivers.
- `Timer` — monotonic clock.
- `BootPackage` — read-only cap exposing the embedded binaries and the
  config blob.

Everything else — drivers, net-stack, filesystems, supervisors, apps —
init spawns at runtime via `ProcessSpawner` with appropriate attenuation.
No manifest `ServiceEntry`, no cross-service `CapRef`, no manifest exports.

### Pre-Init Boundary After Stage 6

Rule of thumb: **no userspace service runs before init.** The kernel's job is
primitive cap synthesis and a single-process handoff; init's job is the whole
service graph. Concretely, after Stage 6:

- **Stays in kernel pre-init:** memory map ingest, frame allocator, heap,
  paging, GDT/IDT/TSS, serial for kernel diagnostics, scheduler, ring
  dispatch, kernel-cap `CapObject` impls, ELF loading for init, boot
  package measurement (if attested boot is added).
- **Stays in manifest:** binaries list + init config blob + kernel boot
  params. Schema-wise, `ServiceEntry` and `CapSource::Service` disappear;
  `SystemManifest` shrinks to `binaries + initConfig + kernelParams`.
- **Moves to init:** service topology, cross-service cap wiring,
  attenuation, restart policies, dynamic spawn, cap export/import,
  supervision trees. Anything a service manager would do.
- **Moves to init or later services:** logging policy, config store,
  secrets, filesystem mounts, network configuration, device binding.

Edge cases that might *look* like they want a pre-init service but don't:

- **Early crash / panic handling.** Kernel-side panic handler, no service
  needed.
- **Recovery shell.** Kernel fallback: if init fails to reach a healthy
  state within a timeout (e.g. exits immediately, or never issues a
  liveness SQE), kernel optionally spawns a "recovery" binary from the
  boot package with the same cap bundle. Still just one userspace process
  at a time pre-supervisor-loop.
- **Attested/measured boot.** Kernel hashes binaries in the boot package
  before handing `BootPackage` to init. The measurement agent, if any,
  runs as a normal service spawned by init with a cap to the sealed
  measurements.
- **Early-boot console.** Kernel owns serial and exposes `Console` to init.
  A userspace log service can layer on top later; it is not pre-init.

### Legacy Manifest Fields After Stage 6

`ServiceEntry.caps`, `CapSource::Service`, and `ServiceEntry.exports` are
transitional init configuration, not kernel schema. The 15.4 schema split
deleted them from `schema/capos.capnp`, collapsed the service graph into
`initConfig: CueValue`, and kept kernel bootstrap on the first-service cap-table
builder. The remaining cleanup is to make that first-service bundle fixed
rather than manifest-selected:

1. Move shell-led and focused harness proofs behind an init-owned executor
   config instead of booting their binaries directly as init.
2. Embed or otherwise pin the generic init image as the only kernel-loaded
   userspace image. Partially landed (2026-05-25 23:26 UTC): the `init` image
   is embedded and loaded from `kernel::boot::INIT_ELF` whenever
   `init.binary == "init"` (see "Init Binary Embedding"). It is not yet the
   *only* kernel-loaded image: until step 1 moves the focused/shell proofs
   behind an init-owned executor, non-`"init"` PID-1 selectors are still
   kernel-loaded from `binaries`.
3. Replace per-manifest `initConfig.init.caps` with the fixed bootstrap cap
   bundle described above plus `BootPackage`.
4. Keep `initConfig.services` as ordinary init/supervisor configuration until a
   later libcapos or supervisor API gives it a more concrete format.

The re-export restriction added in `capos-config::validate_manifest_graph`
(service A exports cap sourced from `B.ep`) becomes moot at that point
because there are no kernel-owned manifest exports at all. It stays as
defensive validation for `initConfig.services` while the transitional
init-owned executor exists.

### Init Binary Embedding

Status: landed 2026-05-25 23:26 UTC as a hybrid keyed on the reserved init
selector (see below). Init is part of the kernel's bootstrap contract, not a
configuration choice: the cap bundle handed to init is a kernel ABI, the
`_start(ring, pid, …)` entry shape is a kernel ABI, and a version-mismatched
init is a footgun with no payoff in a single-init research OS. So the init ELF
ships *inside* the kernel binary via `include_bytes!`, not as a separate
manifest entry or Limine module.

Shape (as landed):

- `init/` stays a standalone crate with its own linker script and code
  model (user-space base `0x200000`, `static` relocation model, 4 KiB
  alignment). Not a workspace member; different build flags than the
  kernel.
- `kernel/build.rs` reads the prebuilt `init/` artifact (the Makefile passes
  `CAPOS_INIT_ELF` and orders `init` before the kernel; a conventional-path
  fallback covers a bare `cargo build` after init is built) and emits an
  `include_bytes!("…")` into a `kernel::boot::INIT_ELF: &[u8]` static. Driving
  init's build from `build.rs` was rejected to avoid duplicating its custom
  target/code-model flags; failing closed on a missing artifact is the chosen
  behavior.
- `initConfig.init.binary` is a generic "which binary is PID 1" selector, so
  embedding is **keyed on the reserved name**
  `capos_config::RESERVED_INIT_BINARY_NAME` (`"init"`). When `init.binary ==
  "init"`, kernel bootstrap parses `INIT_ELF` through the same `capos_lib::elf`
  path used for service binaries, creates the init address space via
  `AddressSpace::new_user()`, loads segments, populates the cap bundle
  (including `BootPackage`), and jumps — no Limine module lookup and no
  `binaries` resolution for that identity. When `init.binary` names any other
  binary (the shell on `run-smoke`, the ~70 focused test-as-PID-1 manifests),
  PID 1 still resolves from `SystemManifest.binaries` exactly as before.
- The reserved name `"init"` must not appear in `SystemManifest.binaries`:
  manifest validation (`capos-config` and `mkmanifest`) rejects it, since the
  kernel owns the `init` image. Real-init manifests drop their `init` entry;
  their `binaries` list is services-only.
- The embedded image is the canonical `init` binary, so init's own child
  spawns that reference `init` by name (e.g. `system-spawn.cue`'s
  spawn-hardening fixtures) still resolve: when init is embedded, `run_init`
  injects the embedded bytes into the `ProcessSpawner` binary set under the
  reserved name (the `BootPackage` cap serves only the serialized manifest
  bytes, which never carry the reserved entry). This keeps the spawnable set
  identical to the pre-embedding state without `init` re-entering the serialized
  manifest. Service binaries remain distinct `BootPackage` blobs.
- Measured-boot attestation (if added) covers the kernel ELF, which
  transitively covers init's bytes. Service binaries are hashed
  separately by the kernel before handing `BootPackage` to init.
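
The selector rule in the list above can be sketched as two small functions. This is a minimal illustration: `select_pid1_image`, `validate_binaries`, and `ImageError` are hypothetical names, and binaries are modeled as `(name, bytes)` pairs rather than the real manifest types.

```rust
/// Mirrors the reserved selector value named in the text.
pub const RESERVED_INIT_BINARY_NAME: &str = "init";

#[derive(Debug, PartialEq, Eq)]
pub enum ImageError {
    ReservedNameInManifest,
    UnknownBinary(String),
}

/// Manifest validation: the kernel owns the `init` image, so the manifest
/// must not carry an entry under the reserved name.
pub fn validate_binaries(binaries: &[(&str, &[u8])]) -> Result<(), ImageError> {
    if binaries.iter().any(|(name, _)| *name == RESERVED_INIT_BINARY_NAME) {
        return Err(ImageError::ReservedNameInManifest);
    }
    Ok(())
}

/// Pick the bytes for PID 1: the embedded image for the reserved name,
/// otherwise a lookup in the manifest's binary list.
pub fn select_pid1_image<'a>(
    selector: &str,
    embedded_init: &'a [u8],
    binaries: &[(&str, &'a [u8])],
) -> Result<&'a [u8], ImageError> {
    if selector == RESERVED_INIT_BINARY_NAME {
        return Ok(embedded_init);
    }
    binaries
        .iter()
        .find(|(name, _)| *name == selector)
        .map(|(_, bytes)| *bytes)
        .ok_or_else(|| ImageError::UnknownBinary(selector.to_string()))
}
```

Either branch returns plain bytes, which matches the point made below: the same ELF loader runs regardless of where the image came from.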

What this does *not* change:

- Init still runs in Ring 3 with its own page tables; embedding is byte
  packaging, not privilege merging.
- Init is still ELF-parsed at boot — the same loader and W^X enforcement
  apply. The only thing different is where the bytes came from.
- Service binaries (everything spawned after init) stay in the boot
  package as distinct blobs, exposed to init via `BootPackage`. They are
  *not* linked into the kernel; their lifecycle is independent of the
  kernel's.

What was rejected: fully linking init into the kernel crate (shared
compilation unit, shared text). That collapses the kernel/user build
boundary, couples linker scripts and code models, and puts init's
panics/UB inside the kernel's compilation context. The process-isolation
boundary survives that arrangement — but the build-time separation that
makes the boundary trustworthy does not. `include_bytes!` preserves the
separation; static linking destroys it.

```
Kernel boot
  │
  ├─ Create kernel caps: Console, Timer, DeviceManager, ProcessSpawner
  │
  └─ Spawn init with all kernel caps
       │
       init process (PID 1)
         │
         ├─ Phase 1: Core services (sequential — each depends on previous)
         │    ├─ DeviceManager.enumerate() → list of devices
         │    ├─ Spawn NIC driver with device-specific caps
         │    ├─ Wait for NIC driver to export Nic cap
         │    ├─ Spawn net-stack with Nic + Timer caps
         │    └─ Wait for net-stack to export NetworkManager cap
         │
         ├─ Phase 2: Higher-level services (can be parallel)
         │    ├─ Spawn http-service with TcpSocket cap from net-stack
         │    ├─ Spawn dns-resolver with UdpSocket cap
         │    └─ ...
         │
         └─ Phase 3: Applications
              ├─ Spawn app-a with HttpEndpoint("api.example.com")
              ├─ Spawn app-b with Fetch cap (trusted)
              └─ ...
```

### The Init Process in Detail

Init is a regular userspace process with privileged caps. At boot it is the
only process that holds `ProcessSpawner` (the right to create new processes)
and `DeviceManager` (the right to enumerate and claim devices). It can
delegate subsets of these to child supervisors.

```rust
// init/src/main.rs — this IS the system configuration

fn main(caps: CapSet) {
    let spawner = caps.get::<ProcessSpawner>("spawner");
    let devices = caps.get::<DeviceManager>("devices");
    let timer = caps.get::<Timer>("timer");
    let console = caps.get::<Console>("console");

    // === Phase 1: Hardware drivers ===

    // Find the NIC
    let nic_device = devices.find("virtio-net")
        .expect("no network device found");

    // Spawn NIC driver — gets ONLY its device's MMIO + IRQ
    let nic_driver = spawner.spawn(SpawnRequest {
        binary: "/sbin/virtio-net",
        caps: caps![
            "device_mmio" => nic_device.mmio(),
            "interrupt"   => nic_device.interrupt(),
            "log"         => console.clone(),
        ],
        restart: RestartPolicy::Always,
    });

    // The driver exports a Nic cap once initialized
    let nic: Cap<Nic> = nic_driver.exported("nic").wait();

    // === Phase 2: Network stack ===

    let net_stack = spawner.spawn(SpawnRequest {
        binary: "/sbin/net-stack",
        caps: caps![
            "nic"   => nic,
            "timer" => timer.clone(),
            "log"   => console.clone(),
        ],
        restart: RestartPolicy::Always,
    });

    let net_mgr: Cap<NetworkManager> = net_stack.exported("net").wait();

    // === Phase 3: HTTP service ===

    let tcp = net_mgr.create_tcp_pool();

    let http_service = spawner.spawn(SpawnRequest {
        binary: "/sbin/http-service",
        caps: caps![
            "tcp" => tcp,
            "log" => console.clone(),
        ],
        restart: RestartPolicy::Always,
    });

    let fetch: Cap<Fetch> = http_service.exported("fetch").wait();

    // === Phase 4: Applications ===

    // Trusted telemetry agent — gets full Fetch
    spawner.spawn(SpawnRequest {
        binary: "/sbin/telemetry",
        caps: caps![
            "fetch" => fetch.clone(),
            "log"   => console.clone(),
        ],
        restart: RestartPolicy::OnFailure,
    });

    // Sandboxed app — gets scoped HttpEndpoint
    let api_cap = fetch.attenuate(EndpointPolicy {
        origin: "https://api.example.com",
        paths: Some("/v1/users/*"),
        methods: Some(&["GET", "POST"]),
    });

    spawner.spawn(SpawnRequest {
        binary: "/app/my-service",
        caps: caps![
            "api" => api_cap,
            "log" => console.clone(),
        ],
        restart: RestartPolicy::OnFailure,
    });

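    // Init stays alive as the root supervisor. Simplified call: the
    // `supervisor_loop` shown under "Supervisor Loop" also takes the
    // `SpawnRequest`s it should re-spawn.
    supervisor_loop(&spawner);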
}
```

### Key Mechanisms

**Cap export.** A spawned process can export capabilities back to its parent
via the `ProcessHandle` (see Spawn Mechanism section). This is how the NIC
driver makes its `Nic` cap available to the network stack — init spawns the
driver, waits for it to export `"nic"`, then passes that cap to the next
process.

**Restart policy.** Encoded in `SpawnRequest`, enforced by the supervisor
loop in the spawning process. When a child exits unexpectedly:

1. Old caps held by the child are automatically revoked (kernel invalidates
   the process's cap table on exit)
2. Supervisor re-spawns with the same `SpawnRequest`
3. New instance gets fresh caps — same authority, new identity

**Dependency ordering.** Sequential in code: `wait()` on exported caps
blocks until the dependency is ready. No declarative dependency graph
needed — Rust's control flow *is* the dependency graph.
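
The "wait on an export" ordering can be simulated with threads and channels. This is a sketch only: services are modeled as threads, `Cap` is a plain label, and `spawn_service`, `wait_export`, and `boot_network` are hypothetical names rather than the `ProcessHandle` API.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for a typed capability; here it is only a label.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Cap(pub String);

/// Hypothetical handle returned by a spawn.
pub struct ProcessHandle {
    exports: mpsc::Receiver<Cap>,
    worker: thread::JoinHandle<()>,
}

impl ProcessHandle {
    /// Blocks until the child exports its cap.
    pub fn wait_export(&self) -> Cap {
        self.exports.recv().expect("child exited without exporting")
    }

    pub fn join(self) {
        self.worker.join().expect("child panicked");
    }
}

/// Spawn a "service" (a thread here) that consumes `inputs` and exports
/// one cap whose label records what it was built from.
pub fn spawn_service(name: &str, inputs: Vec<Cap>) -> ProcessHandle {
    let (tx, rx) = mpsc::channel();
    let name = name.to_string();
    let worker = thread::spawn(move || {
        let deps: Vec<String> = inputs.into_iter().map(|c| c.0).collect();
        let _ = tx.send(Cap(format!("{}({})", name, deps.join(","))));
    });
    ProcessHandle { exports: rx, worker }
}

/// Start-up order falls out of data flow: the second service cannot be
/// spawned before the first export exists, because that export is its input.
pub fn boot_network() -> Cap {
    let driver = spawn_service("nic", vec![]);
    let nic = driver.wait_export();
    let stack = spawn_service("net", vec![nic]);
    let net = stack.wait_export();
    driver.join();
    stack.join();
    net
}
```

No ordering metadata appears anywhere in the sketch. The dependency exists only because `spawn_service("net", ...)` needs a value that `wait_export()` has to produce first.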

## Service Taxonomy

Concrete categories of userspace services capOS expects to run. All spawned
by init (or a supervisor init delegates to) after Stage 6. None are
pre-init.

### Hardware Drivers

One process per managed device. Each holds exactly the caps for its own
hardware: a `DeviceMmio` slice, the corresponding `Interrupt` cap, and
optionally a `DmaRegion` cap carved out of the frame allocator. Exports a
typed device cap (`Nic`, `BlockDevice`, `Framebuffer`, `Gpu`, …). Examples:
virtio-net, virtio-blk, NVMe, AHCI, framebuffer/GPU.

### Platform Services

- **Logger / journal** — accepts `Log` cap writes, forwards to console
  and/or durable storage. Init and kernel bootstrap use a direct `Console`
  cap until the logger is up; afterwards new services get `Log` caps only.
- **Filesystem** — one per mounted volume. Consumes a `BlockDevice` cap,
  exports `Directory` / `File` caps. FAT, ext4, overlay, tmpfs.
- **Store** — capability-native content-addressed storage backing
  persistent capability state (`storage-and-naming-proposal.md`).
- **Network stack** — userspace TCP/IP (`networking-proposal.md`).
  Consumes `Nic` + `Timer`, exports `NetworkManager`, `TcpSocket`,
  `UdpSocket`, `TcpListener`.
- **DNS resolver** — consumes a `UdpSocket`, exports `Resolver`.
- **Config / secrets store** — reads the initial config from `BootPackage`,
  exposes runtime `Config` and `Secret` caps with per-key attenuation.
- **Cloud metadata agent** — detects IMDS / ConfigDrive / SMBIOS on cloud
  boot and delivers a `ManifestDelta` (`cloud-metadata-proposal.md`).
- **Upgrade manager** — orchestrates `CapRetarget` for live service
  replacement (`live-upgrade-proposal.md`).
- **Capability proxy** — makes selected local caps reachable over the network.
  The near-term shape is typed Cap'n Proto RPC or a schema-framed proxy,
  following Cloudflare's production pattern of schema-bundled Workers bindings
  to internal services; later remote-capability sessions can borrow
  Spritely/OCapN CapTP's session, handoff, and reference-lifetime model without
  treating current OCapN drafts as capOS ABI commitments. The proxy must never
  serialize local `CapId` values, endpoint generations, receiver selectors, or
  kernel/session ids as portable authority, and it must own explicit resource
  ledgers for remote refs, queued calls, streams, and retries. See
  [Cloudflare, Cap'n Proto, Workers RPC, and Cap'n Web](../research/cloudflare-capnproto-workers.md)
  and [Spritely, OCapN, and CapTP](../research/spritely-captp-ocapn.md).
- **Measurement / attestation agent** — consumes sealed kernel hashes
  from `BootPackage`, exposes `Quote` caps for remote attestation.
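
For the capability proxy, the rule that local `CapId` values never cross the wire suggests a per-session export table with a bounded ledger. The following is a sketch under assumptions: `ExportRef`, `ExportTable`, and `ProxyError` are hypothetical names, and the `u64` payload of `CapId` is invented for the example.

```rust
use std::collections::HashMap;

/// Local cap id; stays inside the proxy and is never serialized.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CapId(pub u64);

/// Session-scoped reference handed to the remote peer instead of a CapId.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ExportRef(pub u32);

#[derive(Debug, PartialEq, Eq)]
pub enum ProxyError {
    LedgerFull,
    UnknownRef,
}

/// One table per remote session, with an explicit limit on live refs.
pub struct ExportTable {
    next: u32,
    limit: usize,
    live: HashMap<ExportRef, CapId>,
}

impl ExportTable {
    pub fn new(limit: usize) -> Self {
        ExportTable { next: 0, limit, live: HashMap::new() }
    }

    pub fn export(&mut self, cap: CapId) -> Result<ExportRef, ProxyError> {
        if self.live.len() >= self.limit {
            return Err(ProxyError::LedgerFull);
        }
        let r = ExportRef(self.next);
        // Refs are never reused within a session, so a stale ref cannot
        // alias a cap exported later.
        self.next = self.next.checked_add(1).ok_or(ProxyError::LedgerFull)?;
        self.live.insert(r, cap);
        Ok(r)
    }

    pub fn resolve(&self, r: ExportRef) -> Result<CapId, ProxyError> {
        self.live.get(&r).copied().ok_or(ProxyError::UnknownRef)
    }

    pub fn release(&mut self, r: ExportRef) -> Result<(), ProxyError> {
        self.live.remove(&r).map(|_| ()).ok_or(ProxyError::UnknownRef)
    }
}
```

The remote peer only ever holds `ExportRef` values, which are meaningless outside the session that issued them. Queued calls, streams, and retries would need their own ledgers in the same style.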

### Supervisors

Per-subsystem restart managers that hold a narrowed `ProcessSpawner` plus
the caps of the subtree they own. If any child crashes, the supervisor
tears down and re-spawns the set. Example: `net-supervisor` owns NIC
driver + net-stack + DHCP client.

### Application Services

User-facing or user-spawned processes: HTTP servers, API gateways, worker
pools, shells, interactive tools. Hold only the narrow caps the supervisor
grants (`HttpEndpoint` for one origin, `Directory` for one mount, etc.).
Human users, service accounts, guests, and anonymous callers are represented
by session/profile services that grant scoped cap bundles; they are not kernel
subjects or ambient process credentials. See
[user-identity-and-policy-proposal.md](user-identity-and-policy-proposal.md).

### What Does *Not* Become a Service

- **Console / serial** — stays in the kernel as a `CapObject` wrapper.
  Small enough, needed for kernel diagnostics, no benefit from userspace
  isolation. A userspace log service can layer on top.
- **Frame allocator, virtual memory, scheduler, ring dispatch** — kernel
  primitives, exposed as caps but not as services.
- **Interrupt delivery, DMA mapping** — kernel mechanisms, exposed to
  drivers as caps.
- **Boot measurement** — if added, happens in the kernel before `BootPackage`
  exists; the measurement agent (userspace) only reports the measurements.

## Supervision

### Supervision Tree

Init doesn't have to supervise everything directly. It can delegate:

```
init (root supervisor)
  ├─ net-supervisor (holds: spawner subset, device caps)
  │    ├─ virtio-net driver
  │    ├─ net-stack
  │    └─ http-service
  └─ app-supervisor (holds: spawner subset, service caps)
       ├─ my-service
       └─ another-app
```

Each supervisor is a process that holds a `ProcessSpawner` cap (possibly
restricted to specific binaries) and the caps it needs to grant to children.
If `net-supervisor` crashes, init restarts it, and it re-spawns the entire
networking subtree.
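
As an illustration of what "the entire networking subtree" covers, the tree above can be modeled as plain data. This is a minimal sketch and not capOS code: `Node`, `proc_node`, `example_tree`, and `subtree_names` are hypothetical names, and the tree contents are copied from the diagram.

```rust
/// A node in a toy supervision tree: a process name plus the children it spawns.
pub struct Node {
    pub name: &'static str,
    pub children: Vec<Node>,
}

fn proc_node(name: &'static str, children: Vec<Node>) -> Node {
    Node { name, children }
}

/// The tree from the diagram above.
pub fn example_tree() -> Node {
    proc_node("init", vec![
        proc_node("net-supervisor", vec![
            proc_node("virtio-net driver", vec![]),
            proc_node("net-stack", vec![]),
            proc_node("http-service", vec![]),
        ]),
        proc_node("app-supervisor", vec![
            proc_node("my-service", vec![]),
            proc_node("another-app", vec![]),
        ]),
    ])
}

/// Returns the crashed process followed by everything below it, parents
/// before children, or `None` if no process has that name.
pub fn subtree_names(root: &Node, crashed: &str) -> Option<Vec<&'static str>> {
    if root.name == crashed {
        let mut names = Vec::new();
        collect(root, &mut names);
        return Some(names);
    }
    root.children.iter().find_map(|child| subtree_names(child, crashed))
}

fn collect(node: &Node, names: &mut Vec<&'static str>) {
    names.push(node.name);
    for child in &node.children {
        collect(child, names);
    }
}
```

Calling `subtree_names(&example_tree(), "net-supervisor")` yields the supervisor and its three children, which is the set init and the restarted supervisor bring back between them.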

### Supervisor Loop

```rust
fn supervisor_loop(children: &[SpawnRequest], spawner: &ProcessSpawner) {
    let mut handles: Vec<ProcessHandle> = children.iter()
        .map(|req| spawner.spawn(req.clone()))
        .collect();

    loop {
        // Wait for any child to exit
        let (index, exit_code) = wait_any(&handles);
        let req = &children[index];

        match req.restart {
            RestartPolicy::Always => {
                handles[index] = spawner.spawn(req.clone());
            }
            RestartPolicy::OnFailure if exit_code != 0 => {
                handles[index] = spawner.spawn(req.clone());
            }
            _ => {
                // Process exited normally, don't restart
            }
        }
    }
}
```
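
The restart decision inside the loop can also be isolated as a pure function, which makes the policy testable without a spawner. A minimal sketch under stated assumptions: `should_restart` and the `Never` variant are hypothetical (the loop above only names `Always` and `OnFailure`), and the exit code is typed `i64` to match `ProcessHandle.wait` in the schema.

```rust
/// Restart policies. `Always` and `OnFailure` appear in the loop above;
/// `Never` is an assumed name for its catch-all arm.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RestartPolicy {
    Always,
    OnFailure,
    Never,
}

/// Returns true when the supervisor should re-spawn a child that exited
/// with `exit_code`. A zero exit code counts as a normal exit.
pub fn should_restart(policy: RestartPolicy, exit_code: i64) -> bool {
    match policy {
        RestartPolicy::Always => true,
        RestartPolicy::OnFailure => exit_code != 0,
        RestartPolicy::Never => false,
    }
}
```

With this split, the loop body reduces to one `if should_restart(req.restart, exit_code)` guard around a single `spawn` call.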

### Socket Activation

systemd pre-creates a socket and passes the fd to the service on first
connection. In capOS, the supervisor does the same with caps:

**Eager** (default): supervisor spawns the child immediately with a
`TcpListener` cap. Child calls `accept()` and blocks.

**Lazy**: supervisor holds the `TcpListener` cap itself. On first incoming
connection (or on first `accept()` from a proxy cap), it spawns the child
and transfers the cap. The child code is identical in both cases.

```rust
// Lazy activation — supervisor holds the listener until needed
let listener = net_mgr.create_tcp_listener();
listener.bind([0,0,0,0], 8080);

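// This blocks until a connection arrives. Sketch only: `_conn` is never
// handed to the child here; a real supervisor would pass it along or wait
// for readiness without accepting.
let _conn = listener.accept();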

// Now spawn the actual service, giving it the listener
spawner.spawn(SpawnRequest {
    binary: "/app/web-server",
    caps: caps!["listener" => listener, "log" => console.clone()],
    restart: RestartPolicy::Always,
});
```
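
The hand-off itself is a one-shot ownership transfer, which Rust's move semantics can express directly. The following is a sketch only: `LazySlot` and `activate` are hypothetical names, and the type parameter `L` stands in for a `TcpListener` cap.

```rust
/// A service slot that keeps its listener until the first activation.
pub enum LazySlot<L> {
    /// The supervisor still owns the listener; no child has been spawned.
    Idle(L),
    /// The listener has been moved into a spawned child.
    Running,
}

impl<L> LazySlot<L> {
    /// On the first call, moves the listener into `spawn` and returns true.
    /// Later calls do nothing and return false.
    pub fn activate(&mut self, spawn: impl FnOnce(L)) -> bool {
        match std::mem::replace(self, LazySlot::Running) {
            LazySlot::Idle(listener) => {
                spawn(listener);
                true
            }
            LazySlot::Running => false,
        }
    }
}
```

Because the listener is moved rather than cloned, the supervisor cannot keep using it after activation, which mirrors transferring the cap to the child.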

## Configuration

See [docs/proposals/storage-and-naming-proposal.md](storage-and-naming-proposal.md)
for the full storage, naming, and configuration model.

Summary: the system topology is currently defined in a capnp-encoded
**system manifest** baked into the boot image. `tools/mkmanifest` compiles the
human-authored `system.cue`, `system-smoke.cue`, or focused manifest sources
such as `system-spawn.cue`, `system-devicemmio-grant.cue`, and
`system-wasi-random.cue` into the binary manifest. Default boot uses
standalone `init` and init-owned service-graph execution; focused shell-led
manifests still grant login/session/broker caps directly to `capos-shell` for
narrow smokes. Focused init-executor manifests let the separate `init` binary
validate and execute the manifest through `ProcessSpawner`; the old generic
kernel resolver has been replaced by first-service cap construction.
Manifest-declared `SpawnGrantSource::Kernel` entries cover the bounded DDF
authority surface (`DeviceMmio`, `DMAPool`, `Interrupt`, `HardwareAuditLog`)
and the wasm-host's optional `EntropySource` grant; the WASI host adapter
(see [wasi-host-adapter-proposal.md](wasi-host-adapter-proposal.md)) and the
POSIX adapter (see [posix-adapter-proposal.md](posix-adapter-proposal.md))
both run as ordinary userspace processes spawned through this same path.
Remaining cleanup is to move runtime configuration into a capability-based
store service once that service exists. See also the layered CUE configuration
model in
[system-configuration-proposal.md](system-configuration-proposal.md).

## Comparison with Traditional Approaches

| Concern | systemd/Linux | capOS |
|---|---|---|
| Service dependencies | `Wants=`, `After=`, `Requires=` | Implicit in cap graph |
| Sandboxing | seccomp, namespaces, AppArmor | Default: zero ambient authority |
| Socket activation | `ListenStream=`, fd passing protocol | Pass `TcpListener` cap |
| Restart policy | `Restart=on-failure` | Supervisor process loop |
| Logging | journald, `StandardOutput=journal` | `Log` cap in granted set |
| Resource limits | cgroups, `MemoryMax=`, `CPUQuota=` | Bounded allocator caps |
| Network access control | firewall rules (iptables/nftables) | Scoped `HttpEndpoint` / `TcpSocket` caps |
| Config format | INI-like unit files (~1500 directives) | Rust code or minimal manifest |
| Trusted computing base | systemd PID 1 (~1.4M lines) | Init process (hundreds of lines) |

## Spawn Mechanism

Spawning is a capability-gated operation. The kernel provides a
`ProcessSpawner` capability — only the holder can create new processes.

### Implemented Kernel Slice

The kernel now provides:

1. **`ProcessSpawner` capability** — a `CapObject` impl in
   `kernel/src/cap/process_spawner.rs`. Methods:
   - `spawn(name, binaryName, grants) -> handleIndex` — resolve a boot-package
     binary, load ELF, create address space (builds on existing `elf.rs`
     loader and `AddressSpace::new_user()` in `mem/paging.rs`), populate the
     initial cap table, schedule the process, and return the `ProcessHandle`
     through the ring result-cap list
   - the returned `ProcessHandle` cap lets the parent wait for child exit and
     request a deferred kill through `terminate`; exported caps are later
     lifecycle work

2. **Initial cap passing** — at spawn time, the kernel copies permitted parent
   cap references into the child's cap table or mints authorized child-local
   kernel caps. Raw grants preserve the source legacy badge. Endpoint-client
   grants may mint a requested legacy badge only from an endpoint owner or
   trusted parent endpoint result source; delegated client facets must preserve
   their existing service identity. Child-local Endpoint, FrameAllocator, and
   VirtualMemory grants are created for the child's process. Child-local
   endpoint grants return parent-side client facets as result caps instead of
   sharing the endpoint owner object. The parent's references are unaffected.
   Legacy endpoint badges are transitional; new multi-client service identity
   should use session-bound invocation context plus broker-granted service
   roots/facets.

3. **Cap export** — future lifecycle work will let a child register a cap by
   name in its `ProcessHandle`, making it available to the parent (or anyone
   holding the handle). This is the mechanism behind
   `nic_driver.exported("nic").wait()` once exported-cap lookup is added.
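
The difference between copying a cap reference into the child and moving it can be modeled with a toy table. This is an illustrative sketch, not the kernel's `CapTable`: `ToyCapTable`, `ToyGrantMode`, and `grant` are hypothetical, kernel objects are reduced to labels, and badges and child-local minting are left out.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Toy cap table: slot id -> shared kernel object (here just a label).
#[derive(Default)]
pub struct ToyCapTable {
    slots: HashMap<u32, Arc<str>>,
    next_id: u32,
}

/// Assumed stand-ins for a copying grant and a `move` grant.
#[derive(Clone, Copy)]
pub enum ToyGrantMode {
    Copy,
    Move,
}

impl ToyCapTable {
    /// Stores `object` in a fresh slot and returns the table-local id.
    pub fn insert(&mut self, object: Arc<str>) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        self.slots.insert(id, object);
        id
    }

    pub fn get(&self, id: u32) -> Option<&Arc<str>> {
        self.slots.get(&id)
    }
}

/// Grants the parent's slot `cap_id` to `child` and returns the child-local
/// slot id, or `None` if the parent does not hold that cap.
pub fn grant(
    parent: &mut ToyCapTable,
    child: &mut ToyCapTable,
    cap_id: u32,
    mode: ToyGrantMode,
) -> Option<u32> {
    let object = match mode {
        // Copy: the parent's reference is unaffected.
        ToyGrantMode::Copy => parent.slots.get(&cap_id)?.clone(),
        // Move: the parent gives up its slot.
        ToyGrantMode::Move => parent.slots.remove(&cap_id)?,
    };
    Some(child.insert(object))
}
```

Slot ids are local to each table, so the id the child sees need not match the parent's id for the same object.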

### Schema

```capnp
interface ProcessSpawner {
    spawn @0 (name :Text, binaryName :Text, grants :List(CapGrant)) -> (
        handleIndex :UInt16,
        capabilityManagerIndex :UInt16,
    );
    createPipe @1 (bufferBytes :UInt32) -> (readIndex :UInt16, writeIndex :UInt16);
}

struct CapGrant {
    name @0 :Text;
    capId @1 :UInt32;
    interfaceId @2 :UInt64;
    mode @3 :CapGrantMode;
    badge @4 :UInt64;
    source @5 :CapGrantSource;
}

struct CapGrantSource {
    union {
        capability @0 :Void;
        kernel @1 :KernelCapSource;
    }
}

enum CapGrantMode {
    raw @0;
    clientEndpoint @1;
    move @2;
    serviceObject @3;
}

interface ProcessHandle {
    wait @0 () -> (exitCode :Int64);
    terminate @1 () -> ();
}
```

**Note on capability passing:** Capabilities are referenced by cap table
slot IDs (`UInt32`), not by Cap'n Proto's native capability table mechanism.
`spawn()` returns the `ProcessHandle` and a `CapabilityManager` cap through
the ring result-cap list; `handleIndex` and `capabilityManagerIndex` identify
those transferred caps in the completion. The first slice passes a
boot-package `binaryName` instead of raw ELF bytes so the request stays
within the bounded ring parameter buffer. `terminate` (deferred kill) is
implemented on `ProcessHandle`; post-spawn grants and exported-cap lookup
remain future lifecycle work until their authority semantics are implemented.
capOS uses manual capnp dispatch (`CapObject` trait with raw message bytes,
not capnp-rpc), so cap references are plain integers and typed result caps use
the ring transfer-result metadata. See
[userspace-binaries-proposal.md](userspace-binaries-proposal.md) Part 7 for
the surrounding userspace bootstrap schema context, Part 4 for the POSIX
adapter surface that consumes `ProcessSpawner.createPipe` plus the
recording-shim fork-for-exec successor `posix_spawn` over the same Move-grant
path, and Part 5 for the WASI host adapter that runs as a userspace process
spawned through this same `ProcessSpawner` with manifest-supplied capability
grants ([wasi-host-adapter-proposal.md](wasi-host-adapter-proposal.md)).
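
To make the index indirection concrete, a caller might resolve the two indices roughly as follows. The real completion layout is not specified in this section, so `SpawnCompletion`, `SpawnedCaps`, and `resolve` are hypothetical, and rejecting two identical indices is an assumption rather than a documented rule.

```rust
/// Hypothetical view of a spawn completion: cap ids the kernel inserted into
/// the caller's table, plus the two indices from the `spawn` result struct.
pub struct SpawnCompletion {
    pub result_caps: Vec<u32>,
    pub handle_index: u16,
    pub capability_manager_index: u16,
}

/// Cap-table slot ids for the two caps returned by `spawn`.
#[derive(Debug, PartialEq)]
pub struct SpawnedCaps {
    pub process_handle: u32,
    pub capability_manager: u32,
}

/// Resolves both indices against the result-cap list. Returns `None` for an
/// out-of-range index or when both indices name the same entry.
pub fn resolve(completion: &SpawnCompletion) -> Option<SpawnedCaps> {
    if completion.handle_index == completion.capability_manager_index {
        return None;
    }
    let caps = &completion.result_caps;
    let process_handle = *caps.get(usize::from(completion.handle_index))?;
    let capability_manager =
        *caps.get(usize::from(completion.capability_manager_index))?;
    Some(SpawnedCaps { process_handle, capability_manager })
}
```

The point of the sketch is that the capnp result carries only small integers; the authority lives in the slots those integers select.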

### Relationship to Existing Code

The current kernel has these pieces in place:

- **ELF loading** (`kernel/src/elf.rs`) — parses PT_LOAD segments, validates
  alignment, and feeds the reusable spawn primitive behind `ProcessSpawner`.
- **Address space creation** (`kernel/src/mem/paging.rs`) —
  `AddressSpace::new_user()` creates isolated page tables with the kernel
  mapped in the upper half.
- **Cap table** (`kernel/src/cap/table.rs`) — `CapTable` with `insert()`,
  `get()`, `remove()`, transfer preflight, provisional insert, commit, and
  rollback helpers. Each `Process` owns one local table.
- **Process struct and scheduler** (`kernel/src/process.rs`,
  `kernel/src/sched.rs`) — a process table plus round-robin run queue are in
  place for both legacy manifest-spawned services and init-spawned children.
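
The provisional insert, commit, and rollback helpers follow a two-phase pattern that can be sketched in isolation. The code below is not the contents of `kernel/src/cap/table.rs`: `TwoPhaseTable` and its methods are hypothetical, and objects are reduced to string labels.

```rust
/// Slot state in a toy table that supports two-phase insertion.
enum Slot {
    Free,
    Provisional(&'static str),
    Live(&'static str),
}

pub struct TwoPhaseTable {
    slots: Vec<Slot>,
}

impl TwoPhaseTable {
    pub fn with_capacity(n: usize) -> Self {
        Self { slots: (0..n).map(|_| Slot::Free).collect() }
    }

    /// Reserves the first free slot without making it visible to `get`.
    /// Returns `None` when the table is full.
    pub fn provisional_insert(&mut self, object: &'static str) -> Option<usize> {
        let index = self.slots.iter().position(|s| matches!(s, Slot::Free))?;
        self.slots[index] = Slot::Provisional(object);
        Some(index)
    }

    /// Makes a provisional slot live. Returns false for any other slot state.
    pub fn commit(&mut self, index: usize) -> bool {
        if let Some(slot) = self.slots.get_mut(index) {
            if let Slot::Provisional(object) = *slot {
                *slot = Slot::Live(object);
                return true;
            }
        }
        false
    }

    /// Frees a provisional slot. Live slots are never rolled back.
    pub fn rollback(&mut self, index: usize) -> bool {
        if let Some(slot) = self.slots.get_mut(index) {
            if matches!(*slot, Slot::Provisional(_)) {
                *slot = Slot::Free;
                return true;
            }
        }
        false
    }

    /// Only committed slots are visible.
    pub fn get(&self, index: usize) -> Option<&'static str> {
        match self.slots.get(index) {
            Some(Slot::Live(object)) => Some(*object),
            _ => None,
        }
    }
}
```

Reserving first and committing last lets a multi-cap transfer fail partway through without leaving half-granted authority in the receiver's table.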

Generic capability transfer/release and the reusable `ProcessSpawner`
lifecycle path are complete enough for the focused init-owned spawn executor.
Default startup now uses standalone `init` for service-graph execution, while
focused shell-led startup remains for narrow smokes.
`ProcessSpawner.createPipe` extends the lifecycle surface with a bounded SPSC
kernel `Pipe` capability consumed by the POSIX adapter's recording-shim
fork-for-exec path (P1.3) and exposed as the `posix_spawn` successor on the
same Move-grant path. The DDF Task 5 grant-source families
(`devicemmio_grant_source.rs`, `dmapool_grant_source.rs`, and their
interrupt/audit peers) extend `SpawnGrantSource::Kernel` with the bounded
manager-issued DDF authority surface; production handle lifecycle,
hardware-backed driver wait/ack dispatch beyond bounded route proofs, and the
S.11.2 hostile-smoke gates remain open. Each spawned process also receives one
immutable session context (default-inherited from the parent or
broker-selected), used as the invocation subject for audit attribution and the
identity-policy boundary. Remaining lifecycle gaps are post-spawn grants,
runtime exported-cap lookup, restart supervision, and shrinking the
transitional manifest schema.
`ProcessHandle.terminate` (deferred kill) is implemented.

## Prerequisites

| Prerequisite | Status | Why |
|---|---|---|
| ELF loading + address spaces | Done (Stage 2-3) | `elf.rs`, `AddressSpace::new_user()` |
| Capability ring + cap_enter | Done (Stage 4/6 foundation) | Ring-based cap invocation with blocking waits |
| Scheduling + preemption (core) | Done (Stage 5) | Round-robin, PIT 100 Hz, context switch |
| Cross-process Endpoint IPC | Done (Stage 6 foundation) | CALL/RECV/RETURN routing through Endpoint objects |
| Generic cap transfer/release | Done (Stage 6, 2026-04-22/24) | Copy/move transfer, result-cap insertion, `CAP_OP_RELEASE`, epoch revocation, and revoked endpoint `Disconnected` error surface |
| ProcessSpawner + ProcessHandle | Done (Stage 6, 2026-04-22) | Init-driven spawn with grants, `wait` completion, hostile-input coverage; `terminate` (deferred kill) has since been implemented; post-spawn grants still future |
| ProcessSpawner.createPipe + recording-shim fork-for-exec | Done (POSIX adapter P1.3, 2026-05-07 09:55 UTC) | Bounded SPSC `Pipe` capability and Move-grant fork-for-exec successor; see [posix-adapter-proposal.md](posix-adapter-proposal.md) §Phase P1.3 and [userspace-binaries-proposal.md](userspace-binaries-proposal.md) Part 4 |
| DDF bootstrap-grant sources (`DeviceMmio`, `DMAPool`, `Interrupt`, `HardwareAuditLog`) | In progress (DDF Task 5) | Bounded manager-issued authority over `SpawnGrantSource::Kernel`; production handle lifecycle and S.11.2 hostile smokes remain open. See [device-driver-foundation.md Task 5](../backlog/hardware-boot-storage.md#task-5-userspace-dmapool-devicemmio-and-interrupt-authority-cap-surface) |
| Immutable per-process session context | Done (`kernel/src/session_context.rs`) | One session context per process, default-inherited or broker-selected; `make run-session-context` proof |
| Authority graph + quota design (Security Verification Track S.9) | Done (2026-04-21) | Defines transfer/spawn invariants, per-process quotas, and rollback rules; see `docs/authority-accounting-transfer-design.md` |

This proposal describes the target architecture. Individual pieces (like
`Fetch`/`HttpEndpoint`) are additive — they're userspace processes that
compose existing caps into higher-level ones. No kernel changes needed
beyond Stages 4-6.

## First Step After Transfer and ProcessSpawner — done 2026-04-23

The minimal demonstration of this architecture landed together with capability
transfer and `ProcessSpawner`:

1. **`ProcessSpawner` cap** in `kernel/src/cap/process_spawner.rs` wraps ELF
   loading and address-space creation behind a typed capability.
2. **Init spawns children** — focused `make run-spawn` boots a single-init
   manifest; the kernel boots only the separate `init` binary from
   `initConfig.init`, then `init` spawns the focused demo graph from
   `initConfig.services` through `ProcessSpawner`, grants child-local endpoint
   owners and client facets, then releases parent endpoint facets before waiting
   on each `ProcessHandle`.
3. **Cross-process cap invocation** — spawned client invokes the server's
   Endpoint cap, server replies, both print to console.

This exercises: spawn cap, initial cap passing, manifest-declared export
recording, cross-process cap invocation, hostile-input rejection, and
per-process resource exhaustion paths. Deleting the unused legacy kernel
resolver is post-milestone cleanup tracked in `docs/tasks/`.

## Open Questions

1. **Restart supervision.** Epoch-based cap revocation and generation-tagged
   stale reference detection are implemented for current grant/revoke flows.
   Restart policy still needs a supervisor contract that epoch-bumps caps
   served by the failed process, restarts from the manifest, and reconnects
   clients through explicit authority rather than ambient service lookup.

2. **Cap discovery.** How does a process learn what caps it was given?
   Resolved: name→(cap_id, interface_id) mapping passed at spawn via a
   well-known page (`CapSet`). See
   [userspace-binaries-proposal.md](userspace-binaries-proposal.md) Part 2.
   `cap_id` is the authority-bearing table handle. `interface_id` is the
   transported capnp `TYPE_ID` used by typed clients to check that the handle
   speaks the expected interface.

3. **Lazy spawning.** Should the init process start everything eagerly, or
   should caps be backed by lazy proxies that spawn the backing service on
   first invocation?

4. **Cap persistence.** If the system reboots, should the cap graph be
   reconstructable from saved state? Or is it always rebuilt from init code?

5. **Delegation depth.** Can an application further delegate its
   `HttpEndpoint` cap to a subprocess? If so, the HTTP gateway needs to
   support fan-out. If not, how is this restriction enforced?
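
For question 2, the resolved name→(cap_id, interface_id) lookup with an interface check can be sketched as below. The actual `CapSet` page layout lives in the userspace-binaries proposal and is not reproduced here; `CapEntry`, `LookupError`, and `lookup` are hypothetical names.

```rust
/// One entry of a toy `CapSet`: the name given at spawn, the table handle,
/// and the capnp type id of the interface the cap speaks.
pub struct CapEntry {
    pub name: &'static str,
    pub cap_id: u32,
    pub interface_id: u64,
}

#[derive(Debug, PartialEq)]
pub enum LookupError {
    /// No cap with that name was granted at spawn.
    NotGranted,
    /// A cap with that name exists but speaks a different interface.
    WrongInterface { found: u64 },
}

/// Finds a granted cap by name and checks that it speaks the interface the
/// typed client expects before returning its `cap_id`.
pub fn lookup(
    cap_set: &[CapEntry],
    name: &str,
    expected_interface: u64,
) -> Result<u32, LookupError> {
    let entry = cap_set
        .iter()
        .find(|e| e.name == name)
        .ok_or(LookupError::NotGranted)?;
    if entry.interface_id != expected_interface {
        return Err(LookupError::WrongInterface { found: entry.interface_id });
    }
    Ok(entry.cap_id)
}
```

A missing name is reported as `NotGranted` rather than triggering any fallback lookup, in keeping with the rule that a process cannot discover caps it was not given.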
