# Proposal: Crash Recovery and Supervision

How capOS handles unplanned process failure: propagating the death to capability
holders, recording a structured crash event, and restarting the service within a
bounded policy — all without resurrecting stale authority.

## Problem

[Live upgrade](live-upgrade-proposal.md) covers the planned case: a supervisor
quiesces a running service, transfers state, retargets caps, and exits the old
process in a controlled sequence.  Unplanned failure is different.  A process
that panics, faults, or is killed by the kernel OOM path leaves no `quiesce`
call, no state handoff, and no ordered `exit`.  The kernel marks the process
dead and epoch-bumps its caps, but nothing in the current model tells callers
what happened or gives the supervisor a policy-bounded path to respawn it.

The gaps are:

1. **Stale-cap observability.** Callers holding a cap to the dead process
   receive `disconnected` errors at the transport level (the epoch-revocation
   path from Stage 6 is in place), but there is no structured CQE event that
   carries crash context or lets the caller distinguish a crash from a
   planned termination.
2. **Crash metadata capture.** Panic location, fault address, and last SQE
   opcode are useful for operators but must not leak raw cap-table contents,
   local cap IDs, or buffer bytes, which would break the no-ambient-authority
   invariant.
3. **Bounded restart policy.** Re-spawning a crashing service without a budget
   produces crash-loop amplification; re-spawning must use the same broker and
   manifest authority that the original spawn used, not an escalated path.
4. **Watchdog liveness.** A process that hangs without crashing is not
   detected by crash handling alone.
5. **Degraded boot.** If a critical service fails to start, the system needs a
   safe fallback rather than a silent hang.

This proposal fills these gaps without touching the live-upgrade protocol and
without adding a god-object supervisor.

## User Stories

- An operator running `make run-smoke` sees a structured crash record in the
  audit log when a demo service panics, not a silent stale-cap error.
- A client process calling a crashed server receives a `disconnected`-class
  `CapException` promptly; the process does not block indefinitely.
- Init restarts a failed service up to the configured failure budget, then
  stops and declares the service permanently failed rather than looping forever.
- A watchdog-registered service that hangs (no panic, no exit) is detected
  within its timeout and restarted under the same policy.
- If the network stack fails before a shell connects, the manifest-declared
  emergency shell starts instead of leaving the system unresponsive.

## Design

### Stale-Cap Propagation

When the kernel marks a process dead (panic, fault, or explicit `terminate`
without a prior clean `exit`), it performs the same epoch-bump it already does
for released caps.  The existing `disconnected` value in `ExceptionType`
covers the transport error.  The new addition is a **death CQE**: a
`CapException { type: disconnected, message: "server-death" }` delivered to
any process with an outstanding CALL SQE whose target belongs to the dead
process.

From the caller's perspective, an unplanned crash looks identical to a
`force`-mode live upgrade that did not reattach: the in-flight CALL returns
`disconnected`, the cap's epoch is bumped, and any subsequent CALL on that cap also
returns `disconnected` until the supervisor retargets the cap to a fresh
instance.  No new CQE opcode is needed; the existing two-level error model
from [error-handling-proposal.md](error-handling-proposal.md) is sufficient.

**Invariants:**

- A `disconnected` CQE on an outstanding CALL must be delivered before the
  kernel recycles any frame that belonged to the dead process.  Frame reuse
  ordering is the same constraint that applies to the force-mode live-upgrade
  path.
- A cap whose epoch has been bumped must never route a new CALL to the dead
  process's address space, even transiently.  The epoch check is a load fence
  on the per-cap generation counter before any ring dispatch.
- Endpoint client facets held by the dead process are revoked at the same
  epoch bump.  Other processes' client facets to the same endpoint are not
  affected — they route to the endpoint owner, not to the crashed client.
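
The epoch-check invariant above can be sketched in a few lines of Rust. This is a minimal model under stated assumptions, not the kernel's implementation: `CapSlot`, `CapRef`, and `Dispatch` are hypothetical names, and a single acquire load stands in for the "load fence on the per-cap generation counter". A real kernel would also have to close the window between the check and the ring dispatch itself.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical routing slot for one capability (illustrative, not kernel API).
pub struct CapSlot {
    /// Generation counter, bumped when the target process dies.
    generation: AtomicU64,
}

/// A holder's reference remembers the generation it was minted at.
#[derive(Clone, Copy, Debug)]
pub struct CapRef {
    minted_generation: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Dispatch {
    Routed,
    Disconnected,
}

impl CapSlot {
    pub fn new() -> Self {
        CapSlot { generation: AtomicU64::new(0) }
    }

    /// Mint a reference valid for the current generation.
    pub fn mint(&self) -> CapRef {
        CapRef { minted_generation: self.generation.load(Ordering::Acquire) }
    }

    /// Death path: every previously minted reference becomes stale.
    pub fn bump_epoch(&self) {
        self.generation.fetch_add(1, Ordering::Release);
    }

    /// Compare generations before any dispatch; a stale reference never routes.
    pub fn dispatch(&self, cap: CapRef) -> Dispatch {
        if self.generation.load(Ordering::Acquire) == cap.minted_generation {
            Dispatch::Routed
        } else {
            Dispatch::Disconnected
        }
    }
}
```

In this model a reference minted before `bump_epoch` returns `Disconnected` on every later `dispatch`, while a reference minted afterwards routes normally, which mirrors the "stale until retargeted" behavior described above.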

### Crash Record Capture

When a process dies unplanned, the kernel appends a **crash record** to the
`AuditLog` cap held by the supervisor that spawned the process (not to a
global log visible to all processes).  The record is structured to support
operator debugging without leaking internal kernel state:

```capnp
# Proposed addition to schema/capos.capnp (Phase 1)

enum CrashKind {
    panic @0;         # Rust panic! path
    pageFault @1;     # unmapped or protection fault
    generalProtection @2;
    stackOverflow @3;
    illegalInstruction @4;
    kernelKill @5;    # explicit ProcessHandle.terminate
}

struct CrashRecord {
    processName @0 :Text;
    kind @1 :CrashKind;
    # Instruction pointer at death, relative to ELF load base.
    # Absolute virtual address is NOT included to avoid leaking
    # kernel-side layout or userspace ASLR seeds.
    faultOffsetInBinary @2 :UInt64;
    # Last SQE opcode dispatched for this process (0 = none in flight).
    lastSqeOpcode @3 :UInt8;
    # Session context ID of the process (opaque; matches AuditLog sessionId
    # for attribution without carrying cap-table or buffer contents).
    sessionContextId @4 :Data;
    # Monotonic kernel timestamp at death.
    timestampNs @5 :UInt64;
}
```

Fields explicitly **not** included: raw cap IDs, cap-table slot contents,
userspace buffer bytes, kernel heap pointers, or any data from the process's
address space beyond the fault offset.  The crash record is attributed to the
process's session context ID so it can be correlated with prior `AuditLog`
records without exposing the full cap graph.

The crash record is delivered through the same `AuditLog.record` path the
hardware-audit service already uses: the supervisor holds the `AuditLog` cap;
the kernel invokes it on the supervisor's ring (via a kernel-initiated RECV)
rather than on a shared global ring.
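
As a rough illustration of the redaction boundary, the sketch below builds a record from a kernel-internal death context and lets only the schema's fields through. `DeathContext` and `build_crash_record` are hypothetical names, and reporting offset 0 for an instruction pointer outside the ELF image is an assumption of this sketch; the proposal does not specify that case.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CrashKind {
    Panic,
    PageFault,
    GeneralProtection,
    StackOverflow,
    IllegalInstruction,
    KernelKill,
}

/// Rust mirror of the proposed `CrashRecord` schema.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CrashRecord {
    pub process_name: String,
    pub kind: CrashKind,
    pub fault_offset_in_binary: u64,
    pub last_sqe_opcode: u8,
    pub session_context_id: Vec<u8>,
    pub timestamp_ns: u64,
}

/// Hypothetical kernel-internal view at death. It holds absolute addresses,
/// so it must never be handed to the supervisor directly.
pub struct DeathContext<'a> {
    pub process_name: &'a str,
    pub kind: CrashKind,
    pub instruction_pointer: u64,
    pub elf_load_base: u64,
    pub elf_image_len: u64,
    pub last_sqe_opcode: Option<u8>,
    pub session_context_id: &'a [u8],
    pub now_ns: u64,
}

/// Produce the redacted record: only the load-relative offset leaves the kernel.
pub fn build_crash_record(ctx: &DeathContext<'_>) -> CrashRecord {
    // Assumption: an IP outside the image is reported as offset 0 rather
    // than leaking an arbitrary absolute address.
    let offset = ctx
        .instruction_pointer
        .checked_sub(ctx.elf_load_base)
        .filter(|off| *off < ctx.elf_image_len)
        .unwrap_or(0);
    CrashRecord {
        process_name: ctx.process_name.to_string(),
        kind: ctx.kind,
        fault_offset_in_binary: offset,
        // The schema uses 0 to mean "no SQE in flight".
        last_sqe_opcode: ctx.last_sqe_opcode.unwrap_or(0),
        session_context_id: ctx.session_context_id.to_vec(),
        timestamp_ns: ctx.now_ns,
    }
}
```

Keeping the absolute addresses in a separate input type makes the redaction a property of the type boundary: nothing in `CrashRecord` can carry a raw pointer or cap ID because no such field exists.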

### Bounded Restart Policy

The supervisor that spawned a failed process owns the restart decision.  The
restart budget is declared in the manifest's `initConfig.services` entry and
interpreted by init (or a delegated supervisor):

```text
// CUE representation (illustrative)
restart: {
    policy:       "on-failure"   // never | on-failure | always
    maxRestarts:  5              // total budget over the window
    windowSecs:   60             // sliding window for the budget
    backoffBase:  "1s"           // initial delay before first restart
    backoffMax:   "30s"          // ceiling on exponential backoff
    emergencyFallback: "shell"   // service name to promote if budget exhausted
}
```

**Backoff is bounded and service-class aware.** The exponential
`backoffBase`→`backoffMax` schedule suits user-facing services that should
self-heal without spinning (the Kubernetes `CrashLoopBackOff` lesson). For
always-available system services, the prior-art note's systemd lesson favors a
short flat delay so a transient fault recovers fast; such services set
`backoffBase == backoffMax` for flat `RestartSec`-style behavior. In both cases
`maxRestarts`/`windowSecs` is the hard give-up budget (the OTP
max-restart-intensity lesson), so neither model spins forever.
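
A minimal sketch of the delay schedule, assuming the delay doubles on each attempt (the proposal says "exponential" without fixing a multiplier) and that attempt 0 is the first restart. `restart_delay` is a hypothetical helper, not an existing init function.

```rust
use std::time::Duration;

/// Delay before restart number `attempt` (0-based), capped at `max`.
/// Assumption: the schedule doubles each time; the multiplier is not
/// specified by the proposal.
pub fn restart_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    // Saturate the factor instead of overflowing on very large attempts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    // If the multiplication itself overflows, fall back to the ceiling.
    base.checked_mul(factor).unwrap_or(max).min(max)
}
```

With `backoffBase: "1s"` and `backoffMax: "30s"` this yields 1 s, 2 s, 4 s, 8 s, 16 s, and then 30 s for every later attempt. Setting `base == max` collapses the schedule to the flat delay described for always-available system services.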

**Crash-loop detection.** If all `maxRestarts` attempts are used up within
`windowSecs`, the supervisor stops restarting and records a `budget-exhausted`
event.  The service is marked permanently failed until an operator issues an
explicit override through the `ProcessHandle` or re-spawns via a fresh
manifest reload.
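
The budget check can be sketched as a sliding window over restart timestamps. This is one plausible reading, not the specified algorithm: `RestartBudget`, `Decision`, and `operator_override` are hypothetical names, and the sketch assumes that `maxRestarts` restarts are allowed per window and that the next failure inside the window is the one that exhausts the budget.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Restart,
    BudgetExhausted,
}

/// Hypothetical per-service budget tracker kept by the supervisor.
pub struct RestartBudget {
    max_restarts: usize,
    window_ns: u64,
    /// Timestamps of restarts that still count against the window.
    recent: VecDeque<u64>,
    /// Latched once the budget is exhausted.
    permanently_failed: bool,
}

impl RestartBudget {
    pub fn new(max_restarts: usize, window_secs: u64) -> Self {
        RestartBudget {
            max_restarts,
            window_ns: window_secs.saturating_mul(1_000_000_000),
            recent: VecDeque::new(),
            permanently_failed: false,
        }
    }

    /// Called on each unplanned death with a monotonic timestamp.
    pub fn on_failure(&mut self, now_ns: u64) -> Decision {
        if self.permanently_failed {
            return Decision::BudgetExhausted;
        }
        // Restarts older than the window no longer count.
        while let Some(&oldest) = self.recent.front() {
            if now_ns.saturating_sub(oldest) >= self.window_ns {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        if self.recent.len() >= self.max_restarts {
            self.permanently_failed = true;
            Decision::BudgetExhausted
        } else {
            self.recent.push_back(now_ns);
            Decision::Restart
        }
    }

    /// Explicit operator action that clears the permanent-failure latch.
    pub fn operator_override(&mut self) {
        self.permanently_failed = false;
        self.recent.clear();
    }
}
```

The latch is what distinguishes this from a plain rate limiter: once exhausted, the service stays failed even after the window slides past the old restarts, until the operator clears it.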

**Authority preservation.** Each restart uses the original `ProcessSpawner`
call with the same `CapGrant` list that was used at initial spawn.  The
supervisor does not invent new grants or escalate authority.  If a grant source
was a `SpawnGrantSource::Kernel` DDF handle that is now invalidated (for
example, a DMA buffer whose owner quiesce failed), the restart fails closed
with a `spawn-grant-invalid` error rather than falling back to an ambient
grant.
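
The fail-closed rule can be illustrated with a small check that runs before each restart. The types here are simplified stand-ins: `CapGrant` is reduced to a name plus a validity flag, and `grants_for_restart` and `RestartError` are hypothetical; the real `CapGrant` and `SpawnGrantSource` types carry more than this sketch models.

```rust
/// Simplified stand-in for a grant recorded at initial spawn.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CapGrant {
    pub name: String,
    /// Whether the grant's source (for example a kernel DDF handle) is still valid.
    pub source_valid: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RestartError {
    /// Corresponds to the `spawn-grant-invalid` error in the text.
    SpawnGrantInvalid { grant: String },
}

/// Return exactly the original grant list, or fail closed.
/// No grant is added, widened, or substituted.
pub fn grants_for_restart(original: &[CapGrant]) -> Result<Vec<CapGrant>, RestartError> {
    if let Some(bad) = original.iter().find(|g| !g.source_valid) {
        return Err(RestartError::SpawnGrantInvalid { grant: bad.name.clone() });
    }
    Ok(original.to_vec())
}
```

The function has no code path that produces a grant absent from `original`, so a restart can only succeed with the same authority as the first spawn or not at all.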

**No resurrection of stale caps.** The restarted process receives a fresh cap
table.  The supervisor must call `CapRetarget` (from
[live-upgrade-proposal.md](live-upgrade-proposal.md)) to re-point existing
client caps to the new process.  If `CapRetarget` is not yet implemented,
clients observing `disconnected` must reconnect through the supervisor's
exported endpoint, which the supervisor re-registers after restart.

### Watchdog Capability

A service that can hang without crashing (blocked ring, infinite loop, deadlock
on a kernel-held lock it does not own) is not detected by exit-path crash
handling.  The watchdog provides periodic liveness proof:

```capnp
# Proposed future addition to schema/capos.capnp (Phase 3)

interface Watchdog {
    # Service calls this on every iteration of its main loop to
    # reset the deadline.  If not called within `timeoutNs` of the
    # last kick (or of registration), the supervisor is notified.
    kick @0 () -> ();
    # Unregister.  Safe to call during planned shutdown.
    cancel @1 () -> ();
}

interface WatchdogSource {
    # Register this process with the given timeout.
    # Returns a Watchdog the service holds and kicks.
    register @0 (processName :Text, timeoutNs :UInt64) -> (watchdogIndex :UInt16);
}
```

The supervisor grants a `Watchdog` cap (minted from a `WatchdogSource` it
holds) to each service it considers watchdog-registered.  If the kernel timer
fires without a `kick`, the supervisor receives a liveness-failure notification
and treats it identically to an unplanned crash: crash record, restart budget
check, backoff.
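
The deadline bookkeeping behind `kick` and `cancel` might look like the following sketch. `WatchdogState` is a hypothetical supervisor-side structure, timestamps are plain monotonic nanoseconds, and treating "exactly `timeoutNs` elapsed" as expired is an assumption the schema comment leaves open.

```rust
/// Hypothetical supervisor-side state for one registered watchdog.
pub struct WatchdogState {
    timeout_ns: u64,
    /// Time of the last kick, or of registration if never kicked.
    last_kick_ns: u64,
    cancelled: bool,
}

impl WatchdogState {
    /// Registration starts the first deadline.
    pub fn register(now_ns: u64, timeout_ns: u64) -> Self {
        WatchdogState { timeout_ns, last_kick_ns: now_ns, cancelled: false }
    }

    /// `Watchdog.kick`: reset the deadline.
    pub fn kick(&mut self, now_ns: u64) {
        self.last_kick_ns = now_ns;
    }

    /// `Watchdog.cancel`: planned shutdown, no liveness failure will fire.
    pub fn cancel(&mut self) {
        self.cancelled = true;
    }

    /// Evaluated when the timer fires; true means a liveness failure.
    pub fn is_expired(&self, now_ns: u64) -> bool {
        !self.cancelled && now_ns.saturating_sub(self.last_kick_ns) >= self.timeout_ns
    }
}
```

A `true` result from `is_expired` would feed the same restart-budget path as a crash, so a hung service consumes budget exactly as a panicking one does.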

The watchdog is an opt-in service-level contract, not a mandatory kernel
mechanism.  Services that are inherently event-driven (blocked on `cap_enter`
waiting for an SQE) do not need a watchdog; they will return `disconnected` to
callers if they stop processing.  Watchdog is primarily useful for services
with internal polling loops or external I/O not driven by the capOS ring.

### Degraded Boot

The manifest may declare an emergency fallback service that is promoted when a
critical service exhausts its restart budget before the system reaches a
usable state:

```text
// CUE (illustrative)
degradedBoot: {
    trigger:  "net-stack"    // if this service fails permanently...
    fallback: "shell"        // ...promote this service to interactive
    timeoutSecs: 30          // deadline from kernel handoff to readiness
}
```

The init process monitors service readiness.  If a declared critical service
fails to reach readiness within the timeout and has exhausted its restart
budget, init spawns the fallback service with a console cap and an audit cap so
an operator can inspect what failed.  The fallback service is not granted the
failed service's caps; it is a scoped interactive shell, not a repair agent
with escalated authority.
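
The promotion rule in the paragraph above reduces to a small decision function. This sketch assumes both conditions must hold (deadline passed and budget exhausted), as the text states; `BootAction`, `CriticalService`, `boot_action`, and `fallback_grants` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum BootAction {
    Proceed,
    KeepWaiting,
    PromoteFallback,
}

/// Hypothetical view init keeps of the declared trigger service.
pub struct CriticalService {
    pub ready: bool,
    pub budget_exhausted: bool,
}

/// Decide what init does for the trigger service at `elapsed_secs`
/// since kernel handoff.
pub fn boot_action(svc: &CriticalService, elapsed_secs: u64, timeout_secs: u64) -> BootAction {
    if svc.ready {
        BootAction::Proceed
    } else if svc.budget_exhausted && elapsed_secs >= timeout_secs {
        // Both conditions hold: readiness deadline missed and no restarts left.
        BootAction::PromoteFallback
    } else {
        BootAction::KeepWaiting
    }
}

/// The fallback gets a console cap and an audit cap, and nothing that
/// belonged to the failed service.
pub fn fallback_grants() -> [&'static str; 2] {
    ["console", "audit"]
}
```

Returning a fixed grant list from `fallback_grants`, rather than deriving it from the failed service, keeps the fallback a scoped shell by construction.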

## Relevant Research and Prior Art

### In-Tree Research Notes

- [docs/research/crash-recovery-supervision.md](../research/crash-recovery-supervision.md)
  is the dedicated prior-art survey for this proposal: supervision trees,
  restart budgets (OTP intensity/period, systemd `StartLimit`, Kubernetes
  `CrashLoopBackOff`), dead-server notification (Fuchsia `ZX_CHANNEL_PEER_CLOSED`,
  seL4 silence, Genode `Ipc_error`), and coredump-redaction concerns, each
  verified against primary sources.
- [docs/research/os-error-handling.md](../research/os-error-handling.md) grounds
  the stale-cap surface. It records what callers observe when a server dies in
  comparable systems: Zircon channels close and the peer observes
  `ZX_ERR_PEER_CLOSED`; Genode capabilities to a dead server become invalid and
  subsequent invocations produce `Ipc_error`; seL4 routes faults to a per-thread
  fault endpoint; KeyKOS/EROS routes them to the domain keeper. The shared
  lesson is that a dead server must surface as a *typed transport-level* signal,
  not a hung invocation — which is exactly the `disconnected` death CQE this
  proposal specifies.
- [docs/research/capnp-error-handling.md](../research/capnp-error-handling.md)
  fixes the meaning of `disconnected` in the four-kind capnp model: "connection
  to a necessary capability was lost," with the client response being
  re-establish-and-retry. This proposal reuses that existing classification
  rather than minting a new exception kind; the only addition is *when* the
  kernel emits it (unplanned death) and the paired `CrashRecord`.
- [docs/research/eros-capros-coyotos.md](../research/eros-capros-coyotos.md)
  documents the EROS/KeyKOS **keeper mechanism**: a capability to a separate
  domain that the kernel invokes on fault, which can inspect, terminate, or
  *restart* the faulting domain (process supervision is an explicit listed use).
  capOS's supervisor-owns-`ProcessHandle` model is the same shape with capnp
  typed methods instead of a keeper key, and the kernel never initiates the
  restart itself.
- [docs/research/genode.md](../research/genode.md) and
  [docs/research/sel4.md](../research/sel4.md) ground the no-resurrected-authority
  invariant: Genode's parent-supervised component tree with revocable
  capabilities, and seL4's hierarchical delegation plus `Revoke` over the
  capability derivation tree, both establish that a restarted child gets fresh
  authority and that revocation of the dead instance's caps is the supervisor's
  (parent's) responsibility, not an ambient lookup.

### External Precedent and Lessons

- **Erlang/OTP supervision trees.** The "let it crash" philosophy plus
  supervisor restart strategies (`one_for_one`, `one_for_all`, `rest_for_one`)
  and **max-restart-intensity** (`MaxR` restarts within `MaxT` seconds, after
  which the supervisor itself terminates) are the direct precedent for this
  proposal's per-service failure budget and crash-loop detection. The lesson:
  bound restarts over a sliding window and escalate (here: stop and mark
  permanently failed, optionally promote degraded boot) rather than loop
  forever.
- **systemd unit restart policy.** `Restart=on-failure|always`, the
  `RestartSec` delay, and `StartLimitIntervalSec`/`StartLimitBurst` are the
  precedent for the `policy`/`backoffBase`/`maxRestarts`/`windowSecs` fields.
  The lesson: separate the *whether* (policy) from the *pacing* (backoff) from
  the *give-up threshold* (burst limit).
- **Kubernetes liveness/readiness probes and `CrashLoopBackOff`.** Liveness
  probes (kubelet restarts a container that fails its probe) are the precedent
  for the `Watchdog.kick`/timeout design; readiness gating before promotion is
  the precedent for the degraded-boot readiness deadline; `CrashLoopBackOff`
  with exponential backoff is the precedent for capped exponential restart
  delay. The lesson: liveness is opt-in and orthogonal to crash detection — a
  hung-but-not-dead process needs an explicit liveness signal.
- **Fuchsia component lifecycle.** Component-manager-driven start/stop and
  rebinding in the routing graph parallel capOS's supervisor + `CapRetarget`
  reconnection. The dedicated research note above grounds Fuchsia's
  death-observation behavior (`ZX_CHANNEL_PEER_CLOSED`, no implicit reconnect); a
  deeper write-up of Fuchsia *component-manager restart and escrow semantics*
  remains research-needed (per the `docs/backlog/research-design-gaps.md`
  convention) before this proposal cites specific escrow behavior as grounding.

## Phasing

**Phase 1 — Stale-cap DISCONNECTED propagation and crash record (most
model-critical).**  The death CQE for in-flight CALLs is the highest-priority
item because it closes the model gap: callers can observe server death as a
typed transport error rather than a hung ring.  Crash record delivery to the
supervisor's `AuditLog` is paired here because it uses the same kernel death
path.  Requires: epoch-revocation from Stage 6 (done), `AuditLog` cap
(done), `CrashRecord` schema addition.

**Phase 2 — Bounded restart policy and crash-loop detection.**  Init reads
restart budget fields from `initConfig.services`, applies the configured backoff,
and stops at budget exhaustion.  Requires: Phase 1 crash record so init knows
whether a death was planned or unplanned; `CapRetarget` from live-upgrade Phase
1 to reconnect client caps after restart.

**Phase 3 — Watchdog capability.**  `WatchdogSource` and `Watchdog` schema,
kernel timer integration, supervisor-side timeout detection, and liveness-failure
events fed into the same restart budget path as Phase 2.

**Phase 4 — Degraded boot.**  Manifest parser reads `degradedBoot` fields;
init promotes the fallback service on budget exhaustion during the boot window.
Requires: Phase 2 budget tracking.

## Hazards and Invariants

**Frame reuse ordering.** The kernel must not return frames from a dead
process's address space to the frame allocator until all outstanding
`disconnected` CQEs for that process's caps have been delivered.  Violating
this could allow a concurrent `FrameAlloc` to map recycled memory into a new
process before the old process's CQEs complete, creating a window where a
stale `disconnected` CQE arrives after the frame holds new data.  The existing
DMA quiesce/scrub ordering in the DMA pool grant path is the model for this
constraint.

**No stale authority after restart.** A restarted process receives only the
grants declared in the original `ProcessSpawner.spawn` call.  The supervisor
must not silently re-grant caps that were revoked as part of the death epoch
bump.  In particular, any `DMAPool`-derived handle that was in active use at
crash time must be explicitly re-acquired through the grant-source path, not
recycled from the dead process's cap table.

**Restart does not bypass the authority broker.** If the original spawn was
gated on an `AuthorityBroker`-selected session context, the restart uses the
same broker path.  The supervisor cannot substitute a broader session context
or an anonymous context to make the restart succeed.

**Capability revocation precedes any dump.** The death epoch bump that
invalidates the crashed process's caps must complete before any crash record or
future coredump is produced. A record produced post-revocation sees only dead
cap indices, never live authority; a pre-revocation memory snapshot could
otherwise capture live cap indices or ring-buffer contents (the race class
behind recent coredump CVEs). Any future coredump extension must run only after
revocation and must not be readable by unprivileged dump readers.
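
The two ordering hazards so far (frames are released only after `disconnected` CQEs are delivered, and a crash record is produced only after the epoch bump) can be expressed as guarded steps. This is a sketch of the constraints, not of the kernel death path: `DeathPath` and `OrderError` are hypothetical, and the sketch deliberately imposes no order between CQE delivery and record emission because the text does not.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum OrderError {
    EpochNotBumped,
    CqesNotDelivered,
}

/// Hypothetical tracker for one process's teardown.
#[derive(Default)]
pub struct DeathPath {
    epoch_bumped: bool,
    cqes_delivered: bool,
}

impl DeathPath {
    /// Revoke the dead process's caps.
    pub fn bump_epoch(&mut self) {
        self.epoch_bumped = true;
    }

    /// Complete every outstanding CALL with `disconnected`.
    pub fn deliver_disconnected_cqes(&mut self) {
        self.cqes_delivered = true;
    }

    /// A record (or any future dump) may only be produced post-revocation.
    pub fn emit_crash_record(&self) -> Result<(), OrderError> {
        if self.epoch_bumped {
            Ok(())
        } else {
            Err(OrderError::EpochNotBumped)
        }
    }

    /// Frames go back to the allocator only after all CQEs are delivered.
    pub fn release_frames(&self) -> Result<(), OrderError> {
        if self.cqes_delivered {
            Ok(())
        } else {
            Err(OrderError::CqesNotDelivered)
        }
    }
}
```

Making the guarded steps return `Result` turns an ordering violation into an explicit error in this model, where a real kernel would more likely assert or encode the order in its teardown sequence.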

**Crash record isolation.** The crash record must not carry raw cap IDs, cap
table slot numbers, or any data read from the process's address space (stack
contents, heap contents, message buffers).  The fault offset is relative to
the binary load base, not an absolute virtual address, to avoid leaking kernel
layout or userspace address randomization.

**Watchdog authority is narrow.** A `Watchdog` cap proves liveness for exactly
one registered process.  It does not grant the holder any access to the
supervisor, the process's caps, or any other service.  It is a pure liveness
signal, not an authority surface.

## Relationship to Adjacent Proposals

- [live-upgrade-proposal.md](live-upgrade-proposal.md) — covers the planned
  case.  The `CapRetarget` primitive defined there is consumed by Phase 2 of
  this proposal to reconnect client caps after an unplanned restart.
  The force-mode `disconnected` delivery and epoch-revocation paths are
  shared; this proposal adds the death CQE and crash record on top.
- [service-architecture-proposal.md](service-architecture-proposal.md) —
  defines the supervisor tree and the `RestartPolicy` type currently parsed
  by init.  This proposal extends that policy with the budget, backoff, and
  budget-exhaustion fields, and binds crash handling to the supervisor that
  owns the `ProcessHandle`, not to a global daemon.
- [capos-service-proposal.md](capos-service-proposal.md) — defines the
  userspace service framework above `capos-rt`.  The watchdog `kick` call and
  readiness notification in Phase 3 are natural additions to the service
  lifecycle hooks that `capos-service` abstracts.
- [error-handling-proposal.md](error-handling-proposal.md) — the `disconnected`
  class in the two-level error model is the transport surface for stale-cap
  delivery.  This proposal does not add new error types; it specifies when and
  how `disconnected` is delivered for an unplanned death.
- [system-monitoring-proposal.md](system-monitoring-proposal.md) — crash
  records, restart events, budget-exhaustion notifications, and watchdog
  timeouts are all audit-worthy.  The monitoring proposal owns the operator
  visibility surface; this proposal defines the structured events that feed it.
- [resource-accounting-proposal.md](resource-accounting-proposal.md) — the
  failure budget is a quota: a count consumed by crash events and refilled by
  the sliding window.  The accounting model for this quota follows the same
  ledger-of-record pattern as memory and scheduling quotas.
