# Crash Recovery and Supervision: Prior-Art Survey

Survey of crash recovery, supervision, and failure propagation patterns across
production systems. Used as input for the capOS Crash Recovery proposal.

---

## 1. Erlang/OTP Supervision Trees

Erlang/OTP is the canonical prior art for declarative crash recovery in a
capability-shaped process model.

### Supervision strategies

A supervisor declares one of four restart strategies:

- **`one_for_one`**: only the crashed child is restarted; siblings are
  unaffected.
- **`one_for_all`**: when any child crashes, every child is terminated and
  then every child is restarted. Used when children have shared state.
- **`rest_for_one`**: the crashed child and all children started after it
  (in declaration order) are terminated and restarted. Used when later
  children depend on earlier ones.
- **`simple_one_for_one`**: a simplified `one_for_one` for dynamically added
  homogeneous workers.
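
The restart sets implied by the four strategies can be sketched in a few lines. This is an illustrative model only, not OTP's implementation: the function name `children_to_restart` is hypothetical, children are represented by plain strings in declaration order, and per-child restart types are ignored.

```python
def children_to_restart(strategy: str, children: list, crashed: str) -> list:
    """Return the children a supervisor would restart, in start order.

    `children` is the declaration (start) order from the supervisor spec.
    """
    idx = children.index(crashed)  # raises ValueError for an unknown child
    if strategy in ("one_for_one", "simple_one_for_one"):
        # Only the crashed child is affected.
        return [crashed]
    if strategy == "one_for_all":
        # Every sibling is terminated and restarted.
        return list(children)
    if strategy == "rest_for_one":
        # The crashed child plus everything declared after it.
        return children[idx:]
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, with children `["db", "cache", "web"]` and a crash in `"cache"`, `rest_for_one` restarts `cache` and `web` but leaves `db` running.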

### Restart intensity

Supervisors carry an `intensity` (maximum restart count) and a `period` (window
length in seconds). If more than `intensity` restarts occur within any
`period`-second window, the supervisor terminates all children and then itself,
escalating the failure to its own parent supervisor. The defaults are
`intensity = 1` and `period = 5`: one restart within five seconds is tolerated,
and a second restart inside the same window makes the supervisor give up.
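
A minimal sketch of the sliding-window check follows. It assumes the caller supplies timestamps in seconds; the class name `RestartBudget` and the exact handling of a restart that lands precisely on the window boundary are illustrative choices, not a description of OTP's code.

```python
from collections import deque


class RestartBudget:
    """Sliding-window restart budget (hypothetical helper)."""

    def __init__(self, intensity: int = 1, period: float = 5.0):
        self.intensity = intensity  # max restarts tolerated per window
        self.period = period        # window length in seconds
        self._restarts = deque()    # timestamps of recent restarts

    def record_restart(self, now: float) -> bool:
        """Record a restart at time `now`.

        Returns True while the budget holds, False once more than
        `intensity` restarts fall inside the last `period` seconds.
        """
        self._restarts.append(now)
        # Forget restarts that have slid out of the window.
        while self._restarts and now - self._restarts[0] > self.period:
            self._restarts.popleft()
        return len(self._restarts) <= self.intensity
```

With the default values, a restart at `t = 0` is within budget, and a second restart at `t = 2` exhausts it; a second restart at `t = 6` would not, because the first has left the window.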

Each child spec declares a `restart` type:

- `permanent` — always restarted.
- `transient` — restarted only on abnormal exit (exit reason other than
  `normal`, `shutdown`, or `{shutdown, Term}`).
- `temporary` — never restarted.
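
The restart-type rule above reduces to a small decision function. The sketch below models Erlang exit reasons as Python strings and tuples purely for illustration; `should_restart` and `is_normal_exit` are hypothetical names.

```python
def is_normal_exit(reason) -> bool:
    """True for `normal`, `shutdown`, or a `{shutdown, Term}`-shaped tuple."""
    if reason in ("normal", "shutdown"):
        return True
    return isinstance(reason, tuple) and len(reason) == 2 and reason[0] == "shutdown"


def should_restart(restart_type: str, reason) -> bool:
    """Decide whether a child with the given restart type is restarted."""
    if restart_type == "permanent":
        return True                        # always restarted
    if restart_type == "temporary":
        return False                       # never restarted
    if restart_type == "transient":
        return not is_normal_exit(reason)  # only on abnormal exit
    raise ValueError(f"unknown restart type: {restart_type}")
```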

### "Let it crash"

The design philosophy is to avoid defensive error handling at the crash site.
A process that encounters an unexpected condition should terminate immediately
rather than attempt repair, relying on its supervisor to restart it in a
known-good state. Error-recovery code introduces its own bugs; a restart from a
known-good init is safer.

Linked processes propagate `EXIT` signals bidirectionally. A supervisor traps
exits (`process_flag(trap_exit, true)`) and converts them to ordinary messages
`{'EXIT', Pid, Reason}`, allowing it to react rather than crash itself. Monitors
(`erlang:monitor/2`) give a unidirectional `{'DOWN', Ref, process, Pid, Reason}`
without the bidirectional link risk.

### Lesson for capOS

- Restart budgets (intensity + period) translate directly: the kernel service
  supervisor should maintain a crash-loop budget — max N restarts per T seconds
  — and escalate to a parent authority or enter degraded boot if exceeded.
- The three child restart types (`permanent`/`transient`/`temporary`) match the
  restart policy field a capOS service manifest would declare.
- "Let it crash" applies: a capability server that encounters an unexpected
  decode error or illegal state should exit rather than continue with corrupted
  internal state. The supervisor restarts it; stale client caps observe a
  `Disconnected` CQE before the server is live again.

---

## 2. systemd Service Recovery

systemd is the dominant Linux service supervisor. Its restart model is
policy-driven, external to the service.

### Restart= modes

The `Restart=` directive accepts: `no` (default), `on-success`, `on-failure`,
`on-abnormal`, `on-watchdog`, `on-abort`, or `always`.

- `on-failure` covers non-zero exit codes, signals (including core dump),
  operation timeouts, and watchdog timeout. It is the common production choice.
- `on-abnormal` covers signals, operation timeouts, and watchdog, but not
  non-zero exit codes.
- `always` restarts unconditionally.

### Timing

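`RestartSec` (default 100 ms) is the delay before a restart attempt. By default
it is a flat delay applied before every attempt, not a backoff; newer systemd
releases can opt into a growing delay with `RestartSteps=` and
`RestartMaxDelaySec=`.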

### Crash-loop budget

`StartLimitIntervalSec` (default 10 s) and `StartLimitBurst` (default 5) form
the crash-loop budget: more than `StartLimitBurst` starts within
`StartLimitIntervalSec` stops further automatic restarts and leaves the unit in
the failed state. It can still be started manually once the interval has
passed, and `systemctl reset-failed` clears the counter. This is the systemd
analogue of OTP `intensity`/`period`.

### Dependency cascades

`OnFailure=` lists units to activate when a service enters the failed state;
it is typically used to run a notification or diagnostic unit.

### Watchdog

`WatchdogSec` enables a software watchdog: the service must call
`sd_notify(0, "WATCHDOG=1")` at intervals shorter than `WatchdogSec`. If the
heartbeat is absent for the full interval, systemd kills and (if `Restart=`
includes watchdog triggers) restarts the service. This catches live-lock and
hang states that do not produce a crash signal.
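
On the service side, the heartbeat is a datagram sent to the Unix socket named by the `NOTIFY_SOCKET` environment variable, and the configured interval is exposed in `WATCHDOG_USEC`. The sketch below speaks that protocol directly from Python instead of linking libsystemd. The function names are hypothetical, and it simplifies by ignoring `WATCHDOG_PID`; pinging at half the interval is a common convention rather than a requirement.

```python
import os
import socket
from typing import Optional


def notify_supervisor(state: str) -> bool:
    """Send one notification datagram, e.g. "WATCHDOG=1".

    Returns False when NOTIFY_SOCKET is unset, i.e. nobody is listening.
    """
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):
        # Leading "@" denotes a Linux abstract-namespace socket.
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), addr)
    return True


def watchdog_ping_interval() -> Optional[float]:
    """Half of WATCHDOG_USEC in seconds, or None if no watchdog is set."""
    usec = os.environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2
```

A service main loop would call `notify_supervisor("WATCHDOG=1")` every `watchdog_ping_interval()` seconds, but only from a code path that proves real progress; a heartbeat sent from an unrelated timer thread would mask exactly the hangs the watchdog exists to catch.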

### Lesson for capOS

- A capability service watchdog translates to a periodic `sd_notify`-style
  ping to a watchdog capability. If the server does not renew within a
  budget, the supervisor sends `SIGKILL` (or the kernel analogue) and
  restarts.
- The crash-loop budget (`StartLimitIntervalSec`/`StartLimitBurst`) is the
  second time this pattern appears, reinforcing that a fixed restart budget
  per time window is the correct primitive.
- `RestartSec` (flat delay, not exponential) is simpler than Kubernetes backoff
  and appropriate for always-available system services.

---

## 3. Kubernetes: Probes and CrashLoopBackOff

Kubernetes separates health probes (liveness, readiness, startup) from the
container restart policy, giving operators fine-grained control.

### Probes

- **Liveness probe**: if it fails, kubelet kills the container and subjects it
  to the restart policy. Used to detect live-lock (process alive, making no
  progress).
- **Readiness probe**: if it fails, the pod's IP is removed from all matching
  Service EndpointSlices. No restart is triggered; the pod stays up but
  receives no traffic.
- **Startup probe**: disables liveness and readiness probes until it succeeds,
  giving slow-starting containers time to initialize without being killed
  prematurely.
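
How the three probes gate each other can be summarised as a small decision function. This is a simplified model, not kubelet logic: `probe_actions` is a hypothetical name, each probe is reduced to a single boolean, failure thresholds are ignored, and a `restart` result is still subject to the pod's restart policy.

```python
def probe_actions(startup_ok: bool, live_ok: bool, ready_ok: bool) -> dict:
    """Map current probe results to supervisor actions."""
    if not startup_ok:
        # Startup probe still pending: liveness and readiness are not
        # evaluated, so the container is neither killed nor routed to.
        return {"restart": False, "route_traffic": False}
    if not live_ok:
        # Liveness failure: kill the container and apply the restart policy.
        return {"restart": True, "route_traffic": False}
    # Alive: readiness alone decides whether traffic is routed.
    return {"restart": False, "route_traffic": ready_ok}
```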

### RestartPolicy

`Always`, `OnFailure`, or `Never`. With `Always` or `OnFailure`, a failed
container is restarted with exponential backoff: 10 s, 20 s, 40 s, ... capped
at 5 minutes. If the container runs successfully for 10 minutes, the backoff
counter resets.
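
The delay sequence and reset rule described above can be written out as follows. The constants come from the text (10 s base, 5 min cap, 10 min reset); the function names and the way the failure counter is threaded through are assumptions made for illustration.

```python
def backoff_delay(failures: int, base: float = 10.0, cap: float = 300.0) -> float:
    """Delay before the next restart after `failures` earlier failures."""
    # Clamp the exponent so a very long crash loop cannot overflow a float.
    exponent = min(max(failures, 0), 32)
    return min(base * (2 ** exponent), cap)


def next_backoff(failures: int, ran_for: float, reset_after: float = 600.0):
    """Return (delay, new_failure_count) after a container exits.

    `ran_for` is how long the container ran before this exit; a run of
    `reset_after` seconds or more resets the backoff.
    """
    if ran_for >= reset_after:
        failures = 0
    return backoff_delay(failures), failures + 1
```

The first seven delays are 10, 20, 40, 80, 160, 300, and 300 seconds.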

### CrashLoopBackOff

When the restart backoff delay is active and the pod is waiting before the
next attempt, the pod status shows `CrashLoopBackOff`. It is not a terminal
state — the pod will still be restarted — but it indicates the container is
stuck in a restart loop and kubelet is applying backoff.

### Lesson for capOS

- The readiness/liveness split maps cleanly: a capOS service can expose two
  status indicators — "alive" (process is running and heartbeating) and "ready"
  (service is accepting new capability requests). Supervisors and routing layers
  can use them independently.
- Exponential backoff with a cap (10 s → 5 min) and a reset window (10 min
  healthy) is appropriate for user-facing services that should self-heal but
  not spin continuously.
- The startup probe concept is relevant for services whose init phase takes
  longer than the steady-state heartbeat budget.

---

## 4. Fuchsia Component Framework

Fuchsia's Component Framework manages component lifecycles and capability
routing between components.

### Lifecycle states

A component instance progresses through: Created → Resolved → Started →
Stopped → (Shutdown) → Destroyed. Stopping preserves persistent state;
destroying removes it entirely.

### Client observation of a crashed component

When a Fuchsia component crashes, the kernel pauses the faulting thread and
delivers a message to registered exception channels. The component's process
is killed (as if via `zx_task_kill()`), which closes all Zircon channels held
by that process. Clients observing those channels receive
`ZX_CHANNEL_PEER_CLOSED`. Component manager receives `ZX_CHANNEL_PEER_CLOSED`
on the runner channel for the component, allowing it to detect and log the
crash.

This includes every client bound to the crashed component's exposed protocol
channels: each of those connections ends with `ZX_CHANNEL_PEER_CLOSED`.
Component manager then handles restarting the component (if configured). A new
binding request after restart provides a fresh channel; the pre-crash channel
is never reconnected automatically.

### Lesson for capOS

- The Fuchsia model confirms that the clean contract for server death in a
  capability system is channel close / peer-closed on all outstanding
  client channels. capOS should emit a `Disconnected` CQE to every caller
  that has a pending request or open session to a server that dies.
- There is no implicit re-connect: the client must explicitly re-acquire a
  new capability to the restarted service. Stale caps acquired before the
  crash must not be silently re-animated after restart.
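
One way to guarantee that stale caps are never re-animated is to stamp each capability with the instance generation it was minted for. The sketch below is a toy model under that assumption; `Registry`, `Cap`, and the epoch scheme are hypothetical and do not describe capOS's actual `CapTable`.

```python
from dataclasses import dataclass


class Disconnected(Exception):
    """The capability refers to a server instance that no longer exists."""


@dataclass(frozen=True)
class Cap:
    service: str
    epoch: int  # instance generation this cap was minted for


class Registry:
    def __init__(self):
        self._epochs = {}  # service name -> current instance generation

    def start(self, service: str) -> None:
        """Start or restart a service; all earlier caps become stale."""
        self._epochs[service] = self._epochs.get(service, 0) + 1

    def acquire(self, service: str) -> Cap:
        """Mint a cap bound to the currently running instance."""
        return Cap(service, self._epochs[service])

    def call(self, cap: Cap, request: str) -> str:
        if self._epochs.get(cap.service) != cap.epoch:
            raise Disconnected(cap.service)
        return f"{cap.service}#{cap.epoch} handled {request}"
```

A client holding a cap from before the restart gets `Disconnected` on its next call and must go through `acquire` again, mirroring Fuchsia's fresh-channel-per-binding behaviour.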

---

## 5. Microkernel Precedent: seL4 and Genode

### seL4

seL4 provides no built-in mechanism to notify a client when the process that
holds an endpoint dies. A thread fault (capability fault, VM fault, etc.)
triggers the thread's configured fault endpoint, which notifies a designated
fault-handler process. The fault handler can fix and resume, or terminate the
faulting thread. However, this is per-thread fault delivery — not a general
"server died, notify clients" mechanism.

If a server process is killed (all its capabilities revoked, its CNode
destroyed), outstanding `seL4_Call` callers remain blocked on the endpoint
permanently unless the endpoint object itself is also destroyed or a reply
capability is used. seL4 has no automatic dead-server notification for
waiting callers. Building supervision requires explicit userspace monitors
(e.g., a watchdog thread with a notification capability polled by the
supervisor).

### Genode

Genode's component model gives the parent ultimate control over its children.
When a component is destroyed (whether intentionally by the parent or due to a
crash), the kernel invalidates all capabilities whose associated RPC object is
destroyed, as a direct side effect of object destruction. Subsequent invocations
of those capabilities by other components produce an `Ipc_error` exception at the
call site.

The parent observes a graceful exit via the `exit()` RPC on the parent
interface; it receives no explicit crash notification from the kernel. Detecting
unexpected death requires the parent to poll state reports or use the heartbeat
mechanism in Genode's `init` component, which tracks `skipped_heartbeats` per
monitored child.

### Lesson for capOS

- seL4's silence-on-server-death confirms the gap: callers must not be
  silently blocked forever when a server dies. capOS must deliver a
  `Disconnected` CQE (or equivalent transport-level error) to every pending
  caller when the server capability is revoked or the process exits.
- Genode's implicit capability invalidation on object destruction is the
  right kernel primitive: the kernel, not userspace, ensures no stale cap
  can reach a destroyed object. capOS already has this via `CapTable` revocation.
- Active death notification to a supervisor capability (rather than polling)
  is the correct extension — analogous to OTP process monitors.
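
The two obligations above, failing every pending caller and actively notifying supervisors, can be modelled together. The following is a toy sketch with hypothetical names (`Kernel`, `submit_call`, `monitor`, `on_server_exit`); completion queues are plain lists standing in for whatever transport capOS uses to deliver CQEs.

```python
from collections import defaultdict


class Kernel:
    def __init__(self):
        self.pending = defaultdict(list)      # server -> [(client, request_id)]
        self.monitors = defaultdict(list)     # server -> [supervisor]
        self.completions = defaultdict(list)  # recipient -> completion queue

    def submit_call(self, client, server, request_id):
        """Record a call that is waiting for the server's reply."""
        self.pending[server].append((client, request_id))

    def monitor(self, supervisor, server):
        """Register a supervisor for death notification (OTP-monitor style)."""
        self.monitors[server].append(supervisor)

    def on_server_exit(self, server, reason):
        """Fail every pending call and notify every monitor exactly once."""
        for client, request_id in self.pending.pop(server, []):
            self.completions[client].append((request_id, "Disconnected"))
        for supervisor in self.monitors.pop(server, []):
            self.completions[supervisor].append(("DOWN", server, reason))
```

Calls pending on other servers are untouched, and no caller is left blocked on the dead one.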

---

## 6. Coredump and Minidump: Capture and Redaction

Core dumps contain a complete snapshot of a process's address space at the
time of the crash. The Linux kernel writes them via `core_pattern`; systemd
routes them through `systemd-coredump` running as a socket-activated service
to enforce access controls and journaling.

The primary security concern is that capability keys, cryptographic material,
and user credentials present in process memory at crash time are written
verbatim to the dump file. systemd-coredump stores dumps in a mode readable
only by root and the process owner, but it provides no built-in redaction of
sensitive memory regions. Disabling core dumps (`ulimit -c 0`) for
security-sensitive services is the common mitigation.

Two vulnerabilities disclosed in 2025 (CVE-2025-4598 in systemd-coredump and
CVE-2025-5054 in Apport) demonstrate that race conditions in coredump handlers
can let a local attacker read the core dump of a privileged (SUID) process and
so disclose sensitive memory contents.

### Lesson for capOS

- A capability OS dump is structurally more dangerous than a POSIX dump: the
  crashed process's CapTable may contain live capabilities to kernel resources
  that the dump reader does not possess. Dumping capability indices without
  revocation could allow replay.
- The correct policy on process crash is to revoke all capabilities of the
  crashed process before writing any dump — the kernel holds the only
  authoritative revocation path. A dump tool operating post-revocation sees
  only dead cap indices, not live authority.
- Memory regions tagged as containing key material (capability ring buffers,
  decrypted secrets) should be excluded from dumps; a `MADV_DONTDUMP` analogue
  applied to sensitive pages at allocation time is the mechanism.
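
The ordering and exclusion rules above can be sketched as a single capture routine. Everything here is hypothetical (`Process`, `Region`, `capture_dump`, the `dont_dump` flag); the point is only the order of operations: revoke first, then serialise, skipping tagged regions.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    data: bytes
    dont_dump: bool = False  # hypothetical MADV_DONTDUMP-style tag


@dataclass
class Process:
    caps: dict     # cap index -> live object, or None once revoked
    regions: list  # list of Region


def capture_dump(proc: Process) -> dict:
    """Revoke all authority, then snapshot only dumpable memory."""
    # Step 1: revoke every capability before anything is written out.
    for index in proc.caps:
        proc.caps[index] = None
    # Step 2: the dump records dead indices only, never live references,
    # and omits regions tagged as sensitive.
    return {
        "cap_indices": sorted(proc.caps),
        "regions": {r.name: r.data for r in proc.regions if not r.dont_dump},
    }
```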

---

## Applicability to capOS

Across all surveyed systems, four design invariants recur:

1. **Crash-loop budget.** Every surveyed supervisor bounds the restart rate
   (OTP `intensity`/`period`; systemd `StartLimitBurst`/
   `StartLimitIntervalSec`; Kubernetes' capped exponential backoff, surfaced as
   `CrashLoopBackOff`). capOS service manifests should carry a `maxRestarts` +
   `restartWindowSecs` budget; on exhaustion the supervisor enters a
   degraded-boot state rather than spinning.

2. **Dead-server notification is the kernel's job.** seL4 and Genode show the
   two failure modes of a silent kernel: on seL4, pending callers block
   forever; on Genode, callers get an `Ipc_error` but the parent learns of the
   crash only by polling. capOS must emit a `Disconnected` CQE to pending
   callers when a server's capability is revoked, and must revoke server
   capabilities atomically on process exit.

3. **No stale authority after restart.** A restarted service gets new
   capabilities — it does not inherit the pre-crash CapTable. Clients must
   re-acquire capabilities to the new instance. The Fuchsia model (fresh channel
   on new binding) and OTP model (new process Pid, old monitors fire `DOWN`)
   both enforce this.

4. **Watchdog caps complement passive monitoring.** systemd's `WatchdogSec` and
   Genode's heartbeat mechanism both address live-lock states that produce no
   crash signal. A watchdog capability that the service must renew periodically
   is the capOS translation: if the service fails to renew, the supervisor
   kills and restarts it.

---

## Sources

- Erlang OTP Supervisor Behaviour: https://www.erlang.org/doc/system/sup_princ.html
- Erlang stdlib supervisor module: https://www.erlang.org/doc/apps/stdlib/supervisor.html
- systemd.service(5) man page (Debian): https://manpages.debian.org/jessie/systemd/systemd.service.5.en.html
- Kubernetes Pod Lifecycle: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
- Kubernetes Probes: https://kubernetes.io/docs/concepts/configuration/liveness-readiness-startup-probes/
- Fuchsia Component Lifecycle: https://fuchsia.dev/fuchsia-src/concepts/components/v2/lifecycle
- Fuchsia Exception Handling: https://fuchsia.dev/fuchsia-src/concepts/kernel/exceptions
- Fuchsia Component Runner FIDL: https://fuchsia.dev/reference/fidl/fuchsia.component.runner
- seL4 Fault Handlers: https://docs.sel4.systems/Tutorials/fault-handlers.html
- seL4 IPC Tutorial: https://docs.sel4.systems/Tutorials/ipc.html
- Genode Recursive System Structure: https://genode.org/documentation/genode-foundations/20.05/architecture/Recursive_system_structure.html
- Genode Init Component: https://genode.org/documentation/genode-foundations/21.05/system_configuration/The_init_component.html
- systemd-coredump documentation: https://systemd.io/COREDUMP/
- CVE-2025-4598 systemd-coredump analysis: https://blogs.oracle.com/linux/analysis-of-cve-2025-4598
- Core dump security (Kicksecure): https://www.kicksecure.com/wiki/Core_Dumps