Crash Recovery and Supervision: Prior-Art Survey

Survey of crash recovery, supervision, and failure propagation patterns across production systems. Used as input for the capOS Crash Recovery proposal.


1. Erlang/OTP Supervision Trees

Erlang/OTP is the canonical prior art for declarative crash recovery in a capability-shaped process model.

Supervision strategies

A supervisor declares one of four restart strategies:

  • one_for_one: only the crashed child is restarted; siblings are unaffected.
  • one_for_all: when any child crashes, every child is terminated and then every child is restarted. Used when children have shared state.
  • rest_for_one: the crashed child and all children started after it (in declaration order) are terminated and restarted. Used when later children depend on earlier ones.
  • simple_one_for_one: a simplified one_for_one for dynamically added homogeneous workers.
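
As a rough illustration of how the strategies differ, the sketch below computes which children a supervisor would restart after a single crash. It is a Python model rather than OTP code: the function name children_to_restart and the child names are hypothetical, and termination order and shutdown timeouts are ignored.

```python
def children_to_restart(strategy, children, crashed):
    """Return the children to restart, given `children` in declaration order."""
    if crashed not in children:
        raise ValueError(f"unknown child: {crashed!r}")
    if strategy in ("one_for_one", "simple_one_for_one"):
        return [crashed]
    if strategy == "one_for_all":
        return list(children)
    if strategy == "rest_for_one":
        # The crashed child plus everything declared after it.
        return children[children.index(crashed):]
    raise ValueError(f"unknown strategy: {strategy!r}")


children = ["db", "cache", "api"]
print(children_to_restart("one_for_one", children, "cache"))   # ['cache']
print(children_to_restart("one_for_all", children, "cache"))   # ['db', 'cache', 'api']
print(children_to_restart("rest_for_one", children, "cache"))  # ['cache', 'api']
```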

Restart intensity

Supervisors carry an intensity (max restart count) and period (seconds window). If more than intensity restarts occur in any rolling period-second window, the supervisor terminates all children and then itself, escalating the failure to its own parent supervisor. The defaults are intensity = 1 and period = 5; that is, one restart per five seconds before the supervisor gives up.
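
The rolling-window rule can be sketched as a small counter. This is a minimal Python model under stated assumptions, not the OTP implementation: the class name RestartBudget is hypothetical, and the current time is passed in explicitly so the behaviour is easy to follow.

```python
from collections import deque


class RestartBudget:
    """Sliding-window restart counter modelled on OTP intensity/period."""

    def __init__(self, intensity=1, period=5.0):
        self.intensity = intensity
        self.period = period
        self._restarts = deque()  # timestamps of recent restarts, oldest first

    def record_restart(self, now):
        """Record a restart at time `now` (seconds).

        Returns True if the restart is within budget, False if the
        supervisor should give up and escalate to its parent.
        """
        self._restarts.append(now)
        # Drop restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] > self.period:
            self._restarts.popleft()
        return len(self._restarts) <= self.intensity


budget = RestartBudget()           # defaults: intensity=1, period=5
print(budget.record_restart(0.0))  # True: first restart is within budget
print(budget.record_restart(2.0))  # False: second restart inside 5 s, escalate
```

With the defaults, two crashes six seconds apart would both be tolerated, because the first restart has left the window by the time the second is recorded.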

Each child spec declares a restart type:

  • permanent — always restarted.
  • transient — restarted only on abnormal exit (exit reason other than normal, shutdown, or {shutdown, Term}).
  • temporary — never restarted.
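
The three restart types reduce to a small decision function. The sketch below is a Python model, not Erlang: the name should_restart is hypothetical, and exit reasons are represented as the strings "normal" and "shutdown", a ("shutdown", term) tuple, or any other value for an abnormal exit.

```python
def should_restart(restart_type, exit_reason):
    """Decide whether a child is restarted after exiting with `exit_reason`."""
    benign = exit_reason in ("normal", "shutdown") or (
        isinstance(exit_reason, tuple)
        and len(exit_reason) == 2
        and exit_reason[0] == "shutdown"
    )
    if restart_type == "permanent":
        return True
    if restart_type == "temporary":
        return False
    if restart_type == "transient":
        return not benign
    raise ValueError(f"unknown restart type: {restart_type!r}")


print(should_restart("transient", "normal"))               # False
print(should_restart("transient", ("shutdown", "drain")))  # False
print(should_restart("transient", "badarg"))               # True
```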

“Let it crash”

The design philosophy is to avoid defensive error-handling at the crash site. A process that encounters an unexpected condition should exit immediately instead of attempting local repair, relying on its supervisor to restart it in a known-good state. Error recovery code introduces its own bugs; a clean restart from a known-good init is safer.

Linked processes propagate EXIT signals bidirectionally. A supervisor traps exits (process_flag(trap_exit, true)) and converts them to ordinary messages {'EXIT', Pid, Reason}, allowing it to react rather than crash itself. Monitors (erlang:monitor/2) give a unidirectional {'DOWN', Ref, process, Pid, Reason} without the bidirectional link risk.

Lesson for capOS

  • Restart budgets (intensity + period) translate directly: the kernel service supervisor should maintain a crash-loop budget — max N restarts per T seconds — and escalate to a parent authority or enter degraded boot if exceeded.
  • The three child restart types (permanent/transient/temporary) match the restart policy field a capOS service manifest would declare.
  • “Let it crash” applies: a capability server that encounters an unexpected decode error or illegal state should exit rather than continue with corrupted internal state. The supervisor restarts it; stale client caps observe a Disconnected CQE before the server is live again.

2. systemd Service Recovery

systemd is the dominant Linux service supervisor. Its restart model is policy-driven, external to the service.

Restart= modes

The Restart= directive accepts: no (default), on-success, on-failure, on-abnormal, on-watchdog, on-abort, or always.

  • on-failure covers non-zero exit codes, termination by signal (including core dumps), operation timeouts, and watchdog timeouts. It is the common production choice.
  • on-abnormal covers signals, operation timeouts, and watchdog, but not non-zero exit codes.
  • always restarts unconditionally.
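
The mode semantics can be summarised as a lookup table, following the table in systemd.service(5). This is an illustrative Python sketch: the cause labels ("clean", "exit-code", "signal", "timeout", "watchdog") and the function name unit_restarts are hypothetical shorthand, not systemd identifiers.

```python
# Exit causes that lead to an automatic restart under each Restart= setting.
# "clean" = clean exit code or signal, "exit-code" = unclean exit code,
# "signal" = unclean signal (including core dump).
RESTART_CAUSES = {
    "no": set(),
    "on-success": {"clean"},
    "on-failure": {"exit-code", "signal", "timeout", "watchdog"},
    "on-abnormal": {"signal", "timeout", "watchdog"},
    "on-watchdog": {"watchdog"},
    "on-abort": {"signal"},
    "always": {"clean", "exit-code", "signal", "timeout", "watchdog"},
}


def unit_restarts(restart_setting, cause):
    """Return True if a unit with this Restart= value restarts after `cause`."""
    return cause in RESTART_CAUSES[restart_setting]


print(unit_restarts("on-failure", "exit-code"))   # True
print(unit_restarts("on-abnormal", "exit-code"))  # False
```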

Timing

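RestartSec (default 100 ms) is the delay before a restart attempt. By default it is a flat delay between attempts, not a backoff; newer systemd releases can optionally grow the delay across successive restarts via RestartSteps= and RestartMaxDelaySec=.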

Crash-loop budget

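StartLimitIntervalSec (default 10 s) and StartLimitBurst (default 5) form the crash-loop budget: once a unit has been started more than StartLimitBurst times within StartLimitIntervalSec, further start attempts are refused and systemd stops restarting it automatically, leaving it in the failed state. The unit stays down until an operator intervenes (for example with systemctl reset-failed, or a manual start after the interval has passed) or the system reboots. This is the systemd analogue of OTP intensity/period.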

Dependency cascades

OnFailure= lists units to activate when a service enters the failed state; it is typically used to run a notification or diagnostic unit.

Watchdog

WatchdogSec enables a software watchdog: the service must call sd_notify(0, "WATCHDOG=1") at intervals shorter than WatchdogSec. If the heartbeat is absent for the full interval, systemd kills and (if Restart= includes watchdog triggers) restarts the service. This catches live-lock and hang states that do not produce a crash signal.
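
The supervisor side of such a watchdog amounts to a deadline that each heartbeat pushes forward. The following is a minimal Python sketch, not systemd's implementation: the class name WatchdogTimer is hypothetical, and time is passed in explicitly rather than read from a clock.

```python
class WatchdogTimer:
    """Supervisor-side deadline tracker for one service."""

    def __init__(self, timeout, now):
        self.timeout = timeout
        self.last_ping = now

    def ping(self, now):
        """Record a heartbeat from the service at time `now`."""
        self.last_ping = now

    def expired(self, now):
        """True once a full timeout has passed without a heartbeat."""
        return now - self.last_ping >= self.timeout


wd = WatchdogTimer(timeout=10.0, now=0.0)
wd.ping(4.0)             # service heartbeats well inside the interval
print(wd.expired(13.0))  # False: 9 s since the last ping
print(wd.expired(14.0))  # True: 10 s of silence, kill and restart
```

A service would typically send its heartbeat at some fraction of the timeout so that a single delayed ping does not trigger a kill.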

Lesson for capOS

  • A capability service watchdog translates to a periodic sd_notify-style ping to a watchdog capability. If the server does not renew within a budget, the supervisor sends SIGKILL (or the kernel analogue) and restarts.
  • The crash-loop budget (StartLimitIntervalSec/StartLimitBurst) is the second time this pattern appears, reinforcing that a fixed restart budget per time window is the correct primitive.
  • RestartSec (flat delay, not exponential) is simpler than Kubernetes backoff and appropriate for always-available system services.

3. Kubernetes: Probes and CrashLoopBackOff

Kubernetes separates health probes (liveness, readiness, startup) from the container restart policy, giving operators fine-grained control.

Probes

  • Liveness probe: if it fails, kubelet kills the container and subjects it to the restart policy. Used to detect live-lock (process alive, making no progress).
  • Readiness probe: if it fails, the pod’s IP is removed from all matching Service EndpointSlices. No restart is triggered; the pod stays up but receives no traffic.
  • Startup probe: disables liveness and readiness probes until it succeeds, giving slow-starting containers time to initialize without being killed prematurely.
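
How the three probes interact can be sketched as one decision per probe round. This is a simplified Python model, not kubelet logic: the function name probe_actions is hypothetical, failure thresholds are ignored, a pending startup probe is treated as "wait" rather than "kill", and a container about to be killed is assumed to receive no traffic.

```python
def probe_actions(startup_ok, live_ok, ready_ok):
    """Return (kill_container, route_traffic) for one probe round."""
    if not startup_ok:
        # Startup probe has not succeeded yet: the other probes are not run.
        return (False, False)
    kill = not live_ok
    route = ready_ok and not kill
    return (kill, route)


print(probe_actions(False, False, False))  # (False, False): still starting
print(probe_actions(True, True, False))    # (False, False): alive, no traffic
print(probe_actions(True, False, True))    # (True, False): killed and restarted
print(probe_actions(True, True, True))     # (False, True): serving
```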

RestartPolicy

Always, OnFailure, or Never. With Always or OnFailure, a failed container is restarted with exponential backoff: 10 s, 20 s, 40 s, … capped at 5 minutes. If the container runs successfully for 10 minutes, the backoff counter resets.
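
The delay sequence and reset rule above can be written down directly. This is a Python sketch of the arithmetic only, using the figures quoted in this section; the function names backoff_delay and next_failure_count are hypothetical, and whether the very first restart is delayed is not modelled.

```python
def backoff_delay(failures, base=10.0, cap=300.0):
    """Delay in seconds before the restart that follows failure number `failures`."""
    if failures < 1:
        return 0.0
    return min(base * 2 ** (failures - 1), cap)


def next_failure_count(previous_failures, ran_for, reset_after=600.0):
    """Failure count after a container exits having run for `ran_for` seconds."""
    # A long healthy run forgets the earlier failures.
    return 1 if ran_for >= reset_after else previous_failures + 1


print([backoff_delay(n) for n in range(1, 8)])
# [10.0, 20.0, 40.0, 80.0, 160.0, 300.0, 300.0]
print(next_failure_count(5, ran_for=30.0))   # 6
print(next_failure_count(5, ran_for=900.0))  # 1
```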

CrashLoopBackOff

When the restart backoff delay is active and the pod is waiting before the next attempt, the pod status shows CrashLoopBackOff. It is not a terminal state — the pod will still be restarted — but it indicates the container is stuck in a restart loop and kubelet is applying backoff.

Lesson for capOS

  • The readiness/liveness split maps cleanly: a capOS service can expose two status indicators — “alive” (process is running and heartbeating) and “ready” (service is accepting new capability requests). Supervisors and routing layers can use them independently.
  • Exponential backoff with a cap (10 s → 5 min) and a reset window (10 min healthy) is appropriate for user-facing services that should self-heal but not spin continuously.
  • The startup probe concept is relevant for services whose init phase takes longer than the steady-state heartbeat budget.

4. Fuchsia Component Framework

Fuchsia’s Component Framework manages component lifecycles and capability routing between components.

Lifecycle states

A component instance progresses through: Created → Resolved → Started → Stopped → (Shutdown) → Destroyed. Stopping a component preserves its persistent state; destroying it removes that state entirely.

Client observation of a crashed component

When a Fuchsia component crashes, the kernel pauses the faulting thread and delivers a message to registered exception channels. The component’s process is killed (as if via zx_task_kill()), which closes all Zircon channels held by that process. Clients observing those channels receive ZX_CHANNEL_PEER_CLOSED. Component manager receives ZX_CHANNEL_PEER_CLOSED on the runner channel for the component, allowing it to detect and log the crash.

Clients that were bound to a crashed component’s exposed protocol channels also observe ZX_CHANNEL_PEER_CLOSED. Component manager then handles restarting the component (if configured). A new binding request after restart provides a fresh channel — there is no automatic reconnection of the pre-crash channel.
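
The client-visible contract (old channel dead, new binding gives a fresh channel) can be modelled in a few lines. This is a toy Python model, not the Fuchsia or Zircon API: PeerClosed, Channel, and ToyManager are hypothetical names, and restart is collapsed into a single step.

```python
class PeerClosed(Exception):
    """Raised when a call is made on a channel whose server end is gone."""


class Channel:
    def __init__(self, instance):
        self.instance = instance
        self.peer_closed = False

    def call(self, request):
        if self.peer_closed:
            raise PeerClosed(f"instance {self.instance} is gone")
        return f"instance {self.instance} handled {request}"


class ToyManager:
    """Hands out channels to the current instance of one component."""

    def __init__(self):
        self.instance = 1
        self._open = []

    def bind(self):
        channel = Channel(self.instance)
        self._open.append(channel)
        return channel

    def crash_and_restart(self):
        # Process death closes every channel held by the old instance.
        for channel in self._open:
            channel.peer_closed = True
        self._open = []
        self.instance += 1


manager = ToyManager()
old = manager.bind()
manager.crash_and_restart()
try:
    old.call("read")
except PeerClosed:
    # The old channel stays dead; the client must ask for a new one.
    fresh = manager.bind()
    print(fresh.call("read"))  # instance 2 handled read
```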

Lesson for capOS

  • The Fuchsia model confirms that the clean contract for server death in a capability system is channel close / peer-closed on all outstanding client channels. capOS should emit a Disconnected CQE to every caller that has a pending request or open session to a server that dies.
  • There is no implicit re-connect: the client must explicitly re-acquire a new capability to the restarted service. Stale caps acquired before the crash must not be silently re-animated after restart.

5. Microkernel Precedent: seL4 and Genode

seL4

seL4 provides no built-in mechanism to notify a client when the process that holds an endpoint dies. A thread fault (capability fault, VM fault, etc.) triggers the thread’s configured fault endpoint, which notifies a designated fault-handler process. The fault handler can fix and resume, or terminate the faulting thread. However, this is per-thread fault delivery — not a general “server died, notify clients” mechanism.

If a server process is killed (all its capabilities revoked, its CNode destroyed), outstanding seL4_Call callers remain blocked on the endpoint permanently unless the endpoint object itself is also destroyed or a reply capability is used. seL4 has no automatic dead-server notification for waiting callers. Building supervision requires explicit userspace monitors (e.g., a watchdog thread with a notification capability polled by the supervisor).

Genode

Genode’s component model gives the parent ultimate control over its children. When a component is destroyed (whether intentionally by the parent or due to a crash), the kernel invalidates all capabilities whose associated RPC object is destroyed, as a direct side effect of object destruction. Subsequent invocations of those capabilities by other components produce an Ipc_error exception at the call site.

The parent observes a graceful exit via the exit() RPC on the parent interface; it receives no explicit crash notification from the kernel. Detecting unexpected death requires the parent to poll state reports or use the heartbeat mechanism in Genode’s init component, which tracks skipped_heartbeats per monitored child.

Lesson for capOS

  • seL4’s silence-on-server-death confirms the gap: callers must not be silently blocked forever when a server dies. capOS must deliver a Disconnected CQE (or equivalent transport-level error) to every pending caller when the server capability is revoked or the process exits.
  • Genode’s implicit capability invalidation on object destruction is the right kernel primitive: the kernel, not userspace, ensures no stale cap can reach a destroyed object. capOS already has this via CapTable revocation.
  • Active death notification to a supervisor capability (rather than polling) is the correct extension — analogous to OTP process monitors.
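
The two notification paths in these lessons (error completions for pending callers, active death notification for supervisors) can be sketched together. This is a toy Python model under stated assumptions, not capOS code: ToyKernel and its methods are hypothetical, and a completion entry is reduced to a (caller, status) pair.

```python
class ToyKernel:
    """Toy model: tracks in-flight calls and death monitors per server."""

    def __init__(self):
        self.pending = {}      # server id -> caller ids with in-flight requests
        self.monitors = {}     # server id -> supervisor callbacks
        self.completions = []  # (caller id, status) completion entries

    def submit_call(self, caller, server):
        self.pending.setdefault(server, []).append(caller)

    def monitor(self, server, callback):
        self.monitors.setdefault(server, []).append(callback)

    def process_exit(self, server, reason):
        # Every pending caller gets an error completion instead of blocking.
        for caller in self.pending.pop(server, []):
            self.completions.append((caller, "Disconnected"))
        # Supervisors are told actively, with no polling.
        for callback in self.monitors.pop(server, []):
            callback(server, reason)


kernel = ToyKernel()
deaths = []
kernel.monitor("fs", lambda server, reason: deaths.append((server, reason)))
kernel.submit_call("client-a", "fs")
kernel.submit_call("client-b", "fs")
kernel.process_exit("fs", "page-fault")
print(kernel.completions)  # [('client-a', 'Disconnected'), ('client-b', 'Disconnected')]
print(deaths)              # [('fs', 'page-fault')]
```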

6. Coredump and Minidump: Capture and Redaction

Core dumps contain a complete snapshot of a process’s address space at the time of the crash. The Linux kernel writes them via core_pattern; systemd routes them through systemd-coredump running as a socket-activated service to enforce access controls and journaling.

The primary security concern is that capability keys, cryptographic material, and user credentials present in process memory at crash time are written verbatim to the dump file. systemd-coredump stores dumps in a mode readable only by root and the process owner, but it provides no built-in redaction of sensitive memory regions. Disabling core dumps (ulimit -c 0) for security-sensitive services is the common mitigation.

Two recent vulnerabilities (CVE-2025-4598 in systemd-coredump and CVE-2025-5054 in Apport) demonstrate that race conditions in coredump handlers can let a local attacker read the core dump of a crashed privileged process, exposing sensitive memory contents.

Lesson for capOS

  • A capability OS dump is structurally more dangerous than a POSIX dump: the crashed process’s CapTable may contain live capabilities to kernel resources that the dump reader does not possess. Dumping capability indices without revocation could allow replay.
  • The correct policy on process crash is to revoke all capabilities of the crashed process before writing any dump — the kernel holds the only authoritative revocation path. A dump tool operating post-revocation sees only dead cap indices, not live authority.
  • Memory regions tagged as containing key material (capability ring buffers, decrypted secrets) should be excluded from dumps; a MADV_DONTDUMP analogue applied to sensitive pages at allocation time is the mechanism.
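
The ordering described above (revoke first, then dump, skipping tagged regions) can be sketched as follows. This is an illustrative Python model, not capOS code: Region, Process, write_dump, and the dont_dump flag are hypothetical stand-ins for the kernel structures and the MADV_DONTDUMP analogue.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    data: bytes
    dont_dump: bool = False  # set at allocation time for sensitive pages


@dataclass
class Process:
    cap_table: dict  # cap index -> "live" or "revoked"
    regions: list


def write_dump(process):
    """Revoke all capabilities, then snapshot only the dumpable regions."""
    # Step 1: revoke before anything is written, so the dump holds no authority.
    for index in process.cap_table:
        process.cap_table[index] = "revoked"
    # Step 2: copy memory, leaving out regions tagged as sensitive.
    return {
        "caps": dict(process.cap_table),
        "regions": {r.name: r.data for r in process.regions if not r.dont_dump},
    }


proc = Process(
    cap_table={0: "live", 1: "live"},
    regions=[Region("heap", b"state"), Region("keys", b"secret", dont_dump=True)],
)
dump = write_dump(proc)
print(dump["caps"])             # {0: 'revoked', 1: 'revoked'}
print(sorted(dump["regions"]))  # ['heap']
```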

Applicability to capOS

Across all surveyed systems, four design invariants recur:

  1. Crash-loop budget. Every production supervisor bounds restart churn: OTP (intensity/period) and systemd (StartLimitBurst/StartLimitIntervalSec) cap restarts per time window, while Kubernetes throttles them with capped exponential backoff (CrashLoopBackOff). capOS service manifests should carry a maxRestarts + restartWindowSecs budget; on exhaustion the supervisor enters a degraded-boot state rather than spinning.

  2. Dead-server notification is the kernel’s job. seL4 and Genode show what happens when the kernel stays silent: on seL4, callers can block forever; on Genode, callers receive an Ipc_error but the parent gets no crash notification and must poll or rely on heartbeats. capOS must emit a Disconnected CQE to pending callers when a server’s capability is revoked, and must revoke server capabilities atomically on process exit.

  3. No stale authority after restart. A restarted service gets new capabilities — it does not inherit the pre-crash CapTable. Clients must re-acquire capabilities to the new instance. The Fuchsia model (fresh channel on new binding) and OTP model (new process Pid, old monitors fire DOWN) both enforce this.

  4. Watchdog caps complement passive monitoring. systemd’s WatchdogSec and Genode’s heartbeat mechanism both address live-lock states that produce no crash signal. A watchdog capability that the service must renew periodically is the capOS translation: if the service fails to renew, the supervisor kills and restarts it.


Sources

  • Erlang OTP Supervisor Behaviour: https://www.erlang.org/doc/system/sup_princ.html
  • Erlang stdlib supervisor module: https://www.erlang.org/doc/apps/stdlib/supervisor.html
  • systemd.service(5) man page (Debian): https://manpages.debian.org/jessie/systemd/systemd.service.5.en.html
  • Kubernetes Pod Lifecycle: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
  • Kubernetes Probes: https://kubernetes.io/docs/concepts/configuration/liveness-readiness-startup-probes/
  • Fuchsia Component Lifecycle: https://fuchsia.dev/fuchsia-src/concepts/components/v2/lifecycle
  • Fuchsia Exception Handling: https://fuchsia.dev/fuchsia-src/concepts/kernel/exceptions
  • Fuchsia Component Runner FIDL: https://fuchsia.dev/reference/fidl/fuchsia.component.runner
  • seL4 Fault Handlers: https://docs.sel4.systems/Tutorials/fault-handlers.html
  • seL4 IPC Tutorial: https://docs.sel4.systems/Tutorials/ipc.html
  • Genode Recursive System Structure: https://genode.org/documentation/genode-foundations/20.05/architecture/Recursive_system_structure.html
  • Genode Init Component: https://genode.org/documentation/genode-foundations/21.05/system_configuration/The_init_component.html
  • systemd-coredump documentation: https://systemd.io/COREDUMP/
  • CVE-2025-4598 systemd-coredump analysis: https://blogs.oracle.com/linux/analysis-of-cve-2025-4598
  • Core dump security (Kicksecure): https://www.kicksecure.com/wiki/Core_Dumps