Crash Recovery and Supervision: Prior-Art Survey
Survey of crash recovery, supervision, and failure propagation patterns across production systems. Used as input for the capOS Crash Recovery proposal.
1. Erlang/OTP Supervision Trees
Erlang/OTP is the canonical prior art for declarative crash recovery in a capability-shaped process model.
Supervision strategies
A supervisor declares one of four restart strategies:
- one_for_one: only the crashed child is restarted; siblings are unaffected.
- one_for_all: when any child crashes, every child is terminated and then every child is restarted. Used when children have shared state.
- rest_for_one: the crashed child and all children started after it (in declaration order) are terminated and restarted. Used when later children depend on earlier ones.
- simple_one_for_one: a simplified one_for_one for dynamically added homogeneous workers.
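As an illustration, the set of children each strategy restarts can be sketched in Python. This is a toy model, not OTP code: the function name children_to_restart and the list-of-names representation are assumptions made for this example.

```python
def children_to_restart(strategy, children, crashed):
    """Return the children a supervisor would terminate and restart.

    `children` lists child names in declaration (start) order.
    All names here are hypothetical; this only models the selection rule.
    """
    if crashed not in children:
        raise ValueError(f"unknown child: {crashed!r}")
    if strategy in ("one_for_one", "simple_one_for_one"):
        # Only the crashed child is affected.
        return [crashed]
    if strategy == "one_for_all":
        # Every child is terminated and restarted.
        return list(children)
    if strategy == "rest_for_one":
        # The crashed child plus everything declared after it.
        return children[children.index(crashed):]
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With children declared as ["db", "cache", "web"], a crash of "cache" under rest_for_one selects "cache" and "web", leaving "db" running.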
Restart intensity
Supervisors carry an intensity (maximum restart count) and a period (window
length in seconds). If more than intensity restarts occur within any
period-second window, the supervisor terminates all children and then itself,
escalating the failure to its own parent supervisor. The defaults are
intensity = 1 and period = 5; that is, a second restart within five seconds
makes the supervisor give up.
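The intensity/period rule amounts to a sliding-window counter. The sketch below is a Python illustration under stated assumptions: the class name RestartBudget and the caller-supplied clock are hypothetical, and the exact boundary handling in OTP's own implementation may differ.

```python
from collections import deque


class RestartBudget:
    """Sliding-window restart counter in the style of OTP intensity/period.

    Hypothetical helper for illustration; not part of OTP or capOS.
    """

    def __init__(self, intensity=1, period=5.0):
        self.intensity = intensity  # restarts tolerated per window
        self.period = period        # window length in seconds
        self._restarts = deque()    # timestamps of recent restarts

    def record_restart(self, now):
        """Record a restart at time `now`; return False once the budget is exceeded."""
        self._restarts.append(now)
        # Forget restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] > self.period:
            self._restarts.popleft()
        return len(self._restarts) <= self.intensity
```

A supervisor would call record_restart on every child restart and escalate to its parent the first time it returns False.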
Each child spec declares a restart type:
- permanent: always restarted.
- transient: restarted only on abnormal exit (exit reason other than normal, shutdown, or {shutdown, Term}).
- temporary: never restarted.
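The three restart types reduce to a small decision function. The following Python sketch is illustrative only: otp_should_restart and is_normal_exit are hypothetical names, and the Erlang tuple {shutdown, Term} is modelled as a two-element Python tuple.

```python
def is_normal_exit(reason):
    """True for the exit reasons treated as normal termination."""
    if reason in ("normal", "shutdown"):
        return True
    # Models the Erlang tuple {shutdown, Term}.
    return isinstance(reason, tuple) and len(reason) == 2 and reason[0] == "shutdown"


def otp_should_restart(restart_type, reason):
    """Decide whether a child with the given restart type is restarted."""
    if restart_type == "permanent":
        return True
    if restart_type == "temporary":
        return False
    if restart_type == "transient":
        # Restarted only when the exit was abnormal.
        return not is_normal_exit(reason)
    raise ValueError(f"unknown restart type: {restart_type!r}")
```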
“Let it crash”
The design philosophy is to avoid defensive error-handling at the crash site. A process that encounters an unexpected condition should exit cleanly, relying on its supervisor to restart it in a known-good state. Error recovery code introduces its own bugs; a clean restart from a known-good init is safer.
Linked processes propagate EXIT signals bidirectionally. A supervisor traps
exits (process_flag(trap_exit, true)) and converts them to ordinary messages
{'EXIT', Pid, Reason}, allowing it to react rather than crash itself. Monitors
(erlang:monitor/2) give a unidirectional {'DOWN', Ref, process, Pid, Reason}
without the bidirectional link risk.
Lesson for capOS
- Restart budgets (intensity + period) translate directly: the kernel service supervisor should maintain a crash-loop budget — max N restarts per T seconds — and escalate to a parent authority or enter degraded boot if exceeded.
- The three child restart types (permanent/transient/temporary) match the restart policy field a capOS service manifest would declare.
- “Let it crash” applies: a capability server that encounters an unexpected decode error or illegal state should exit rather than continue with corrupted internal state. The supervisor restarts it; stale client caps observe a Disconnected CQE before the server is live again.
2. systemd Service Recovery
systemd is the dominant Linux service supervisor. Its restart model is policy-driven, external to the service.
Restart= modes
The Restart= directive accepts: no (default), on-success, on-failure,
on-abnormal, on-watchdog, on-abort, or always.
- on-failure covers non-zero exit codes, signals (including core dump), and watchdog timeout; it is the common production choice.
- on-abnormal covers signals, operation timeouts, and watchdog, but not non-zero exit codes.
- always restarts unconditionally.
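The mode-to-outcome mapping can be written as a lookup table. The Python sketch below is transcribed from the restart table in systemd.service(5) as best understood here and should be checked against the systemd version in use; the outcome labels ("clean", "unclean_exit_code", and so on) and the function name systemd_should_restart are assumptions for this example, not systemd identifiers.

```python
# Which exit outcomes trigger a restart under each Restart= mode.
# Outcome labels are hypothetical names chosen for this sketch.
RESTART_ON = {
    "no": set(),
    "on-success": {"clean"},
    "on-failure": {"unclean_exit_code", "unclean_signal", "timeout", "watchdog"},
    "on-abnormal": {"unclean_signal", "timeout", "watchdog"},
    "on-abort": {"unclean_signal"},
    "on-watchdog": {"watchdog"},
    "always": {"clean", "unclean_exit_code", "unclean_signal", "timeout", "watchdog"},
}


def systemd_should_restart(mode, outcome):
    """Return True if a service with Restart=`mode` restarts after `outcome`."""
    return outcome in RESTART_ON[mode]
```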
Timing
RestartSec (default 100 ms) is the delay before a restart attempt. It is
not a backoff — it is a flat delay between each attempt.
Crash-loop budget
StartLimitIntervalSec (default 10 s) and StartLimitBurst (default 5) form
the crash-loop budget: more than StartLimitBurst starts within
StartLimitIntervalSec causes further starts to be refused, so automatic
restarts stop and the unit remains failed until the counter is reset manually
(systemctl reset-failed), the unit is started again after the interval has
passed, or the system reboots. This is the systemd analogue of OTP
intensity/period.
Dependency cascades
OnFailure= lists units to activate when a service enters the failed state;
it is typically used to run a notification or diagnostic unit.
Watchdog
WatchdogSec enables a software watchdog: the service must call
sd_notify(0, "WATCHDOG=1") at intervals shorter than WatchdogSec. If the
heartbeat is absent for the full interval, systemd kills and (if Restart=
includes watchdog triggers) restarts the service. This catches live-lock and
hang states that do not produce a crash signal.
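The heartbeat contract can be modelled as a deadline that each ping pushes forward. This Python sketch is a simulation with an explicit clock, not systemd code: the Watchdog class and its method names are hypothetical.

```python
class Watchdog:
    """Deadline tracker: the service must ping more often than `timeout`.

    Hypothetical model; times are plain seconds supplied by the caller.
    """

    def __init__(self, timeout, now):
        self.timeout = timeout
        self.last_ping = now

    def ping(self, now):
        """Called by the service, analogous to sending WATCHDOG=1."""
        self.last_ping = now

    def expired(self, now):
        """True once a full interval has passed without a ping."""
        return now - self.last_ping >= self.timeout
```

A supervisor loop would poll expired() and, when it returns True, kill the service and apply the restart policy. A hung service that never crashes is caught because it simply stops calling ping().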
Lesson for capOS
- A capability service watchdog translates to a periodic sd_notify-style ping to a watchdog capability. If the server does not renew within a budget, the supervisor sends SIGKILL (or the kernel analogue) and restarts.
- The crash-loop budget (StartLimitIntervalSec/StartLimitBurst) is the second time this pattern appears, reinforcing that a fixed restart budget per time window is the correct primitive.
- RestartSec (flat delay, not exponential) is simpler than Kubernetes backoff and appropriate for always-available system services.
3. Kubernetes: Probes and CrashLoopBackOff
Kubernetes separates health probes (liveness, readiness, startup) from the container restart policy, giving operators fine-grained control.
Probes
- Liveness probe: if it fails, kubelet kills the container and subjects it to the restart policy. Used to detect live-lock (process alive, making no progress).
- Readiness probe: if it fails, the pod’s IP is removed from all matching Service EndpointSlices. No restart is triggered; the pod stays up but receives no traffic.
- Startup probe: disables liveness and readiness probes until it succeeds, giving slow-starting containers time to initialize without being killed prematurely.
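How the three probes combine can be sketched as a small decision function. This Python model is a simplification under stated assumptions: probe_actions and its boolean inputs are hypothetical, failure thresholds are ignored, and a startup probe that never succeeds (which kubelet would eventually treat as a failure) is not modelled.

```python
def probe_actions(startup_done, liveness_ok, readiness_ok):
    """Return what the platform does given the current probe results.

    Simplified model: each argument is the latest verdict of one probe.
    """
    if not startup_done:
        # Liveness and readiness are not evaluated yet; no traffic, no kill.
        return {"restart": False, "route_traffic": False}
    if not liveness_ok:
        # Container is killed and handed to the restart policy.
        return {"restart": True, "route_traffic": False}
    # Alive: readiness alone decides whether traffic is routed.
    return {"restart": False, "route_traffic": readiness_ok}
```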
RestartPolicy
Always, OnFailure, or Never. With Always or OnFailure, a failed
container is restarted with exponential backoff: 10 s, 20 s, 40 s, … capped
at 5 minutes. If the container runs successfully for 10 minutes, the backoff
counter resets.
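The schedule described above is a capped doubling. The Python sketch below reproduces that arithmetic; the function names and the 1-based restart counter are assumptions for this example, and kubelet's internal bookkeeping may differ in detail.

```python
def backoff_delay(restart_count, base=10.0, cap=300.0):
    """Delay in seconds before restart number `restart_count` (1-based)."""
    return min(base * 2 ** (restart_count - 1), cap)


def next_restart_count(restart_count, ran_for, reset_after=600.0):
    """Restart counter to use after a container that ran `ran_for` seconds exits.

    A run of at least `reset_after` seconds counts as healthy and resets
    the backoff to its first step.
    """
    return 1 if ran_for >= reset_after else restart_count + 1
```

The first six delays are 10, 20, 40, 80, 160, and 300 seconds; the sixth would be 320 s uncapped and is clamped to the 5-minute ceiling.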
CrashLoopBackOff
When the restart backoff delay is active and the pod is waiting before the
next attempt, the pod status shows CrashLoopBackOff. It is not a terminal
state — the pod will still be restarted — but it indicates the container is
stuck in a restart loop and kubelet is applying backoff.
Lesson for capOS
- The readiness/liveness split maps cleanly: a capOS service can expose two status indicators — “alive” (process is running and heartbeating) and “ready” (service is accepting new capability requests). Supervisors and routing layers can use them independently.
- Exponential backoff with a cap (10 s → 5 min) and a reset window (10 min healthy) is appropriate for user-facing services that should self-heal but not spin continuously.
- The startup probe concept is relevant for services whose init phase takes longer than the steady-state heartbeat budget.
4. Fuchsia Component Framework
Fuchsia’s Component Framework manages component lifecycles and capability routing between components.
Lifecycle states
A component instance progresses through: Created → Resolved → Started → Stopped → (Shutdown) → Destroyed. Stopping a component preserves its persistent state; destroying it removes that state entirely.
Client observation of a crashed component
When a Fuchsia component crashes, the kernel pauses the faulting thread and
delivers a message to registered exception channels. The component’s process
is killed (as if via zx_task_kill()), which closes all Zircon channels held
by that process. Clients observing those channels receive
ZX_CHANNEL_PEER_CLOSED. Component manager receives ZX_CHANNEL_PEER_CLOSED
on the runner channel for the component, allowing it to detect and log the
crash.
Clients that were bound to a crashed component’s exposed protocol channels
also observe ZX_CHANNEL_PEER_CLOSED. Component manager then handles
restarting the component (if configured). A new binding request after restart
provides a fresh channel — there is no automatic reconnection of the
pre-crash channel.
Lesson for capOS
- The Fuchsia model confirms that the clean contract for server death in a capability system is channel close / peer-closed on all outstanding client channels. capOS should emit a Disconnected CQE to every caller that has a pending request or open session to a server that dies.
- There is no implicit re-connect: the client must explicitly re-acquire a new capability to the restarted service. Stale caps acquired before the crash must not be silently re-animated after restart.
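One way to picture the no-reanimation rule is a generation counter checked on every call. The Python sketch below is purely illustrative and does not describe how capOS or Fuchsia implement it: Service, Cap, and the Disconnected exception are hypothetical names.

```python
class Disconnected(Exception):
    """Raised when a capability refers to a server instance that has died."""


class Service:
    """Toy service whose generation increases on every restart."""

    def __init__(self):
        self.generation = 0

    def restart(self):
        self.generation += 1

    def acquire(self):
        """Hand out a capability bound to the current instance."""
        return Cap(self, self.generation)


class Cap:
    def __init__(self, service, generation):
        self.service = service
        self.generation = generation

    def call(self, request):
        # A cap minted before the restart never reaches the new instance.
        if self.generation != self.service.generation:
            raise Disconnected("server instance is gone; re-acquire the capability")
        return f"handled {request}"
```

A client holding a pre-crash Cap gets Disconnected on its next call and must call acquire() again, which mirrors the fresh-channel-on-new-binding behaviour described above.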
5. Microkernel Precedent: seL4 and Genode
seL4
seL4 provides no built-in mechanism to notify a client when the process that holds an endpoint dies. A thread fault (capability fault, VM fault, etc.) triggers the thread’s configured fault endpoint, which notifies a designated fault-handler process. The fault handler can fix and resume, or terminate the faulting thread. However, this is per-thread fault delivery — not a general “server died, notify clients” mechanism.
If a server process is killed (all its capabilities revoked, its CNode
destroyed), outstanding seL4_Call callers remain blocked on the endpoint
permanently unless the endpoint object itself is also destroyed or a reply
capability is used. seL4 has no automatic dead-server notification for
waiting callers. Building supervision requires explicit userspace monitors
(e.g., a watchdog thread with a notification capability polled by the
supervisor).
Genode
Genode’s component model gives the parent ultimate control over its children.
When a component is destroyed (whether intentionally by the parent or due to a
crash), the kernel invalidates all capabilities whose associated RPC object is
destroyed, as a direct side effect of object destruction. Subsequent invocations
of those capabilities by other components produce an Ipc_error exception at the
call site.
The parent observes a graceful exit via the exit() RPC on the parent
interface; it receives no explicit crash notification from the kernel. Detecting
unexpected death requires the parent to poll state reports or use the heartbeat
mechanism in Genode’s init component, which tracks skipped_heartbeats per
monitored child.
Lesson for capOS
- seL4’s silence-on-server-death confirms the gap: callers must not be silently blocked forever when a server dies. capOS must deliver a Disconnected CQE (or equivalent transport-level error) to every pending caller when the server capability is revoked or the process exits.
- Genode’s implicit capability invalidation on object destruction is the right kernel primitive: the kernel, not userspace, ensures no stale cap can reach a destroyed object. capOS already has this via CapTable revocation.
- Active death notification to a supervisor capability (rather than polling) is the correct extension, analogous to OTP process monitors.
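The two obligations in these lessons (complete every pending call with an error, and actively notify the supervisor) can be shown in a toy model. This Python sketch is not capOS kernel code: ToyKernel, its fields, and the "Disconnected" status string are hypothetical stand-ins.

```python
class ToyKernel:
    """Minimal model of dead-server handling; all names are illustrative."""

    def __init__(self):
        self.pending = {}        # server name -> callers blocked on it
        self.completions = []    # (caller, status) entries delivered to callers
        self.death_notices = []  # servers reported to the supervisor

    def call(self, caller, server):
        """Record that `caller` has an outstanding request to `server`."""
        self.pending.setdefault(server, []).append(caller)

    def server_exited(self, server):
        """Handle server death: fail pending calls, then notify the supervisor."""
        for caller in self.pending.pop(server, []):
            # No caller is left blocked: each gets an explicit error completion.
            self.completions.append((caller, "Disconnected"))
        # Push notification, so the supervisor does not need to poll.
        self.death_notices.append(server)
```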
6. Coredump and Minidump: Capture and Redaction
Core dumps contain a complete snapshot of a process’s address space at the
time of the crash. The Linux kernel writes them via core_pattern; systemd
routes them through systemd-coredump running as a socket-activated service
to enforce access controls and journaling.
The primary security concern is that capability keys, cryptographic material,
and user credentials present in process memory at crash time are written
verbatim to the dump file. systemd-coredump stores dumps in a mode readable
only by root and the process owner, but it provides no built-in redaction of
sensitive memory regions. Disabling core dumps (ulimit -c 0) for
security-sensitive services is the common mitigation.
Two vulnerabilities disclosed in 2025 (CVE-2025-4598 in systemd-coredump and CVE-2025-5054 in Apport) show that race conditions in coredump handlers can let a local attacker read sensitive memory from the dumps of crashed privileged processes.
Lesson for capOS
- A capability OS dump is structurally more dangerous than a POSIX dump: the crashed process’s CapTable may contain live capabilities to kernel resources that the dump reader does not possess. Dumping capability indices without revocation could allow replay.
- The correct policy on process crash is to revoke all capabilities of the crashed process before writing any dump — the kernel holds the only authoritative revocation path. A dump tool operating post-revocation sees only dead cap indices, not live authority.
- Memory regions tagged as containing key material (capability ring buffers, decrypted secrets) should be excluded from dumps; a MADV_DONTDUMP analogue applied to sensitive pages at allocation time is the mechanism.
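The ordering these lessons call for (revoke first, then dump, skipping tagged regions) can be sketched as follows. This Python model is an assumption-laden illustration, not a capOS interface: Region, the dont_dump flag, and write_crash_dump are hypothetical names, and the cap table is modelled as a plain dict.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    data: bytes
    dont_dump: bool = False  # set at allocation time for sensitive pages


def write_crash_dump(cap_table, regions):
    """Build a dump for a crashed process, revoking its authority first."""
    # Step 1: revoke every capability, so nothing written below is live.
    for slot in cap_table:
        cap_table[slot] = None
    # Step 2: record only dead slot indices and non-sensitive memory.
    return {
        "cap_slots": sorted(cap_table),
        "memory": {r.name: r.data for r in regions if not r.dont_dump},
    }
```

A dump reader then sees which slots existed, but none of them refers to live authority, and regions marked dont_dump never reach the dump at all.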
Applicability to capOS
Across all surveyed systems, four design invariants recur:
- Crash-loop budget. Every production supervisor limits restarts per time window (OTP intensity/period; systemd StartLimitBurst/StartLimitIntervalSec; Kubernetes CrashLoopBackOff). capOS service manifests should carry a maxRestarts + restartWindowSecs budget; on exhaustion the supervisor enters a degraded-boot state rather than spinning.
- Dead-server notification is the kernel’s job. seL4 and Genode both demonstrate what happens when the kernel is silent: callers block forever or receive opaque errors. capOS must emit a Disconnected CQE to pending callers when a server’s capability is revoked, and must revoke server capabilities atomically on process exit.
- No stale authority after restart. A restarted service gets new capabilities; it does not inherit the pre-crash CapTable. Clients must re-acquire capabilities to the new instance. The Fuchsia model (fresh channel on new binding) and OTP model (new process Pid, old monitors fire DOWN) both enforce this.
- Watchdog caps complement passive monitoring. systemd’s WatchdogSec and Genode’s heartbeat mechanism both address live-lock states that produce no crash signal. A watchdog capability that the service must renew periodically is the capOS translation: if the service fails to renew, the supervisor kills and restarts it.
Sources
- Erlang OTP Supervisor Behaviour: https://www.erlang.org/doc/system/sup_princ.html
- Erlang stdlib supervisor module: https://www.erlang.org/doc/apps/stdlib/supervisor.html
- systemd.service(5) man page (Debian): https://manpages.debian.org/jessie/systemd/systemd.service.5.en.html
- Kubernetes Pod Lifecycle: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
- Kubernetes Probes: https://kubernetes.io/docs/concepts/configuration/liveness-readiness-startup-probes/
- Fuchsia Component Lifecycle: https://fuchsia.dev/fuchsia-src/concepts/components/v2/lifecycle
- Fuchsia Exception Handling: https://fuchsia.dev/fuchsia-src/concepts/kernel/exceptions
- Fuchsia Component Runner FIDL: https://fuchsia.dev/reference/fidl/fuchsia.component.runner
- seL4 Fault Handlers: https://docs.sel4.systems/Tutorials/fault-handlers.html
- seL4 IPC Tutorial: https://docs.sel4.systems/Tutorials/ipc.html
- Genode Recursive System Structure: https://genode.org/documentation/genode-foundations/20.05/architecture/Recursive_system_structure.html
- Genode Init Component: https://genode.org/documentation/genode-foundations/21.05/system_configuration/The_init_component.html
- systemd-coredump documentation: https://systemd.io/COREDUMP/
- CVE-2025-4598 systemd-coredump analysis: https://blogs.oracle.com/linux/analysis-of-cve-2025-4598
- Core dump security (Kicksecure): https://www.kicksecure.com/wiki/Core_Dumps