Proposal: Crash Recovery and Supervision
How capOS handles unplanned process failure: propagating the death to capability holders, recording a structured crash event, and restarting the service within a bounded policy — all without resurrecting stale authority.
Problem
Live upgrade covers the planned case: a supervisor
quiesces a running service, transfers state, retargets caps, and exits the old
process in a controlled sequence. Unplanned failure is different. A process
that panics, faults, or is killed by the kernel OOM path leaves no quiesce
call, no state handoff, and no ordered exit. The kernel marks the process
dead and epoch-bumps its caps, but nothing in the current model tells callers
what happened or gives the supervisor a policy-bounded path to respawn it.
The gaps are:
- Stale-cap observability. Callers holding a cap to the dead process receive disconnected errors at the transport level (the epoch-revocation path from Stage 6 is in place), but there is no structured CQE event that carries crash context or lets the caller distinguish a crash from a planned termination.
- Crash metadata capture. Panic location, fault address, and last SQE opcode are useful for operators but must not leak raw cap-table contents, local cap IDs, or buffer bytes, which would break the no-ambient-authority invariant.
- Bounded restart policy. Re-spawning a crashing service without a budget produces crash-loop amplification; re-spawning must use the same broker and manifest authority that the original spawn used, not an escalated path.
- Watchdog liveness. A process that hangs without crashing is not detected by crash handling alone.
- Degraded boot. If a critical service fails to start, the system needs a safe fallback rather than a silent hang.
This proposal fills these gaps without touching the live-upgrade protocol and without adding a god-object supervisor.
User Stories
- An operator running make run-smoke sees a structured crash record in the audit log when a demo service panics, not a silent stale-cap error.
- A client process calling a crashed server receives a disconnected-class CapException promptly; the process does not block indefinitely.
- Init restarts a failed service up to the configured failure budget, then stops and declares the service permanently failed rather than looping forever.
- A watchdog-registered service that hangs (no panic, no exit) is detected within its timeout and restarted under the same policy.
- If the network stack fails before a shell connects, the manifest-declared emergency shell starts instead of leaving the system unresponsive.
Design
Stale-Cap Propagation
When the kernel marks a process dead (panic, fault, or explicit terminate
without a prior clean exit), it performs the same epoch-bump it already does
for released caps. The existing disconnected value in ExceptionType
covers the transport error. The new addition is a death CQE: a
CapException { type: disconnected, message: "server-death" } delivered to
any process with an outstanding CALL SQE whose target belongs to the dead
process.
From the caller’s perspective an unplanned crash looks identical to a
force-mode live upgrade that did not reattach: the in-flight CALL returns
disconnected, epoch is bumped, and any subsequent CALL on that cap also
returns disconnected until the supervisor retargets the cap to a fresh
instance. No new CQE opcode is needed; the existing two-level error model
from Error Handling is sufficient.
Invariants:
- A disconnected CQE on an outstanding CALL must be delivered before the kernel recycles any frame that belonged to the dead process. Frame reuse ordering is the same constraint that applies to the force-mode live-upgrade path.
- A cap whose epoch has been bumped must never route a new CALL to the dead process’s address space, even transiently. The epoch check is a load fence on the per-cap generation counter before any ring dispatch.
- Endpoint client facets held by the dead process are revoked at the same epoch bump. Other processes’ client facets to the same endpoint are not affected — they route to the endpoint owner, not to the crashed client.
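The epoch invariant above can be sketched in a few lines of Rust. This is a minimal user-space model, not kernel code: the Target, Cap, and Dispatch names are hypothetical, and an atomic counter with acquire/release ordering stands in for the per-cap generation counter and its load fence.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Outcome of attempting to dispatch a CALL through a cap (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Dispatch {
    /// The cap's epoch matches the target; the CALL may be queued.
    Routed,
    /// The target died or was revoked; the caller sees `disconnected`.
    Disconnected,
}

/// Stand-in for the per-cap generation counter owned by the kernel.
pub struct Target {
    epoch: AtomicU64,
}

/// A caller-side cap that remembers the epoch it was minted under.
pub struct Cap {
    minted_epoch: u64,
}

impl Target {
    pub fn new() -> Self {
        Target { epoch: AtomicU64::new(0) }
    }

    /// Mint a cap bound to the current epoch.
    pub fn mint(&self) -> Cap {
        Cap { minted_epoch: self.epoch.load(Ordering::Acquire) }
    }

    /// Death path: bump the epoch so every previously minted cap goes stale.
    pub fn mark_dead(&self) {
        self.epoch.fetch_add(1, Ordering::Release);
    }

    /// Check the epoch before any ring dispatch; a stale cap never routes.
    pub fn dispatch(&self, cap: &Cap) -> Dispatch {
        if self.epoch.load(Ordering::Acquire) == cap.minted_epoch {
            Dispatch::Routed
        } else {
            Dispatch::Disconnected
        }
    }
}
```

In this model a cap minted before mark_dead keeps returning Disconnected on every later dispatch, which matches the caller-visible behavior described above until the supervisor retargets the cap.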
Crash Record Capture
When a process dies unplanned, the kernel appends a crash record to the
AuditLog cap held by the supervisor that spawned the process (not to a
global log visible to all processes). The record is structured to support
operator debugging without leaking internal kernel state:
# Proposed addition to schema/capos.capnp (Phase 1)
enum CrashKind {
  panic @0;               # Rust panic! path
  pageFault @1;           # unmapped or protection fault
  generalProtection @2;
  stackOverflow @3;
  illegalInstruction @4;
  kernelKill @5;          # explicit ProcessHandle.terminate
}

struct CrashRecord {
  processName @0 :Text;
  kind @1 :CrashKind;
  # Instruction pointer at death, relative to ELF load base.
  # Absolute virtual address is NOT included to avoid leaking
  # kernel-side layout or userspace ASLR seeds.
  faultOffsetInBinary @2 :UInt64;
  # Last SQE opcode dispatched for this process (0 = none in flight).
  lastSqeOpcode @3 :UInt8;
  # Session context ID of the process (opaque; matches AuditLog sessionId
  # for attribution without carrying cap-table or buffer contents).
  sessionContextId @4 :Data;
  # Monotonic kernel timestamp at death.
  timestampNs @5 :UInt64;
}
Fields explicitly not included: raw cap IDs, cap-table slot contents,
userspace buffer bytes, kernel heap pointers, or any data from the process’s
address space beyond the fault offset. The crash record is attributed to the
process’s session context ID so it can be correlated with prior AuditLog
records without exposing the full cap graph.
The crash record is delivered through the same AuditLog.record path the
hardware-audit service already uses: the supervisor holds the AuditLog cap;
the kernel invokes it on the supervisor’s ring (via a kernel-initiated RECV)
rather than on a shared global ring.
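As an illustration of the redaction rule, the sketch below mirrors the proposed CrashRecord in Rust and shows one way to derive the load-base-relative fault offset. The Rust type names, the fault_offset helper, and its image_len parameter are assumptions made for this example. Reporting 0 for an instruction pointer outside the loaded image is also an assumption; the proposal does not specify that case.

```rust
/// Mirrors the proposed CrashKind enum (Rust names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CrashKind {
    Panic,
    PageFault,
    GeneralProtection,
    StackOverflow,
    IllegalInstruction,
    KernelKill,
}

/// Mirrors the proposed CrashRecord: no cap IDs, no buffer bytes,
/// no absolute addresses.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CrashRecord {
    pub process_name: String,
    pub kind: CrashKind,
    pub fault_offset_in_binary: u64,
    pub last_sqe_opcode: u8,
    pub session_context_id: Vec<u8>,
    pub timestamp_ns: u64,
}

/// Convert an absolute instruction pointer into an offset from the ELF
/// load base. Assumption: an IP below the base or past the end of the
/// image is reported as 0 so that no absolute address is ever recorded.
pub fn fault_offset(ip: u64, load_base: u64, image_len: u64) -> u64 {
    match ip.checked_sub(load_base) {
        Some(off) if off < image_len => off,
        _ => 0,
    }
}
```

A record built this way carries only the offset, so two boots with different load bases produce the same value for the same faulting instruction.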
Bounded Restart Policy
The supervisor that spawned a failed process owns the restart decision. The
restart budget is declared in the manifest’s initConfig.services entry and
interpreted by init (or a delegated supervisor):
// CUE representation (illustrative)
restart: {
    policy:            "on-failure" // never | on-failure | always
    maxRestarts:       5            // total budget over the window
    windowSecs:        60           // sliding window for the budget
    backoffBase:       "1s"         // initial delay before first restart
    backoffMax:        "30s"        // ceiling on exponential backoff
    emergencyFallback: "shell"      // service name to promote if budget exhausted
}
Backoff is bounded and service-class aware. The exponential
backoffBase→backoffMax schedule suits user-facing services that should
self-heal without spinning (the Kubernetes CrashLoopBackOff lesson). For
always-available system services, the prior-art note’s systemd lesson favors a
short flat delay so a transient fault recovers fast; such services set
backoffBase == backoffMax for flat RestartSec-style behavior. In both cases
maxRestarts/windowSecs is the hard give-up budget (the OTP
max-restart-intensity lesson), so neither model spins forever.
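The two pacing models reduce to one formula: the delay before restart attempt n is min(backoffBase * 2^n, backoffMax). The sketch below assumes doubling per attempt and a 0-based attempt counter; neither detail is fixed by the proposal, and restart_delay is a hypothetical helper name.

```rust
use std::time::Duration;

/// Delay before restart attempt `attempt` (0-based): base * 2^attempt,
/// capped at `max`. Setting base == max yields a flat delay.
pub fn restart_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    // checked_pow and checked_mul guard against overflow; any overflow
    // is treated as "at least max", so the result saturates at the cap.
    let factor = 2u32.checked_pow(attempt);
    let delay = factor.and_then(|f| base.checked_mul(f)).unwrap_or(max);
    delay.min(max)
}
```

With backoffBase = 1s and backoffMax = 30s this yields 1s, 2s, 4s, 8s, 16s, then 30s for every later attempt. With both set to the same value, every attempt waits that value, which is the flat RestartSec-style behavior.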
Crash-loop detection. If the service consumes all maxRestarts restarts within
windowSecs, the supervisor stops restarting and records a budget-exhausted
event. The service is marked permanently failed until an operator issues an
explicit override through the ProcessHandle or re-spawns it via a fresh
manifest reload.
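A sliding-window budget of this kind can be sketched as follows. RestartBudget and on_failure are hypothetical names. The sketch assumes that maxRestarts counts restarts granted inside the window, so the failure that arrives after the last allowed restart is the one that trips the budget. It also assumes that a permanently failed service stays failed until the tracker is rebuilt, standing in for the operator override.

```rust
use std::collections::VecDeque;

/// Tracks restarts inside a sliding window (hypothetical supervisor state).
pub struct RestartBudget {
    max_restarts: usize,
    window_secs: u64,
    /// Monotonic timestamps (seconds) of restarts still inside the window.
    recent: VecDeque<u64>,
    permanently_failed: bool,
}

impl RestartBudget {
    pub fn new(max_restarts: usize, window_secs: u64) -> Self {
        RestartBudget {
            max_restarts,
            window_secs,
            recent: VecDeque::new(),
            permanently_failed: false,
        }
    }

    /// Called on each unplanned death. Returns true if a restart is allowed.
    pub fn on_failure(&mut self, now_secs: u64) -> bool {
        if self.permanently_failed {
            return false;
        }
        // Drop restarts that have aged out of the sliding window.
        while let Some(&oldest) = self.recent.front() {
            if now_secs.saturating_sub(oldest) >= self.window_secs {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        if self.recent.len() >= self.max_restarts {
            // Budget exhausted: stop restarting until an explicit override.
            self.permanently_failed = true;
            return false;
        }
        self.recent.push_back(now_secs);
        true
    }

    pub fn is_permanently_failed(&self) -> bool {
        self.permanently_failed
    }
}
```

Because old entries age out, a service that crashes occasionally never exhausts the budget; only a burst of failures inside one window does.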
Authority preservation. Each restart uses the original ProcessSpawner
call with the same CapGrant list that was used at initial spawn. The
supervisor does not invent new grants or escalate authority. If a grant source
was a SpawnGrantSource::Kernel DDF handle that is now invalidated (for
example, a DMA buffer whose owner quiesce failed), the restart fails closed
with a spawn-grant-invalid error rather than falling back to an ambient
grant.
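The fail-closed rule can be illustrated with a small check that runs before each restart. The CapGrant shape shown here, its source_valid flag, and the grants_for_restart helper are simplifications invented for this sketch; the real grant list and invalidation signal come from ProcessSpawner and the grant source.

```rust
/// Simplified stand-in for one entry of the original grant list.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CapGrant {
    pub name: String,
    /// False once the grant source has been invalidated
    /// (for example, a DMA buffer whose owner quiesce failed).
    pub source_valid: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RestartError {
    /// Corresponds to the spawn-grant-invalid error; carries the grant name.
    SpawnGrantInvalid(String),
}

/// Produce the grant list for a restart: exactly the original list,
/// or an error if any source is no longer valid. Nothing is added,
/// substituted, or silently dropped.
pub fn grants_for_restart(original: &[CapGrant]) -> Result<Vec<CapGrant>, RestartError> {
    for grant in original {
        if !grant.source_valid {
            return Err(RestartError::SpawnGrantInvalid(grant.name.clone()));
        }
    }
    Ok(original.to_vec())
}
```

The important property is that the only two outcomes are the unchanged original list or an error; there is no branch that widens or replaces a grant.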
No resurrection of stale caps. The restarted process receives a fresh cap
table. The supervisor must call CapRetarget (from
Live Upgrade) to re-point existing
client caps to the new process. If CapRetarget is not yet implemented,
clients observing disconnected must reconnect through the supervisor’s
exported endpoint, which the supervisor re-registers after restart.
Watchdog Capability
A service that can hang without crashing (blocked ring, infinite loop, deadlock on a kernel-held lock it does not own) is not detected by exit-path crash handling. The watchdog provides periodic liveness proof:
# Proposed future addition to schema/capos.capnp (Phase 3)
interface Watchdog {
  # Service calls this on every iteration of its main loop to
  # reset the deadline. If not called within `timeoutNs` of the
  # last kick (or of registration), the supervisor is notified.
  kick @0 () -> ();
  # Unregister. Safe to call during planned shutdown.
  cancel @1 () -> ();
}

interface WatchdogSource {
  # Register this process with the given timeout.
  # Returns a Watchdog the service holds and kicks.
  register @0 (processName :Text, timeoutNs :UInt64) -> (watchdogIndex :UInt16);
}
The supervisor grants a Watchdog cap (minted from a WatchdogSource it
holds) to each service it considers watchdog-registered. If the kernel timer
fires without a kick, the supervisor receives a liveness-failure notification
and treats it identically to an unplanned crash: crash record, restart budget
check, backoff.
The watchdog is an opt-in service-level contract, not a mandatory kernel
mechanism. Services that are inherently event-driven (blocked on cap_enter
waiting for an SQE) do not need a watchdog; they will return disconnected to
callers if they stop processing. Watchdog is primarily useful for services
with internal polling loops or external I/O not driven by the capOS ring.
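The deadline bookkeeping behind kick and cancel is small enough to sketch. WatchdogState is a hypothetical supervisor-side structure. The sketch assumes that a kick after cancel is ignored and that expiry is strict (more than timeoutNs since the last kick); the proposal leaves both details open.

```rust
/// Supervisor-side view of one registered watchdog (hypothetical).
pub struct WatchdogState {
    timeout_ns: u64,
    /// Time of the last kick (or of registration); None once cancelled.
    last_kick_ns: Option<u64>,
}

impl WatchdogState {
    /// Registration starts the first deadline.
    pub fn register(now_ns: u64, timeout_ns: u64) -> Self {
        WatchdogState { timeout_ns, last_kick_ns: Some(now_ns) }
    }

    /// Reset the deadline. Ignored after cancel (assumption).
    pub fn kick(&mut self, now_ns: u64) {
        if self.last_kick_ns.is_some() {
            self.last_kick_ns = Some(now_ns);
        }
    }

    /// Unregister; a cancelled watchdog never reports a liveness failure.
    pub fn cancel(&mut self) {
        self.last_kick_ns = None;
    }

    /// True if the timer check at `now_ns` should notify the supervisor.
    pub fn expired(&self, now_ns: u64) -> bool {
        match self.last_kick_ns {
            Some(last) => now_ns.saturating_sub(last) > self.timeout_ns,
            None => false,
        }
    }
}
```

A timer tick that finds expired returning true would feed the same crash record, budget check, and backoff path as an unplanned crash.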
Degraded Boot
The manifest may declare an emergency fallback service that is promoted when a critical service exhausts its restart budget before the system reaches a usable state:
// CUE (illustrative)
degradedBoot: {
    trigger:     "net-stack" // if this service fails permanently...
    fallback:    "shell"     // ...promote this service to interactive
    timeoutSecs: 30          // deadline from kernel handoff to readiness
}
The init process monitors service readiness. If a declared critical service fails to reach readiness within the timeout and has exhausted its restart budget, init spawns the fallback service with a console cap and an audit cap so an operator can inspect what failed. The fallback service is not granted the failed service’s caps; it is a scoped interactive shell, not a repair agent with escalated authority.
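Init's decision for one critical service can be written as a pure function. The sketch follows the reading in the paragraph above, where promotion requires both a missed readiness deadline and an exhausted restart budget. BootAction, degraded_boot_action, and the FALLBACK_GRANTS names are hypothetical.

```rust
/// What init should do for one critical service during the boot window.
#[derive(Debug, PartialEq, Eq)]
pub enum BootAction {
    /// The service reached readiness; boot continues normally.
    Proceed,
    /// Not ready yet, but the deadline or the restart budget still has room.
    KeepWaiting,
    /// Deadline missed and budget exhausted: spawn the fallback service.
    PromoteFallback,
}

/// The only caps the fallback shell receives (illustrative names).
/// It never inherits the failed service's grants.
pub const FALLBACK_GRANTS: [&str; 2] = ["console", "audit"];

pub fn degraded_boot_action(
    ready: bool,
    budget_exhausted: bool,
    elapsed_secs: u64,
    timeout_secs: u64,
) -> BootAction {
    if ready {
        return BootAction::Proceed;
    }
    if budget_exhausted && elapsed_secs >= timeout_secs {
        BootAction::PromoteFallback
    } else {
        BootAction::KeepWaiting
    }
}
```

Keeping the grant list for the fallback as a fixed constant makes the "scoped shell, not repair agent" rule easy to audit.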
Relevant Research and Prior Art
In-Tree Research Notes
- Crash Recovery and Supervision is the dedicated prior-art survey for this proposal: supervision trees, restart budgets (OTP intensity/period, systemd StartLimit, Kubernetes CrashLoopBackOff), dead-server notification (Fuchsia ZX_CHANNEL_PEER_CLOSED, seL4 silence, Genode Ipc_error), and coredump-redaction concerns, each verified against primary sources.
- OS Error Handling grounds the stale-cap surface. It records what callers observe when a server dies in comparable systems: Zircon channels close and the peer observes ZX_ERR_PEER_CLOSED; Genode capabilities to a dead server become invalid and subsequent invocations produce Ipc_error; seL4 routes faults to a per-thread fault endpoint; KeyKOS/EROS routes them to the domain keeper. The shared lesson is that a dead server must surface as a typed transport-level signal, not a hung invocation, which is exactly the disconnected death CQE this proposal specifies.
- Cap’n Proto Error Handling fixes the meaning of disconnected in the four-kind capnp model: “connection to a necessary capability was lost,” with the client response being re-establish-and-retry. This proposal reuses that existing classification rather than minting a new exception kind; the only addition is when the kernel emits it (unplanned death) and the paired CrashRecord.
- EROS, CapROS, Coyotos documents the EROS/KeyKOS keeper mechanism: a capability to a separate domain that the kernel invokes on fault, which can inspect, terminate, or restart the faulting domain (process supervision is an explicit listed use). capOS’s supervisor-owns-ProcessHandle model is the same shape with capnp typed methods instead of a keeper key, and the kernel never initiates the restart itself.
- Genode and seL4 ground the no-resurrected-authority invariant: Genode’s parent-supervised component tree with revocable capabilities, and seL4’s hierarchical delegation plus Revoke over the capability derivation tree, both establish that a restarted child gets fresh authority and that revocation of the dead instance’s caps is the supervisor’s (parent’s) responsibility, not an ambient lookup.
External Precedent and Lessons
- Erlang/OTP supervision trees. The “let it crash” philosophy plus supervisor restart strategies (one_for_one, one_for_all, rest_for_one) and max-restart-intensity (MaxR restarts within MaxT seconds, after which the supervisor itself terminates) are the direct precedent for this proposal’s per-service failure budget and crash-loop detection. The lesson: bound restarts over a sliding window and escalate (here: stop and mark permanently failed, optionally promote degraded boot) rather than loop forever.
- systemd unit restart policy. Restart=on-failure|always, RestartSec backoff, and StartLimitIntervalSec/StartLimitBurst are the precedent for the policy/backoffBase/maxRestarts/windowSecs fields. The lesson: separate the whether (policy) from the pacing (backoff) from the give-up threshold (burst limit).
- Kubernetes liveness/readiness probes and CrashLoopBackOff. Liveness probes (kubelet restarts a container that fails its probe) are the precedent for the Watchdog.kick/timeout design; readiness gating before promotion is the precedent for the degraded-boot readiness deadline; CrashLoopBackOff with exponential backoff is the precedent for capped exponential restart delay. The lesson: liveness is opt-in and orthogonal to crash detection, so a hung-but-not-dead process needs an explicit liveness signal.
- Fuchsia component lifecycle. Component-manager-driven start/stop and rebinding in the routing graph parallel capOS’s supervisor + CapRetarget reconnection. The dedicated research note above grounds Fuchsia’s death-observation behavior (ZX_CHANNEL_PEER_CLOSED, no implicit reconnect); a deeper write-up of Fuchsia component-manager restart and escrow semantics remains research-needed (per the docs/backlog/research-design-gaps.md convention) before this proposal cites specific escrow behavior as grounding.
Phasing
Phase 1 — Stale-cap DISCONNECTED propagation and crash record (most
model-critical). The death CQE for in-flight CALLs is the highest-priority
item because it closes the model gap: callers can observe server death as a
typed transport error rather than a hung ring. Crash record delivery to the
supervisor’s AuditLog is paired here because it uses the same kernel death
path. Requires: epoch-revocation from Stage 6 (done), AuditLog cap
(done), CrashRecord schema addition.
Phase 2 — Bounded restart policy and crash-loop detection. Init reads
restart budget fields from initConfig.services, applies exponential backoff,
and stops at budget exhaustion. Requires: Phase 1 crash record so init knows
whether a death was planned or unplanned; CapRetarget from live-upgrade Phase
1 to reconnect client caps after restart.
Phase 3 — Watchdog capability. WatchdogSource and Watchdog schema,
kernel timer integration, supervisor-side timeout detection, and liveness-failure
events fed into the same restart budget path as Phase 2.
Phase 4 — Degraded boot. Manifest parser reads degradedBoot fields;
init promotes the fallback service on budget exhaustion during the boot window.
Requires: Phase 2 budget tracking.
Hazards and Invariants
Frame reuse ordering. The kernel must not return frames from a dead
process’s address space to the frame allocator until all outstanding
disconnected CQEs for that process’s caps have been delivered. Violating
this could allow a concurrent FrameAlloc to map recycled memory into a new
process before the old process’s CQEs complete, creating a window where a
stale disconnected CQE arrives after the frame holds new data. The existing
DMA quiesce/scrub ordering in the DMA pool grant path is the model for this
constraint.
No stale authority after restart. A restarted process receives only the
grants declared in the original ProcessSpawner.spawn call. The supervisor
must not silently re-grant caps that were revoked as part of the death epoch
bump. In particular, any DMAPool-derived handle that was in active use at
crash time must be explicitly re-acquired through the grant-source path, not
recycled from the dead process’s cap table.
Restart does not bypass the authority broker. If the original spawn was
gated on an AuthorityBroker-selected session context, the restart uses the
same broker path. The supervisor cannot substitute a broader session context
or an anonymous context to make the restart succeed.
Capability revocation precedes any dump. The death epoch bump that invalidates the crashed process’s caps must complete before any crash record or future coredump is produced. A record produced post-revocation sees only dead cap indices, never live authority; a pre-revocation memory snapshot could otherwise capture live cap indices or ring-buffer contents (the race class behind recent coredump CVEs). Any future coredump extension must run only after revocation and must not be readable by unprivileged dump readers.
Crash record isolation. The crash record must not carry raw cap IDs, cap table slot numbers, or any data read from the process’s address space (stack contents, heap contents, message buffers). The fault offset is relative to the binary load base, not an absolute virtual address, to avoid leaking kernel layout or userspace address randomization.
Watchdog authority is narrow. A Watchdog cap proves liveness for exactly
one registered process. It does not grant the holder any access to the
supervisor, the process’s caps, or any other service. It is a pure liveness
signal, not an authority surface.
Relationship to Adjacent Proposals
- Live Upgrade: covers the planned case. The CapRetarget primitive defined there is consumed by Phase 2 of this proposal to reconnect client caps after an unplanned restart. The force-mode disconnected delivery and epoch-revocation paths are shared; this proposal adds the death CQE and crash record on top.
- Service Architecture: defines the supervisor tree and the RestartPolicy type currently parsed by init. This proposal extends that policy with the budget, backoff, and budget-exhaustion fields, and binds crash handling to the supervisor that owns the ProcessHandle, not to a global daemon.
- capos-service: defines the userspace service framework above capos-rt. The watchdog kick call and readiness notification in Phase 3 are natural additions to the service lifecycle hooks that capos-service abstracts.
- Error Handling: the disconnected class in the two-level error model is the transport surface for stale-cap delivery. This proposal does not add new error types; it specifies when and how disconnected is delivered for an unplanned death.
- System Monitoring: crash records, restart events, budget-exhaustion notifications, and watchdog timeouts are all audit-worthy. The monitoring proposal owns the operator visibility surface; this proposal defines the structured events that feed it.
- Resource Accounting and Quotas: the failure budget is a quota, a count consumed by crash events and refilled by the sliding window. The accounting model for this quota follows the same ledger-of-record pattern as memory and scheduling quotas.