Proposal: Crash Recovery and Supervision

How capOS handles unplanned process failure: propagating the death to capability holders, recording a structured crash event, and restarting the service within a bounded policy — all without resurrecting stale authority.

Problem

Live upgrade covers the planned case: a supervisor quiesces a running service, transfers state, retargets caps, and exits the old process in a controlled sequence. Unplanned failure is different. A process that panics, faults, or is killed by the kernel OOM path leaves no quiesce call, no state handoff, and no ordered exit. The kernel marks the process dead and epoch-bumps its caps, but nothing in the current model tells callers what happened or gives the supervisor a policy-bounded path to respawn it.

The gaps are:

Stale-cap observability. A caller with an outstanding CALL against a dying server was already completed rather than left hanging — the process teardown path cancels the dead process’s endpoint state — but the completion carried the generic CAP_ERR_INVOKE_FAILED, which the kernel also posts for many unrelated failures. The caller could not tell “my server died” from “this invocation failed”, and there was no crash record.

Status: partially closed by Phase 1. Two distinct questions hide in this gap, and Phase 1 answers only the first:
- “Is my server gone, so that retrying this cap is pointless?” — closed. The caller now gets the typed CAP_ERR_SERVER_DIED (see Stale-Cap Propagation).
- “Did it crash, or did it terminate in an orderly way?” — open. The CQE deliberately does not distinguish the two (both mean the same thing to a caller: re-establish), and the crash record that carries the distinction reaches only the kernel audit log, which no supervisor can read. A supervisor therefore still cannot tell a crash from a planned termination. Closing this needs the Phase 2 delivery of CrashRecord to the supervisor’s AuditLog cap.
Crash metadata capture. Panic location, fault address, and last SQE opcode are useful for operators but must not leak raw cap-table contents, local cap IDs, or buffer bytes, which would break the no-ambient-authority invariant.
Bounded restart policy. Re-spawning a crashing service without a budget produces crash-loop amplification; re-spawning must use the same broker and manifest authority that the original spawn used, not an escalated path.
Watchdog liveness. A process that hangs without crashing is not detected by crash handling alone.
Degraded boot. If a critical service fails to start, the system needs a safe fallback rather than a silent hang.

This proposal fills these gaps without touching the live-upgrade protocol and without adding a god-object supervisor.

User Stories

An operator sees a structured crash record in the audit log when a service dies unplanned, not a silent stale-cap error. (Phase 1 emits this record for CPL3 faults and ProcessHandle.terminate. A Rust panic! is deliberately not covered yet: the capos-rt panic handler exits through the ordinary exit syscall, which is a planned exit, so CrashKind::panic is never emitted. Routing panics into the unplanned-death path is capos-rt work.)
A client process calling a crashed server is promptly completed with a typed CAP_ERR_SERVER_DIED transport error rather than a generic invoke failure; the process does not block indefinitely.
Init restarts a failed service up to the configured failure budget, then stops and declares the service permanently failed rather than looping forever.
A watchdog-registered service that hangs (no panic, no exit) is detected within its timeout and restarted under the same policy.
If the network stack fails before a shell connects, the manifest-declared emergency shell starts instead of leaving the system unresponsive.

Design

Stale-Cap Propagation

Status: implemented (Phase 1). What follows describes what the kernel does today, and where it deliberately departs from this proposal’s original sketch.

When a process is torn down, the kernel cancels the endpoint state of every endpoint that process owned. Every queued and in-flight CALL against those endpoints belongs to some other process and can never be RETURNed, so each is completed with a death CQE carrying CAP_ERR_SERVER_DIED — a distinct ring error code (capos-config/src/ring.rs) meaning “the serving process is gone; re-establish rather than retry”.

From the caller’s perspective an unplanned crash looks the same as a force-mode live upgrade that did not reattach: the outstanding CALL completes with CAP_ERR_SERVER_DIED and the caller must reconnect through the supervisor’s exported endpoint rather than retry. No new CQE opcode is needed.

Deviation: a CQE result code, not a CapException in the caller’s buffer. This proposal originally specified delivering CapException { type: disconnected, message: "server-death" } into the caller’s result buffer. That was not implemented, and should not be. Writing a CapException into the caller’s result buffer from the server’s death path is a cross-process write into a buffer whose owner may already have dropped it — the exact use-after-free class that make run-endpoint-drop-inflight-uaf exists to prevent. The cancel path therefore posts only a CQE result code and never touches the caller’s buffer. The disconnected CapException remains the revocation path’s wire shape and is unchanged; server death is signalled by the CQE code alone.

Scope: owner death, not every cancellation. Only cancels caused by the endpoint owner’s process teardown carry CAP_ERR_SERVER_DIED. Endpoint state torn down while the owner is still alive — an owner-cap RELEASE, a thread exit, a capability revocation — keeps the pre-existing generic CAP_ERR_INVOKE_FAILED, because in those cases “the server died” is not the truth the caller can act on. Planned and unplanned owner death are not distinguished at this seam: from the caller’s side both mean the server is gone and the cap must be re-established. The planned/unplanned distinction is what the crash record below carries, and it is what a Phase 2 supervisor uses to decide whether to restart.
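A caller-side sketch of how the two result codes lead to different actions. The numeric values, the signed-result convention, and the `CallerAction` and `classify_completion` names are assumptions made for illustration; the real codes are defined in capos-config/src/ring.rs.

```rust
// Hypothetical values: the real constants live in capos-config/src/ring.rs.
pub const CAP_ERR_INVOKE_FAILED: i32 = -1;
pub const CAP_ERR_SERVER_DIED: i32 = -2;

/// What a caller does after a CALL completion (illustrative type).
#[derive(Debug, PartialEq, Eq)]
pub enum CallerAction {
    /// The call succeeded; use the result.
    Done,
    /// The serving process is gone: drop the cap and reconnect through
    /// the supervisor's exported endpoint. Retrying the same cap is pointless.
    Reestablish,
    /// Generic failure: the server may still be alive, so this is handled
    /// as an ordinary invocation error.
    HandleFailure,
}

/// Assumes non-negative results mean success and negative results are errors.
pub fn classify_completion(result: i32) -> CallerAction {
    match result {
        r if r >= 0 => CallerAction::Done,
        CAP_ERR_SERVER_DIED => CallerAction::Reestablish,
        _ => CallerAction::HandleFailure,
    }
}
```

The point of the typed code is that only the `Reestablish` arm needs to involve the supervisor; every other failure stays a local decision.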

Invariants:

A death CQE on an outstanding CALL must be delivered before the kernel recycles any frame that belonged to the dead process. Frame reuse ordering is the same constraint that applies to the force-mode live-upgrade path. Cancellation that cannot be posted immediately (the target ring is momentarily unpostable) is deferred with its result code captured, so a deferred death CQE cannot silently degrade to the generic failure code.
A cap whose epoch has been bumped must never route a new CALL to the dead process’s address space, even transiently. The epoch check is a load fence on the per-cap generation counter before any ring dispatch.
Endpoint client facets held by the dead process are revoked at the same epoch bump. Other processes’ client facets to the same endpoint are not affected — they route to the endpoint owner, not to the crashed client.
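The epoch check in the second invariant can be pictured as a generation comparison before dispatch. This is a minimal sketch, not the kernel's code: `CapSlot`, `CapRef`, and the method names are hypothetical, and `std` atomics stand in for whatever the kernel uses in a `no_std` build.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Kernel-side state for one cap (hypothetical layout).
pub struct CapSlot {
    generation: AtomicU64,
}

/// A reference to the cap, remembering the generation it was minted at.
#[derive(Clone, Copy)]
pub struct CapRef {
    minted_generation: u64,
}

impl CapSlot {
    pub fn new() -> Self {
        Self { generation: AtomicU64::new(0) }
    }

    pub fn mint(&self) -> CapRef {
        CapRef { minted_generation: self.generation.load(Ordering::Acquire) }
    }

    /// Epoch bump at process death: every previously minted CapRef goes stale.
    pub fn bump_epoch(&self) {
        self.generation.fetch_add(1, Ordering::Release);
    }

    /// Checked before any ring dispatch. The acquire load models the
    /// "load fence on the per-cap generation counter".
    pub fn may_dispatch(&self, cap: CapRef) -> bool {
        self.generation.load(Ordering::Acquire) == cap.minted_generation
    }
}
```

Because the comparison happens on every dispatch, a stale reference fails the check as soon as the bump is visible, with no separate invalidation pass over holders.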

Crash Record Capture

Status: partially implemented (Phase 1). The CrashKind/CrashRecord schema below has landed in schema/capos.capnp, and the kernel emits these fields for every unplanned death. Delivery to the supervisor’s AuditLog cap is Phase 2; Phase 1 emits the record to the kernel audit log instead, as a kernel-internal [audit] event=crash record.

AuditEventType deliberately gained no crash enumerant: crash events are kernel-internal, so no holder of an AuditLog cap can forge a crash record for a process that never died. This follows the existing debug precedent in kernel/src/cap/audit_log.rs.

Two fields are honest placeholders in Phase 1. lastSqeOpcode is always 0 (“none in flight”): the kernel retains no per-process last-dispatched opcode, and reconstructing one from ring state at death would be a guess. stackOverflow is never emitted: the x86_64 fault classifier sees a stack guard hit as an ordinary #PF and does not distinguish it from pageFault.

The fault classifier maps #PF to pageFault, #GP to generalProtection, and #UD to illegalInstruction. #DB (a user-set trap flag) and #BP (int3) are also folded into illegalInstruction: there is no trap enumerant, and these are the same class of event as #UD — userspace executed an instruction the kernel has no capability to make meaningful. A dedicated trap enumerant would be an additive schema change if the distinction becomes load-bearing. ProcessHandle.terminate emits kernelKill with a faultOffsetInBinary of 0, since an externally-killed process has no fault site. Where a kill races a real fault, the fault’s record wins: the first noted cause is what actually killed the process.
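The mapping described above, as a small sketch. The vector numbers are the architectural x86_64 exception vectors; the Rust enum mirrors the schema's CrashKind, and `classify_fault` and `note_cause` are hypothetical names rather than the kernel's actual functions.

```rust
/// Mirrors the CrashKind schema enum (Rust spelling is illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CrashKind {
    Panic,
    PageFault,
    GeneralProtection,
    StackOverflow,
    IllegalInstruction,
    KernelKill,
}

// x86_64 exception vector numbers.
const VEC_DB: u8 = 1; // #DB, debug (user-set trap flag)
const VEC_BP: u8 = 3; // #BP, breakpoint (int3)
const VEC_UD: u8 = 6; // #UD, invalid opcode
const VEC_GP: u8 = 13; // #GP, general protection
const VEC_PF: u8 = 14; // #PF, page fault

/// Maps a CPL3 exception vector to a crash kind. StackOverflow is never
/// produced: a stack guard hit arrives as an ordinary #PF.
pub fn classify_fault(vector: u8) -> Option<CrashKind> {
    match vector {
        VEC_PF => Some(CrashKind::PageFault),
        VEC_GP => Some(CrashKind::GeneralProtection),
        VEC_UD | VEC_DB | VEC_BP => Some(CrashKind::IllegalInstruction),
        _ => None,
    }
}

/// First noted cause wins: a later cause never overwrites an earlier one.
pub fn note_cause(slot: &mut Option<CrashKind>, cause: CrashKind) {
    if slot.is_none() {
        *slot = Some(cause);
    }
}
```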

The record is structured to support operator debugging without leaking internal kernel state:

# schema/capos.capnp (Phase 1, landed)

enum CrashKind {
    panic @0;         # Rust panic! path
    pageFault @1;     # unmapped or protection fault
    generalProtection @2;
    stackOverflow @3;
    illegalInstruction @4;
    kernelKill @5;    # explicit ProcessHandle.terminate
}

struct CrashRecord {
    processName @0 :Text;
    kind @1 :CrashKind;
    # Instruction pointer at death, relative to ELF load base.
    # Absolute virtual address is NOT included to avoid leaking
    # kernel-side layout or userspace ASLR seeds.
    faultOffsetInBinary @2 :UInt64;
    # Last SQE opcode dispatched for this process (0 = none in flight).
    lastSqeOpcode @3 :UInt8;
    # Session context ID of the process (opaque; matches AuditLog sessionId
    # for attribution without carrying cap-table or buffer contents).
    sessionContextId @4 :Data;
    # Monotonic kernel timestamp at death.
    timestampNs @5 :UInt64;
}

Fields explicitly not included: raw cap IDs, cap-table slot contents, userspace buffer bytes, kernel heap pointers, or any data from the process’s address space beyond the fault offset. The crash record is attributed to the process’s session context ID so it can be correlated with prior AuditLog records without exposing the full cap graph.
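One way to derive faultOffsetInBinary without ever storing the absolute address is sketched below. The function name, the `image_len` parameter, and the choice to return `None` for an instruction pointer outside the image are assumptions; the proposal does not say how such a fault is recorded.

```rust
/// Returns the fault site relative to the ELF load base, or None when the
/// instruction pointer lies outside the loaded image (hypothetical policy).
pub fn fault_offset_in_binary(fault_ip: u64, load_base: u64, image_len: u64) -> Option<u64> {
    // checked_sub rejects addresses below the load base instead of wrapping.
    let offset = fault_ip.checked_sub(load_base)?;
    // Only offsets inside the image are meaningful to an operator.
    (offset < image_len).then_some(offset)
}
```

For example, a fault at 0x40_1234 in an image loaded at 0x40_0000 yields the offset 0x1234, which an operator can resolve against the binary's symbols without learning where the image was mapped.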

The Phase 2 target is to deliver the crash record through the same AuditLog.record path the hardware-audit service already uses: the supervisor holds the AuditLog cap; the kernel invokes it on the supervisor’s ring (via a kernel-initiated RECV) rather than on a shared global ring. Phase 1 emits the same fields to the kernel audit log, which is enough for an operator to see the crash but does not yet give a supervisor a programmatic signal.

Bounded Restart Policy

The supervisor that spawned a failed process owns the restart decision. The restart budget is declared in the manifest’s initConfig.services entry and interpreted by init (or a delegated supervisor):

// CUE representation (illustrative)
restart: {
    policy:       "on-failure"   // never | on-failure | always
    maxRestarts:  5              // total budget over the window
    windowSecs:   60             // sliding window for the budget
    backoffBase:  "1s"           // initial delay before first restart
    backoffMax:   "30s"          // ceiling on exponential backoff
    emergencyFallback: "shell"   // service name to promote if budget exhausted
}

Backoff is bounded and service-class aware. The exponential backoffBase→backoffMax schedule suits user-facing services that should self-heal without spinning (the Kubernetes CrashLoopBackOff lesson). For always-available system services, the prior-art note’s systemd lesson favors a short flat delay so a transient fault recovers fast; such services set backoffBase == backoffMax for flat RestartSec-style behavior. In both cases maxRestarts/windowSecs is the hard give-up budget (the OTP max-restart-intensity lesson), so neither model spins forever.
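A sketch of the delay schedule under one assumption the proposal leaves open: the delay doubles on each attempt. `restart_delay` is a hypothetical helper, with `attempt` counted from 0.

```rust
use std::time::Duration;

/// Delay before restart number `attempt` (0-based): base * 2^attempt,
/// capped at `max`. Doubling is an assumed growth factor.
pub fn restart_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    // Saturate the multiplier instead of overflowing on large attempt counts.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    // An overflowing product is treated as "at least max".
    base.checked_mul(factor).unwrap_or(max).min(max)
}
```

With backoffBase "1s" and backoffMax "30s" this gives 1s, 2s, 4s, 8s, 16s, then 30s from the sixth restart on. Setting backoffBase equal to backoffMax makes every delay the same, which is the flat RestartSec-style behavior described above.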

Crash-loop detection. If a service uses up its maxRestarts attempts within windowSecs, the supervisor stops restarting it and records a budget-exhausted event. The service stays marked permanently failed until an operator issues an explicit override through the ProcessHandle or re-spawns it via a fresh manifest reload.
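The sliding-window budget can be sketched as a queue of recent restart timestamps. `RestartBudget` and its methods are hypothetical names, and timestamps are modeled as durations since boot to keep the example free of wall-clock calls.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Sliding-window restart budget (illustrative, not init's actual type).
pub struct RestartBudget {
    max_restarts: usize,
    window: Duration,
    recent: VecDeque<Duration>, // restart times, monotonic since boot
    exhausted: bool,            // sticky until an operator override
}

impl RestartBudget {
    pub fn new(max_restarts: usize, window: Duration) -> Self {
        Self { max_restarts, window, recent: VecDeque::new(), exhausted: false }
    }

    /// Returns true if a restart is permitted at `now` and records it.
    pub fn try_consume(&mut self, now: Duration) -> bool {
        if self.exhausted {
            return false;
        }
        // Forget restarts that have slid out of the window.
        while let Some(&oldest) = self.recent.front() {
            if now.saturating_sub(oldest) >= self.window {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        if self.recent.len() >= self.max_restarts {
            // Budget exhausted: permanently failed until overridden.
            self.exhausted = true;
            return false;
        }
        self.recent.push_back(now);
        true
    }

    /// Explicit operator override: clears the permanent-failure mark.
    pub fn operator_reset(&mut self) {
        self.exhausted = false;
        self.recent.clear();
    }
}
```

The sticky `exhausted` flag is what separates this from a plain rate limiter: once the budget is gone, waiting out the window does not bring the service back.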

Authority preservation. Each restart uses the original ProcessSpawner call with the same CapGrant list that was used at initial spawn. The supervisor does not invent new grants or escalate authority. If a grant source was a SpawnGrantSource::Kernel DDF handle that is now invalidated (for example, a DMA buffer whose owner quiesce failed), the restart fails closed with a spawn-grant-invalid error rather than falling back to an ambient grant.

No resurrection of stale caps. The restarted process receives a fresh cap table. The supervisor must call CapRetarget (from Live Upgrade) to re-point existing client caps to the new process. If CapRetarget is not yet implemented, clients observing disconnected must reconnect through the supervisor’s exported endpoint, which the supervisor re-registers after restart.

Watchdog Capability

A service that can hang without crashing (blocked ring, infinite loop, deadlock on a kernel-held lock it does not own) is not detected by exit-path crash handling. The watchdog provides periodic liveness proof:

# Proposed future addition to schema/capos.capnp (Phase 3)

interface Watchdog {
    # Service calls this on every iteration of its main loop to
    # reset the deadline.  If not called within `timeoutNs` of the
    # last kick (or of registration), the supervisor is notified.
    kick @0 () -> ();
    # Unregister.  Safe to call during planned shutdown.
    cancel @1 () -> ();
}

interface WatchdogSource {
    # Register this process with the given timeout.
    # Returns a Watchdog the service holds and kicks.
    register @0 (processName :Text, timeoutNs :UInt64) -> (watchdogIndex :UInt16);
}

The supervisor grants a Watchdog cap (minted from a WatchdogSource it holds) to each service it considers watchdog-registered. If the kernel timer fires without a kick, the supervisor receives a liveness-failure notification and treats it identically to an unplanned crash: crash record, restart budget check, backoff.
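The deadline logic behind kick and cancel is small enough to sketch directly. `WatchdogState` is a hypothetical supervisor-side record, and treating the deadline as missed only once more than timeoutNs has elapsed is an assumption about the boundary case.

```rust
/// Per-service liveness state (illustrative, not the Phase 3 implementation).
pub struct WatchdogState {
    timeout_ns: u64,
    last_kick_ns: u64,
    armed: bool,
}

impl WatchdogState {
    /// Registration starts the first deadline.
    pub fn register(now_ns: u64, timeout_ns: u64) -> Self {
        Self { timeout_ns, last_kick_ns: now_ns, armed: true }
    }

    /// Resets the deadline; ignored after cancel.
    pub fn kick(&mut self, now_ns: u64) {
        if self.armed {
            self.last_kick_ns = now_ns;
        }
    }

    /// Unregisters, for example during planned shutdown.
    pub fn cancel(&mut self) {
        self.armed = false;
    }

    /// True when the service has gone longer than the timeout without a kick.
    pub fn expired(&self, now_ns: u64) -> bool {
        self.armed && now_ns.saturating_sub(self.last_kick_ns) > self.timeout_ns
    }
}
```

An expiry would then feed the same restart-budget path as a crash; a cancelled watchdog never expires, so a planned shutdown does not count against the budget.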

The watchdog is an opt-in service-level contract, not a mandatory kernel mechanism. Services that are inherently event-driven (blocked on cap_enter waiting for an SQE) do not need a watchdog; they will return disconnected to callers if they stop processing. Watchdog is primarily useful for services with internal polling loops or external I/O not driven by the capOS ring.

Degraded Boot

The manifest may declare an emergency fallback service that is promoted when a critical service exhausts its restart budget before the system reaches a usable state:

// CUE (illustrative)
degradedBoot: {
    trigger:  "net-stack"    // if this service fails permanently...
    fallback: "shell"        // ...promote this service to interactive
    timeoutSecs: 30          // deadline from kernel handoff to readiness
}

The init process monitors service readiness. If a declared critical service fails to reach readiness within the timeout and has exhausted its restart budget, init spawns the fallback service with a console cap and an audit cap so an operator can inspect what failed. The fallback service is not granted the failed service’s caps; it is a scoped interactive shell, not a repair agent with escalated authority.
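The promotion rule reads as a conjunction of two conditions with a fixed, narrow grant set. A sketch, with `CriticalService`, `FallbackGrant`, and `fallback_grants` all hypothetical:

```rust
/// The only caps the fallback service receives (illustrative).
#[derive(Debug, PartialEq, Eq)]
pub enum FallbackGrant {
    Console,
    Audit,
}

/// Init's view of the declared trigger service (illustrative).
pub struct CriticalService {
    pub ready: bool,
    pub budget_exhausted: bool,
}

/// Returns the grant list for the fallback service, or None if the
/// fallback should not be promoted yet.
pub fn fallback_grants(
    svc: &CriticalService,
    elapsed_secs: u64,
    timeout_secs: u64,
) -> Option<Vec<FallbackGrant>> {
    let missed_deadline = !svc.ready && elapsed_secs >= timeout_secs;
    if missed_deadline && svc.budget_exhausted {
        // Deliberately none of the failed service's own caps.
        Some(vec![FallbackGrant::Console, FallbackGrant::Audit])
    } else {
        None
    }
}
```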

Relevant Research and Prior Art

In-Tree Research Notes

Crash Recovery and Supervision is the dedicated prior-art survey for this proposal: supervision trees, restart budgets (OTP intensity/period, systemd StartLimit, Kubernetes CrashLoopBackOff), dead-server notification (Fuchsia ZX_CHANNEL_PEER_CLOSED, seL4 silence, Genode Ipc_error), and coredump-redaction concerns, each verified against primary sources.
OS Error Handling grounds the stale-cap surface. It records what callers observe when a server dies in comparable systems: Zircon channels close and the peer observes ZX_ERR_PEER_CLOSED; Genode capabilities to a dead server become invalid and subsequent invocations produce Ipc_error; seL4 routes faults to a per-thread fault endpoint; KeyKOS/EROS routes them to the domain keeper. The shared lesson is that a dead server must surface as a typed transport-level signal, not a hung invocation, which is what the CAP_ERR_SERVER_DIED death CQE provides.
Cap’n Proto Error Handling fixes the meaning of disconnected in the four-kind capnp model: “connection to a necessary capability was lost,” with the client response being re-establish-and-retry. This proposal reuses that existing classification rather than minting a new exception kind; the only addition is when the kernel emits it (unplanned death) and the paired CrashRecord.
EROS, CapROS, Coyotos documents the EROS/KeyKOS keeper mechanism: a capability to a separate domain that the kernel invokes on fault, which can inspect, terminate, or restart the faulting domain (process supervision is an explicitly listed use). capOS’s supervisor-owns-ProcessHandle model is the same shape with capnp typed methods instead of a keeper key, and the kernel never initiates the restart itself.
Genode and seL4 ground the no-resurrected-authority invariant: Genode’s parent-supervised component tree with revocable capabilities, and seL4’s hierarchical delegation plus Revoke over the capability derivation tree, both establish that a restarted child gets fresh authority and that revocation of the dead instance’s caps is the supervisor’s (parent’s) responsibility, not an ambient lookup.

External Precedent and Lessons

Erlang/OTP supervision trees. The “let it crash” philosophy plus supervisor restart strategies (one_for_one, one_for_all, rest_for_one) and max-restart-intensity (MaxR restarts within MaxT seconds, after which the supervisor itself terminates) are the direct precedent for this proposal’s per-service failure budget and crash-loop detection. The lesson: bound restarts over a sliding window and escalate (here: stop and mark permanently failed, optionally promote degraded boot) rather than loop forever.
systemd unit restart policy. Restart=on-failure|always, RestartSec backoff, and StartLimitIntervalSec/StartLimitBurst are the precedent for the policy/backoffBase/maxRestarts/windowSecs fields. The lesson: separate the whether (policy) from the pacing (backoff) from the give-up threshold (burst limit).
Kubernetes liveness/readiness probes and CrashLoopBackOff. Liveness probes (kubelet restarts a container that fails its probe) are the precedent for the Watchdog.kick/timeout design; readiness gating before promotion is the precedent for the degraded-boot readiness deadline; CrashLoopBackOff with exponential backoff is the precedent for capped exponential restart delay. The lesson: liveness is opt-in and orthogonal to crash detection, because a hung-but-not-dead process needs an explicit liveness signal.
Fuchsia component lifecycle. Component-manager-driven start/stop and rebinding in the routing graph parallel capOS’s supervisor + CapRetarget reconnection. The dedicated research note above grounds Fuchsia’s death-observation behavior (ZX_CHANNEL_PEER_CLOSED, no implicit reconnect); a deeper write-up of Fuchsia component-manager restart and escrow semantics remains research-needed (per the docs/backlog/research-design-gaps.md convention) before this proposal cites specific escrow behavior as grounding.

Phasing

Phase 1 — Stale-cap death-CQE propagation and crash record. Landed, with gap #1 only partly closed. Callers now observe server death as the typed CAP_ERR_SERVER_DIED transport error rather than as a generic invoke failure, and unplanned deaths emit a redacted crash record to the kernel audit log. make run-crash-disconnect proves both the queued and the in-flight cancel paths against a server that dies by fault.

What “landed” does not mean: a supervisor still cannot tell a crash from a planned termination. That half of gap #1 is untouched by Phase 1 and is Phase 2 work — the CQE deliberately does not carry the distinction, and the crash record that does goes only to the kernel audit log, which no supervisor reads. Three items originally scoped here moved to Phase 2: delivery of CrashRecord to the supervisor’s AuditLog cap, a lastSqeOpcode that is not always 0, and CrashKind::panic (see Crash Record Capture).

SMP concurrent-exit consistency (resolved 2026-07-17 18:53 UTC). Originally a Phase 1 residual: when two threads of the same multithreaded process called exit concurrently, each decided independently — before either was marked exited — whether its own exit was also its process’s death. Both could observe a still-live sibling and classify their in-flight cancellations as an ordinary teardown, even though the process died moments later. An in-flight call served by such a thread then completed with CAP_ERR_INVOKE_FAILED instead of CAP_ERR_SERVER_DIED, while merely queued calls against the same endpoint — reaped by the process-wide path — still reported CAP_ERR_SERVER_DIED. The caller never hung; it just got the less specific code.

Deciding this correctly requires knowing the process’s fate at a moment when it is genuinely not yet determined — a sibling that has not declared its intent may keep serving — so a tighter read of a live-thread snapshot cannot close it. The fix instead defers the cancel decision until the fate is known. At a thread exit, the caller completions for in-flight calls the exiting thread was serving are not classified in place: they are held on the process as (caller_thread, user_data) pairs (the in-flight endpoint entry itself is still removed). Two resolution points settle them. If the process dies, release_caps_for_exit drains the held pairs as CAP_ERR_SERVER_DIED. If the process instead stably survives — still in the process table with a live thread, no thread currently in the exit path, and no whole-process termination queued — a periodic scheduler sweep (service_periodic_work) drains them as CAP_ERR_INVOKE_FAILED; a thread exit the process genuinely survived must stay the generic code (no false positive). A per-process exit-path counter is what lets the sweep tell “a worker left while the server lives on” from “an exit is still in progress”.

Resolving lazily, rather than from the exiting thread’s own snapshot, is what closes the SMP race. A concurrent multithreaded death completes in microseconds — far inside a single scheduler tick — so the server-death drain always wins the race against the survivor sweep, and both in-flight calls report CAP_ERR_SERVER_DIED regardless of the exact interleaving or host load. A process that truly outlives one worker’s exit sees the deferred completions resolve to the generic code at the next tick. The CPL3-fault whole-process path keeps its CAP_ERR_SERVER_DIED result through the pending-termination signal, which also excludes a dying process from the survivor sweep. A single-threaded server (the common service shape) was never affected. Proofs: make run-crash-disconnect (single-threaded) and make run-crash-disconnect-multithread (two serving threads exiting concurrently under SMP; both in-flight calls and the queued call observe CAP_ERR_SERVER_DIED).
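The two resolution points can be modeled in a few lines. This is a simplified single-threaded model, not the kernel's code: only release_caps_for_exit is a name from the text above, `survivor_sweep` stands in for the work done from service_periodic_work, and the error-code values and remaining names are hypothetical.

```rust
// Hypothetical values: the real constants live in capos-config/src/ring.rs.
pub const CAP_ERR_INVOKE_FAILED: i32 = -1;
pub const CAP_ERR_SERVER_DIED: i32 = -2;

#[derive(Debug, PartialEq, Eq)]
pub struct Cqe {
    pub caller_thread: u64,
    pub user_data: u64,
    pub result: i32,
}

pub struct Process {
    live_threads: u32,
    threads_in_exit_path: u32, // the per-process exit-path counter
    termination_queued: bool,
    held: Vec<(u64, u64)>, // deferred (caller_thread, user_data) pairs
}

impl Process {
    pub fn new(live_threads: u32) -> Self {
        Self { live_threads, threads_in_exit_path: 0, termination_queued: false, held: Vec::new() }
    }

    pub fn begin_thread_exit(&mut self) {
        self.threads_in_exit_path += 1;
    }

    /// An exiting thread parks its in-flight callers instead of classifying them.
    pub fn hold(&mut self, caller_thread: u64, user_data: u64) {
        self.held.push((caller_thread, user_data));
    }

    pub fn finish_thread_exit(&mut self) {
        self.threads_in_exit_path -= 1;
        self.live_threads -= 1;
    }

    pub fn queue_termination(&mut self) {
        self.termination_queued = true;
    }

    fn drain(&mut self, result: i32) -> Vec<Cqe> {
        self.held
            .drain(..)
            .map(|(caller_thread, user_data)| Cqe { caller_thread, user_data, result })
            .collect()
    }

    /// Resolution point 1: the process died.
    pub fn release_caps_for_exit(&mut self) -> Vec<Cqe> {
        self.drain(CAP_ERR_SERVER_DIED)
    }

    /// Resolution point 2: periodic sweep, only for a stably surviving process.
    pub fn survivor_sweep(&mut self) -> Vec<Cqe> {
        let stably_survives =
            self.live_threads > 0 && self.threads_in_exit_path == 0 && !self.termination_queued;
        if stably_survives { self.drain(CAP_ERR_INVOKE_FAILED) } else { Vec::new() }
    }
}
```

While any thread is still in the exit path the sweep drains nothing, so a concurrent two-thread death can only be resolved by the death path.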

Phase 2 — Bounded restart policy and crash-loop detection. Init reads restart budget fields from initConfig.services, applies exponential backoff, and stops at budget exhaustion. Requires: delivery of the Phase 1 crash record to the supervisor’s AuditLog cap, so init can tell a planned death from an unplanned one (Phase 1 emits the record to the kernel audit log only, which no supervisor can read); CapRetarget from live-upgrade Phase 1 to reconnect client caps after restart.

Phase 3 — Watchdog capability. WatchdogSource and Watchdog schema, kernel timer integration, supervisor-side timeout detection, and liveness-failure events fed into the same restart budget path as Phase 2.

Phase 4 — Degraded boot. Manifest parser reads degradedBoot fields; init promotes the fallback service on budget exhaustion during the boot window. Requires: Phase 2 budget tracking.

Hazards and Invariants

Frame reuse ordering. The kernel must not return frames from a dead process’s address space to the frame allocator until all outstanding death CQEs for that process’s caps have been delivered. Violating this could allow a concurrent FrameAlloc to map recycled memory into a new process before the old process’s CQEs complete, creating a window where a stale death CQE arrives after the frame holds new data. The existing DMA quiesce/scrub ordering in the DMA pool grant path is the model for this constraint.

No stale authority after restart. A restarted process receives only the grants declared in the original ProcessSpawner.spawn call. The supervisor must not silently re-grant caps that were revoked as part of the death epoch bump. In particular, any DMAPool-derived handle that was in active use at crash time must be explicitly re-acquired through the grant-source path, not recycled from the dead process’s cap table.

Restart does not bypass the authority broker. If the original spawn was gated on an AuthorityBroker-selected session context, the restart uses the same broker path. The supervisor cannot substitute a broader session context or an anonymous context to make the restart succeed.

Capability revocation precedes any dump. The death epoch bump that invalidates the crashed process’s caps must complete before any crash record or future coredump is produced. A record produced post-revocation sees only dead cap indices, never live authority; a pre-revocation memory snapshot could otherwise capture live cap indices or ring-buffer contents (the race class behind recent coredump CVEs). Any future coredump extension must run only after revocation and must not be readable by unprivileged dump readers.

Crash record isolation. The crash record must not carry raw cap IDs, cap table slot numbers, or any data read from the process’s address space (stack contents, heap contents, message buffers). The fault offset is relative to the binary load base, not an absolute virtual address, to avoid leaking kernel layout or userspace address randomization.

Watchdog authority is narrow. A Watchdog cap proves liveness for exactly one registered process. It does not grant the holder any access to the supervisor, the process’s caps, or any other service. It is a pure liveness signal, not an authority surface.

Relationship to Adjacent Proposals

Live Upgrade — covers the planned case. The CapRetarget primitive defined there is consumed by Phase 2 of this proposal to reconnect client caps after an unplanned restart. The force-mode disconnected delivery and epoch-revocation paths are shared; this proposal adds the death CQE and crash record on top.
Service Architecture — defines the supervisor tree and the RestartPolicy type currently parsed by init. This proposal extends that policy with the budget, backoff, and budget-exhaustion fields, and binds crash handling to the supervisor that owns the ProcessHandle, not to a global daemon.
capos-service — defines the userspace service framework above capos-rt. The watchdog kick call and readiness notification in Phase 3 are natural additions to the service lifecycle hooks that capos-service abstracts.
Error Handling — the disconnected class in the two-level error model is the transport surface for stale-cap delivery. This proposal does not add new error types; it specifies when and how disconnected is delivered for an unplanned death.
System Monitoring — crash records, restart events, budget-exhaustion notifications, and watchdog timeouts are all audit-worthy. The monitoring proposal owns the operator visibility surface; this proposal defines the structured events that feed it.
Resource Accounting and Quotas — the failure budget is a quota: a count consumed by crash events and refilled by the sliding window. The accounting model for this quota follows the same ledger-of-record pattern as memory and scheduling quotas.
