Proposal: Crash Recovery and Supervision

How capOS handles unplanned process failure: propagating the death to capability holders, recording a structured crash event, and restarting the service within a bounded policy — all without resurrecting stale authority.

Problem

Live upgrade covers the planned case: a supervisor quiesces a running service, transfers state, retargets caps, and exits the old process in a controlled sequence. Unplanned failure is different. A process that panics, faults, or is killed by the kernel OOM path leaves no quiesce call, no state handoff, and no ordered exit. The kernel marks the process dead and epoch-bumps its caps, but nothing in the current model tells callers what happened or gives the supervisor a policy-bounded path to respawn it.

The gaps are:

  1. Stale-cap observability. Callers holding a cap to the dead process receive disconnected errors at the transport level (the epoch-revocation path from Stage 6 is in place), but there is no structured CQE event that carries crash context or lets the caller distinguish a crash from a planned termination.
  2. Crash metadata capture. Panic location, fault address, and last SQE opcode are useful for operators but must not leak raw cap-table contents, local cap IDs, or buffer bytes, which would break the no-ambient-authority invariant.
  3. Bounded restart policy. Re-spawning a crashing service without a budget produces crash-loop amplification; re-spawning must use the same broker and manifest authority that the original spawn used, not an escalated path.
  4. Watchdog liveness. A process that hangs without crashing is not detected by crash handling alone.
  5. Degraded boot. If a critical service fails to start, the system needs a safe fallback rather than a silent hang.

This proposal fills these gaps without touching the live-upgrade protocol and without adding a god-object supervisor.

User Stories

  • An operator running make run-smoke sees a structured crash record in the audit log when a demo service panics, not a silent stale-cap error.
  • A client process calling a crashed server receives a disconnected-class CapException promptly; the process does not block indefinitely.
  • Init restarts a failed service up to the configured failure budget, then stops and declares the service permanently failed rather than looping forever.
  • A watchdog-registered service that hangs (no panic, no exit) is detected within its timeout and restarted under the same policy.
  • If the network stack fails before a shell connects, the manifest-declared emergency shell starts instead of leaving the system unresponsive.

Design

Stale-Cap Propagation

When the kernel marks a process dead (panic, fault, or explicit terminate without a prior clean exit), it performs the same epoch-bump it already does for released caps. The existing disconnected value in ExceptionType covers the transport error. The new addition is a death CQE: a CapException { type: disconnected, message: "server-death" } delivered to any process with an outstanding CALL SQE whose target belongs to the dead process.

From the caller’s perspective an unplanned crash looks identical to a force-mode live upgrade that did not reattach: the in-flight CALL returns disconnected, epoch is bumped, and any subsequent CALL on that cap also returns disconnected until the supervisor retargets the cap to a fresh instance. No new CQE opcode is needed; the existing two-level error model from Error Handling is sufficient.

Invariants:

  • A disconnected CQE on an outstanding CALL must be delivered before the kernel recycles any frame that belonged to the dead process. Frame reuse ordering is the same constraint that applies to the force-mode live-upgrade path.
  • A cap whose epoch has been bumped must never route a new CALL to the dead process’s address space, even transiently. The epoch check is a fenced load of the per-cap generation counter that completes before any ring dispatch.
  • Endpoint client facets held by the dead process are revoked at the same epoch bump. Other processes’ client facets to the same endpoint are not affected — they route to the endpoint owner, not to the crashed client.
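A minimal sketch of the epoch gate in the second invariant, assuming a per-target generation counter and a cap handle that remembers the generation it was minted under. TargetEpoch, CapRef, and DispatchError are hypothetical names used only for illustration; the actual kernel structures may look quite different.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-target generation counter; bumped on the death path.
pub struct TargetEpoch(AtomicU64);

/// Hypothetical cap handle that remembers the generation it was minted under.
pub struct CapRef {
    minted_epoch: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DispatchError {
    /// Stands in for the `disconnected` exception type.
    Disconnected,
}

impl TargetEpoch {
    pub fn new() -> Self {
        TargetEpoch(AtomicU64::new(0))
    }

    /// Mint a cap bound to the current generation.
    pub fn mint(&self) -> CapRef {
        CapRef { minted_epoch: self.0.load(Ordering::Acquire) }
    }

    /// Death path: every cap minted before this call becomes stale.
    pub fn bump(&self) {
        self.0.fetch_add(1, Ordering::AcqRel);
    }

    /// Gate that must pass before a CALL is placed on any ring.
    pub fn check(&self, cap: &CapRef) -> Result<(), DispatchError> {
        if self.0.load(Ordering::Acquire) == cap.minted_epoch {
            Ok(())
        } else {
            Err(DispatchError::Disconnected)
        }
    }
}
```

A cap minted before bump() fails the check on every later dispatch, while a cap minted afterwards (standing in for one the supervisor has retargeted) passes.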

Crash Record Capture

When a process dies unplanned, the kernel appends a crash record to the AuditLog cap held by the supervisor that spawned the process (not to a global log visible to all processes). The record is structured to support operator debugging without leaking internal kernel state:

# Proposed addition to schema/capos.capnp (Phase 1)

enum CrashKind {
    panic @0;         # Rust panic! path
    pageFault @1;     # unmapped or protection fault
    generalProtection @2;
    stackOverflow @3;
    illegalInstruction @4;
    kernelKill @5;    # explicit ProcessHandle.terminate
}

struct CrashRecord {
    processName @0 :Text;
    kind @1 :CrashKind;
    # Instruction pointer at death, relative to ELF load base.
    # Absolute virtual address is NOT included to avoid leaking
    # kernel-side layout or userspace ASLR seeds.
    faultOffsetInBinary @2 :UInt64;
    # Last SQE opcode dispatched for this process (0 = none in flight).
    lastSqeOpcode @3 :UInt8;
    # Session context ID of the process (opaque; matches AuditLog sessionId
    # for attribution without carrying cap-table or buffer contents).
    sessionContextId @4 :Data;
    # Monotonic kernel timestamp at death.
    timestampNs @5 :UInt64;
}

Fields explicitly not included: raw cap IDs, cap-table slot contents, userspace buffer bytes, kernel heap pointers, or any data from the process’s address space beyond the fault offset. The crash record is attributed to the process’s session context ID so it can be correlated with prior AuditLog records without exposing the full cap graph.
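As an illustration of how the record could be assembled without touching the process's address space, the sketch below mirrors the proposed schema in plain Rust. DeathContext, relative_fault_offset, and build_crash_record are hypothetical names, and returning None when the instruction pointer lies below the load base is an assumption of this sketch; the proposal does not say how that case should be reported.

```rust
/// Mirrors the proposed `CrashKind` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CrashKindSketch {
    Panic,
    PageFault,
    GeneralProtection,
    StackOverflow,
    IllegalInstruction,
    KernelKill,
}

/// Mirrors the proposed `CrashRecord` struct, field for field.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CrashRecordSketch {
    pub process_name: String,
    pub kind: CrashKindSketch,
    pub fault_offset_in_binary: u64,
    pub last_sqe_opcode: u8,
    pub session_context_id: Vec<u8>,
    pub timestamp_ns: u64,
}

/// Hypothetical view of what the death path knows about the process.
pub struct DeathContext<'a> {
    pub process_name: &'a str,
    pub kind: CrashKindSketch,
    pub fault_ip: u64,
    pub load_base: u64,
    pub last_sqe_opcode: Option<u8>,
    pub session_context_id: &'a [u8],
    pub now_ns: u64,
}

/// Offset of the faulting instruction relative to the ELF load base.
/// The absolute address never leaves this function.
pub fn relative_fault_offset(fault_ip: u64, load_base: u64) -> Option<u64> {
    fault_ip.checked_sub(load_base)
}

/// Build a record from metadata only; no cap IDs or buffer bytes are read.
pub fn build_crash_record(ctx: &DeathContext<'_>) -> Option<CrashRecordSketch> {
    let offset = relative_fault_offset(ctx.fault_ip, ctx.load_base)?;
    Some(CrashRecordSketch {
        process_name: ctx.process_name.to_string(),
        kind: ctx.kind,
        fault_offset_in_binary: offset,
        // The schema reserves 0 for "no SQE in flight".
        last_sqe_opcode: ctx.last_sqe_opcode.unwrap_or(0),
        session_context_id: ctx.session_context_id.to_vec(),
        timestamp_ns: ctx.now_ns,
    })
}
```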

The crash record is delivered through the same AuditLog.record path the hardware-audit service already uses: the supervisor holds the AuditLog cap; the kernel invokes it on the supervisor’s ring (via a kernel-initiated RECV) rather than on a shared global ring.

Bounded Restart Policy

The supervisor that spawned a failed process owns the restart decision. The restart budget is declared in the manifest’s initConfig.services entry and interpreted by init (or a delegated supervisor):

# CUE representation (illustrative)
restart: {
    policy:       "on-failure"   # never | on-failure | always
    maxRestarts:  5              # total budget over the window
    windowSecs:   60             # sliding window for the budget
    backoffBase:  "1s"           # initial delay before first restart
    backoffMax:   "30s"          # ceiling on exponential backoff
    emergencyFallback: "shell"   # service name to promote if budget exhausted
}

Backoff is bounded and service-class aware. The exponential schedule, which grows from backoffBase up to the backoffMax ceiling, suits user-facing services that should self-heal without spinning (the Kubernetes CrashLoopBackOff lesson). For always-available system services, the prior-art note’s systemd lesson favors a short flat delay so a transient fault recovers fast; such services set backoffBase == backoffMax for flat RestartSec-style behavior. In both cases maxRestarts/windowSecs is the hard give-up budget (the OTP max-restart-intensity lesson), so neither model spins forever.
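The delay schedule can be sketched as a single pure function. This assumes the exponential growth is a doubling per attempt, with attempt 0 being the first restart; the proposal only says "exponential", so the factor of two and the name restart_delay are assumptions of this example.

```rust
use std::time::Duration;

/// Delay before restart number `attempt` (0 = first restart).
/// Doubles from `base` and is clamped to `max`; with `base == max`
/// the schedule degenerates to a flat delay.
pub fn restart_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    // 2^attempt, saturating once the shift would overflow a u32.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    base.saturating_mul(factor).min(max)
}
```

With backoffBase = 1s and backoffMax = 30s this yields 1s, 2s, 4s, 8s, 16s, and then 30s for every later attempt.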

Crash-loop detection. If a service uses up all maxRestarts attempts within windowSecs, the supervisor stops restarting it and records a budget-exhausted event. The service is marked permanently failed until an operator issues an explicit override through the ProcessHandle or re-spawns it via a fresh manifest reload.
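One way to track the budget is a queue of recent restart timestamps. The sketch below assumes monotonic nanosecond timestamps and treats exhaustion as sticky until an explicit override; RestartBudget, RestartDecision, and operator_override are hypothetical names, not part of the proposed schema.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq, Eq)]
pub enum RestartDecision {
    Restart,
    BudgetExhausted,
}

/// Sliding-window restart budget (maxRestarts over windowSecs).
pub struct RestartBudget {
    max_restarts: usize,
    window_ns: u64,
    recent: VecDeque<u64>, // timestamps of restarts still inside the window
    exhausted: bool,       // sticky "permanently failed" flag
}

impl RestartBudget {
    pub fn new(max_restarts: usize, window_secs: u64) -> Self {
        RestartBudget {
            max_restarts,
            window_ns: window_secs.saturating_mul(1_000_000_000),
            recent: VecDeque::new(),
            exhausted: false,
        }
    }

    /// Called once per unplanned death (or liveness failure).
    pub fn on_failure(&mut self, now_ns: u64) -> RestartDecision {
        if self.exhausted {
            return RestartDecision::BudgetExhausted;
        }
        // Forget restarts that have aged out of the window.
        while let Some(&t) = self.recent.front() {
            if now_ns.saturating_sub(t) >= self.window_ns {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        if self.recent.len() >= self.max_restarts {
            self.exhausted = true;
            return RestartDecision::BudgetExhausted;
        }
        self.recent.push_back(now_ns);
        RestartDecision::Restart
    }

    /// Explicit operator action that clears the permanent-failure state.
    pub fn operator_override(&mut self) {
        self.exhausted = false;
        self.recent.clear();
    }
}
```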

Authority preservation. Each restart uses the original ProcessSpawner call with the same CapGrant list that was used at initial spawn. The supervisor does not invent new grants or escalate authority. If a grant source was a SpawnGrantSource::Kernel DDF handle that is now invalidated (for example, a DMA buffer whose owner quiesce failed), the restart fails closed with a spawn-grant-invalid error rather than falling back to an ambient grant.
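The fail-closed rule can be illustrated with a small check that runs before the spawn is replayed. GrantSketch, its still_valid flag, and grants_for_restart are hypothetical stand-ins; how the supervisor actually learns that a grant source has been invalidated is not specified here.

```rust
/// Hypothetical stand-in for one entry of the original CapGrant list.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct GrantSketch {
    pub name: String,
    /// False once the grant source has been invalidated.
    pub still_valid: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RestartError {
    /// Corresponds to the spawn-grant-invalid error; carries the grant name.
    SpawnGrantInvalid(String),
}

/// Return the grant list to replay, unchanged, or fail closed.
/// Invalid grants are never dropped or replaced with something broader.
pub fn grants_for_restart(original: &[GrantSketch]) -> Result<Vec<GrantSketch>, RestartError> {
    for grant in original {
        if !grant.still_valid {
            return Err(RestartError::SpawnGrantInvalid(grant.name.clone()));
        }
    }
    Ok(original.to_vec())
}
```

The function either returns exactly the original list or an error; there is deliberately no path that yields a modified list.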

No resurrection of stale caps. The restarted process receives a fresh cap table. The supervisor must call CapRetarget (from Live Upgrade) to re-point existing client caps to the new process. If CapRetarget is not yet implemented, clients observing disconnected must reconnect through the supervisor’s exported endpoint, which the supervisor re-registers after restart.

Watchdog Capability

A service that can hang without crashing (blocked ring, infinite loop, deadlock on a kernel-held lock it does not own) is not detected by exit-path crash handling. The watchdog provides periodic liveness proof:

# Proposed future addition to schema/capos.capnp (Phase 3)

interface Watchdog {
    # Service calls this on every iteration of its main loop to
    # reset the deadline.  If not called within `timeoutNs` of the
    # last kick (or of registration), the supervisor is notified.
    kick @0 () -> ();
    # Unregister.  Safe to call during planned shutdown.
    cancel @1 () -> ();
}

interface WatchdogSource {
    # Register this process with the given timeout.
    # Returns the index of the Watchdog cap that the service holds and kicks.
    register @0 (processName :Text, timeoutNs :UInt64) -> (watchdogIndex :UInt16);
}

The supervisor grants a Watchdog cap (minted from a WatchdogSource it holds) to each service it considers watchdog-registered. If the kernel timer fires without a kick, the supervisor receives a liveness-failure notification and treats it identically to an unplanned crash: crash record, restart budget check, backoff.
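The deadline bookkeeping behind kick and cancel might look like the sketch below. It assumes monotonic nanosecond timestamps supplied by the caller and treats the deadline as expired once a full timeoutNs has elapsed since the last kick or registration; WatchdogDeadline is a hypothetical name, and the kernel timer that would drive is_expired is not modeled.

```rust
/// Liveness deadline for one registered process.
pub struct WatchdogDeadline {
    timeout_ns: u64,
    last_kick_ns: u64,
    armed: bool,
}

impl WatchdogDeadline {
    /// Registration starts the first deadline.
    pub fn register(timeout_ns: u64, now_ns: u64) -> Self {
        WatchdogDeadline { timeout_ns, last_kick_ns: now_ns, armed: true }
    }

    /// Reset the deadline; ignored after cancel.
    pub fn kick(&mut self, now_ns: u64) {
        if self.armed {
            self.last_kick_ns = now_ns;
        }
    }

    /// Unregister, e.g. during planned shutdown.
    pub fn cancel(&mut self) {
        self.armed = false;
    }

    /// True when the supervisor should be sent a liveness-failure notification.
    pub fn is_expired(&self, now_ns: u64) -> bool {
        self.armed && now_ns.saturating_sub(self.last_kick_ns) >= self.timeout_ns
    }
}
```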

The watchdog is an opt-in service-level contract, not a mandatory kernel mechanism. Services that are inherently event-driven (blocked on cap_enter waiting for an SQE) do not need a watchdog; they will return disconnected to callers if they stop processing. Watchdog is primarily useful for services with internal polling loops or external I/O not driven by the capOS ring.

Degraded Boot

The manifest may declare an emergency fallback service that is promoted when a critical service exhausts its restart budget before the system reaches a usable state:

# CUE (illustrative)
degradedBoot: {
    trigger:  "net-stack"    # if this service fails permanently...
    fallback: "shell"        # ...promote this service to interactive
    timeoutSecs: 30          # deadline from kernel handoff to readiness
}

The init process monitors service readiness. If a declared critical service fails to reach readiness within the timeout and has exhausted its restart budget, init spawns the fallback service with a console cap and an audit cap so an operator can inspect what failed. The fallback service is not granted the failed service’s caps; it is a scoped interactive shell, not a repair agent with escalated authority.
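The promotion rule can be written as a small decision function. This sketch reads the paragraph above literally and requires both conditions (readiness deadline passed and restart budget exhausted) before promoting the fallback; promoting as soon as the budget is exhausted would be a one-line change. ServiceState, BootAction, DegradedBootPolicy, and boot_action are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceState {
    Starting,
    Ready,
    /// Restart budget exhausted.
    PermanentlyFailed,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BootAction {
    Wait,
    Proceed,
    PromoteFallback,
}

/// Mirrors the timeoutSecs field of the degradedBoot block.
pub struct DegradedBootPolicy {
    pub timeout_secs: u64,
}

/// Decide what init does for the trigger service at `elapsed_secs`
/// after kernel handoff.
pub fn boot_action(
    policy: &DegradedBootPolicy,
    trigger: ServiceState,
    elapsed_secs: u64,
) -> BootAction {
    match trigger {
        ServiceState::Ready => BootAction::Proceed,
        ServiceState::PermanentlyFailed if elapsed_secs >= policy.timeout_secs => {
            BootAction::PromoteFallback
        }
        // Still starting, or failed but inside the readiness window.
        _ => BootAction::Wait,
    }
}
```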

Relevant Research and Prior Art

In-Tree Research Notes

  • Crash Recovery and Supervision is the dedicated prior-art survey for this proposal: supervision trees, restart budgets (OTP intensity/period, systemd StartLimit, Kubernetes CrashLoopBackOff), dead-server notification (Fuchsia ZX_CHANNEL_PEER_CLOSED, seL4 silence, Genode Ipc_error), and coredump-redaction concerns, each verified against primary sources.
  • OS Error Handling grounds the stale-cap surface. It records what callers observe when a server dies in comparable systems: Zircon channels close and the peer observes ZX_ERR_PEER_CLOSED; Genode capabilities to a dead server become invalid and subsequent invocations produce Ipc_error; seL4 routes faults to a per-thread fault endpoint; KeyKOS/EROS routes them to the domain keeper. The shared lesson is that a dead server must surface as a typed transport-level signal, not a hung invocation — which is exactly the disconnected death CQE this proposal specifies.
  • Cap’n Proto Error Handling fixes the meaning of disconnected in the four-kind capnp model: “connection to a necessary capability was lost,” with the client response being re-establish-and-retry. This proposal reuses that existing classification rather than minting a new exception kind; the only addition is when the kernel emits it (unplanned death) and the paired CrashRecord.
  • EROS, CapROS, Coyotos documents the EROS/KeyKOS keeper mechanism: a capability to a separate domain that the kernel invokes on fault, which can inspect, terminate, or restart the faulting domain (process supervision is an explicitly listed use). capOS’s supervisor-owns-ProcessHandle model is the same shape with capnp typed methods instead of a keeper key, and the kernel never initiates the restart itself.
  • Genode and seL4 ground the no-resurrected-authority invariant: Genode’s parent-supervised component tree with revocable capabilities, and seL4’s hierarchical delegation plus Revoke over the capability derivation tree, both establish that a restarted child gets fresh authority and that revocation of the dead instance’s caps is the supervisor’s (parent’s) responsibility, not an ambient lookup.

External Precedent and Lessons

  • Erlang/OTP supervision trees. The “let it crash” philosophy plus supervisor restart strategies (one_for_one, one_for_all, rest_for_one) and max-restart-intensity (MaxR restarts within MaxT seconds, after which the supervisor itself terminates) are the direct precedent for this proposal’s per-service failure budget and crash-loop detection. The lesson: bound restarts over a sliding window and escalate (here: stop and mark permanently failed, optionally promote degraded boot) rather than loop forever.
  • systemd unit restart policy. Restart=on-failure|always, the RestartSec delay, and StartLimitIntervalSec/StartLimitBurst are the precedent for the policy/backoffBase/maxRestarts/windowSecs fields. The lesson: separate the whether (policy) from the pacing (backoff) from the give-up threshold (burst limit).
  • Kubernetes liveness/readiness probes and CrashLoopBackOff. Liveness probes (kubelet restarts a container that fails its probe) are the precedent for the Watchdog.kick/timeout design; readiness gating before promotion is the precedent for the degraded-boot readiness deadline; CrashLoopBackOff with exponential backoff is the precedent for capped exponential restart delay. The lesson: liveness is opt-in and orthogonal to crash detection, because a hung-but-not-dead process needs an explicit liveness signal.
  • Fuchsia component lifecycle. Component-manager-driven start/stop and rebinding in the routing graph parallel capOS’s supervisor + CapRetarget reconnection. The dedicated research note above grounds Fuchsia’s death-observation behavior (ZX_CHANNEL_PEER_CLOSED, no implicit reconnect); a deeper write-up of Fuchsia component-manager restart and escrow semantics remains research-needed (per the docs/backlog/research-design-gaps.md convention) before this proposal cites specific escrow behavior as grounding.

Phasing

Phase 1 — Stale-cap DISCONNECTED propagation and crash record (most model-critical). The death CQE for in-flight CALLs is the highest-priority item because it closes the model gap: callers can observe server death as a typed transport error rather than a hung ring. Crash record delivery to the supervisor’s AuditLog is paired here because it uses the same kernel death path. Requires: epoch-revocation from Stage 6 (done), AuditLog cap (done), CrashRecord schema addition.

Phase 2 — Bounded restart policy and crash-loop detection. Init reads restart budget fields from initConfig.services, applies exponential backoff, and stops at budget exhaustion. Requires: Phase 1 crash record so init knows whether a death was planned or unplanned; CapRetarget from live-upgrade Phase 1 to reconnect client caps after restart.

Phase 3 — Watchdog capability. WatchdogSource and Watchdog schema, kernel timer integration, supervisor-side timeout detection, and liveness-failure events fed into the same restart budget path as Phase 2.

Phase 4 — Degraded boot. Manifest parser reads degradedBoot fields; init promotes the fallback service on budget exhaustion during the boot window. Requires: Phase 2 budget tracking.

Hazards and Invariants

Frame reuse ordering. The kernel must not return frames from a dead process’s address space to the frame allocator until all outstanding disconnected CQEs for that process’s caps have been delivered. Violating this could allow a concurrent FrameAlloc to map recycled memory into a new process before the old process’s CQEs complete, creating a window where a stale disconnected CQE arrives after the frame holds new data. The existing DMA quiesce/scrub ordering in the DMA pool grant path is the model for this constraint.
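The ordering constraint amounts to a gate on frame release. As a rough sketch, assuming the kernel can count the death CQEs it still owes for a dead process, frames are handed back only when that count reaches zero; DeadProcessTeardown and its methods are hypothetical names, and frames are represented by plain numbers.

```rust
/// Teardown state for one dead process.
pub struct DeadProcessTeardown {
    /// Disconnected CQEs still owed to callers with in-flight CALLs.
    pending_death_cqes: usize,
    /// Frames that belonged to the dead address space.
    frames: Vec<u64>,
}

impl DeadProcessTeardown {
    pub fn new(pending_death_cqes: usize, frames: Vec<u64>) -> Self {
        DeadProcessTeardown { pending_death_cqes, frames }
    }

    /// Record that one disconnected CQE reached its caller's ring.
    pub fn cqe_delivered(&mut self) {
        self.pending_death_cqes = self.pending_death_cqes.saturating_sub(1);
    }

    /// Hand the frames to the allocator only after every CQE is delivered.
    /// Returns None while deliveries are still outstanding.
    pub fn release_frames(&mut self) -> Option<Vec<u64>> {
        if self.pending_death_cqes == 0 {
            Some(std::mem::take(&mut self.frames))
        } else {
            None
        }
    }
}
```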

No stale authority after restart. A restarted process receives only the grants declared in the original ProcessSpawner.spawn call. The supervisor must not silently re-grant caps that were revoked as part of the death epoch bump. In particular, any DMAPool-derived handle that was in active use at crash time must be explicitly re-acquired through the grant-source path, not recycled from the dead process’s cap table.

Restart does not bypass the authority broker. If the original spawn was gated on an AuthorityBroker-selected session context, the restart uses the same broker path. The supervisor cannot substitute a broader session context or an anonymous context to make the restart succeed.

Capability revocation precedes any dump. The death epoch bump that invalidates the crashed process’s caps must complete before any crash record or future coredump is produced. A record produced post-revocation sees only dead cap indices, never live authority; a pre-revocation memory snapshot could otherwise capture live cap indices or ring-buffer contents (the race class behind recent coredump CVEs). Any future coredump extension must run only after revocation and must not be readable by unprivileged dump readers.

Crash record isolation. The crash record must not carry raw cap IDs, cap table slot numbers, or any data read from the process’s address space (stack contents, heap contents, message buffers). The fault offset is relative to the binary load base, not an absolute virtual address, to avoid leaking kernel layout or userspace address randomization.

Watchdog authority is narrow. A Watchdog cap proves liveness for exactly one registered process. It does not grant the holder any access to the supervisor, the process’s caps, or any other service. It is a pure liveness signal, not an authority surface.

Relationship to Adjacent Proposals

  • Live Upgrade — covers the planned case. The CapRetarget primitive defined there is consumed by Phase 2 of this proposal to reconnect client caps after an unplanned restart. The force-mode disconnected delivery and epoch-revocation paths are shared; this proposal adds the death CQE and crash record on top.
  • Service Architecture — defines the supervisor tree and the RestartPolicy type currently parsed by init. This proposal extends that policy with the budget, backoff, and budget-exhaustion fields, and binds crash handling to the supervisor that owns the ProcessHandle, not to a global daemon.
  • capos-service — defines the userspace service framework above capos-rt. The watchdog kick call and readiness notification in Phase 3 are natural additions to the service lifecycle hooks that capos-service abstracts.
  • Error Handling — the disconnected class in the two-level error model is the transport surface for stale-cap delivery. This proposal does not add new error types; it specifies when and how disconnected is delivered for an unplanned death.
  • System Monitoring — crash records, restart events, budget-exhaustion notifications, and watchdog timeouts are all audit-worthy. The monitoring proposal owns the operator visibility surface; this proposal defines the structured events that feed it.
  • Resource Accounting and Quotas — the failure budget is a quota: a count consumed by crash events and refilled by the sliding window. The accounting model for this quota follows the same ledger-of-record pattern as memory and scheduling quotas.