Proposal: Live Upgrade

Replacing a running service with a new binary, without dropping outstanding capability references or losing in-flight work. The kernel-side primitive (CapRetarget) is owned by this proposal; the surrounding orchestration (supervisors, manifest sources, fault containment) is owned by service-architecture-proposal.md and consumes the primitive defined here.

Problem

In a Linux-like system, “upgrading a service” is one of:

Restart: stop the old process, start the new one. Clients holding file descriptors, sockets, or pipes to the old process receive ECONNRESET or EPIPE and must reconnect. Session state is lost unless clients serialize it themselves.
Graceful restart (nginx -s reload, unicorn, systemd socket activation): new process starts alongside old, inherits the listening socket, old drains in-flight requests. Works only for request/response protocols where the session is the request. Does nothing for stateful sessions.
Live patch (kpatch, ksplice): binary-level function replacement. Narrow, fragile, no schema for state layout changes.

None of these compose with a capability OS. A CapId held by a client points at a specific process; if that process exits, the cap is dead. There is no “the service” abstraction the kernel could re-bind — the point of capabilities is that they identify a specific reference, not a name that could be redirected after the fact.

But capOS has a kernel-side primitive the Linux model lacks: the kernel already owns the authoritative table of every CapId and which process serves it. Rewriting “cap X is served by process v1” → “cap X is served by process v2” is a table update. The question is when it is safe, and how v2 inherits enough state to answer the next call.
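The "table update" framing can be made concrete with a toy model. The sketch below is illustrative only: CapTable, its served_by map, and the integer CapId/Pid aliases are hypothetical stand-ins, not the kernel's actual data structures.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the kernel's identifiers.
type CapId = u32;
type Pid = u32;

/// Toy model of the authoritative table: which process serves each CapId.
struct CapTable {
    served_by: HashMap<CapId, Pid>,
}

impl CapTable {
    /// Point every CapId served by `old` at `new` and return how many
    /// entries moved. The CapIds themselves (the keys) never change.
    fn retarget(&mut self, old: Pid, new: Pid) -> u32 {
        let mut moved = 0;
        for server in self.served_by.values_mut() {
            if *server == old {
                *server = new;
                moved += 1;
            }
        }
        moved
    }
}
```

A client holding CapId 7 keeps holding CapId 7 across the call; only the value behind it changes. The model says nothing about when the update is safe, which is the subject of the rest of this proposal.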

Three Cases

Live upgrade has three distinct cost profiles. The right design is to make each one explicit rather than pretend the hard case doesn’t exist.

Case 1: Stateless services

Each SQE is independent; the service holds no state that matters across calls. A request router, a pure codec, a logger that flushes to an external sink.

Upgrade is trivial: start v2, retarget every CapId from v1 to v2, exit v1. Clients may observe a small latency spike; no DISCONNECTED CQE fires. Only the kernel primitive is needed.

Case 2: State externalized into other caps

The service’s in-memory data is a cache or dispatch table; durable state lives behind caps the service holds (Store, SessionMap, Namespace). v1’s held caps are passed to v2 at spawn time (via the supervisor, per the manifest), kernel retargets client caps, v1 exits.

Architecturally this is the idiomatic capOS pattern: services stay thin, state is factored into dedicated holders with their own caps. The Fetch/HttpEndpoint split in the service-architecture proposal already pushes in this direction. In that world, most services fall into this bucket by construction.

Case 3: Stateful services requiring migration

The service has in-memory state that matters: a JIT’s code cache, a codec’s ring buffer, a parser’s arena, session data not yet flushed. Upgrade requires v1 to hand its state to v2.

capOS’s contribution here is that the state wire format is already capnp — the same format the service uses for IPC. v1 serializes its state as a capnp message; v2 consumes it. There is no separate serialization layer to build and no opportunity for it to drift from the IPC format.

The contract extends the service’s capnp interface:

interface Upgradable {
    # Called on v1 by the supervisor. Returns a snapshot of service
    # state and stops accepting new calls. Calls already in flight
    # complete before the snapshot returns.
    quiesce @0 () -> (state :Data);

    # Called on v2 after spawn. Loads state from the snapshot. After
    # this returns, v2 is ready to serve calls.
    resume @1 (state :Data) -> ();
}

The state schema is service-defined. Schema evolution follows capnp’s standard rules: adding fields is backward-compatible, renaming requires care, removing requires a major version bump.
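As a rough illustration of the contract from the service side, the sketch below mirrors Upgradable as a Rust trait on a toy counter service. Everything here is an assumption: the trait shape is not the planned capos-rt helper, the Counter service is invented, and an 8-byte little-endian integer stands in for the capnp-encoded snapshot.

```rust
/// Hypothetical Rust-side mirror of the Upgradable contract.
trait Upgradable {
    /// Stop accepting new calls and return a snapshot of service state.
    fn quiesce(&mut self) -> Vec<u8>;
    /// Load state from a snapshot; afterwards the service accepts calls.
    fn resume(&mut self, state: &[u8]) -> Result<(), String>;
}

/// Toy stateful service: counts the calls it has served.
struct Counter {
    served: u64,
    accepting: bool,
}

impl Counter {
    /// A freshly spawned instance does not serve until resume() runs.
    fn spawn() -> Self {
        Counter { served: 0, accepting: false }
    }

    /// Serve one call, or refuse if quiesced or not yet resumed.
    fn call(&mut self) -> Result<u64, String> {
        if !self.accepting {
            return Err("not accepting calls".to_string());
        }
        self.served += 1;
        Ok(self.served)
    }
}

impl Upgradable for Counter {
    fn quiesce(&mut self) -> Vec<u8> {
        self.accepting = false;
        // Placeholder encoding: a real service would emit a capnp message.
        self.served.to_le_bytes().to_vec()
    }

    fn resume(&mut self, state: &[u8]) -> Result<(), String> {
        if state.is_empty() {
            // An empty snapshot means "start fresh".
            self.served = 0;
        } else if state.len() == 8 {
            let mut bytes = [0u8; 8];
            bytes.copy_from_slice(state);
            self.served = u64::from_le_bytes(bytes);
        } else {
            return Err("snapshot has the wrong length".to_string());
        }
        self.accepting = true;
        Ok(())
    }
}
```

In this model a v1 that has served two calls hands its count to v2, and v2's first call returns 3. The in-flight-completion half of quiesce is not modeled, since the toy service is single-threaded.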

Kernel Primitive: CapRetarget

Status: implemented (kernel/src/cap/process_control.rs). Graceful mode is proven end to end by make run-cap-retarget, and force mode by make run-cap-retarget-force. The force proof runs both modes at one identical non-quiesced state: the graceful retarget refuses and the force retarget succeeds, the in-flight caller observes the typed CAP_ERR_SERVER_DIED disconnect, and its CapId stays valid so the retry is served by the successor.

The shipped schema differs from the sketch below in two respects. The handles are passed as CapIds into the caller’s own capability table (oldHandle: UInt32), not as capnp interface references, because capOS dispatch is manual rather than capnp-rpc; the authority argument is unchanged and is in fact strengthened by it, since the kernel reads the target identity off the ProcessHandle object behind the caller’s own slot, never off the wire. And the method returns retargeted :UInt32, the number of endpoints moved, so a supervisor can tell “nothing to move” from “moved”.

Two properties the sketch left implicit, and which the implementation pins:

The move is bound to the process generation the handle was minted for, so a recycled pid can never inherit another process’s endpoints.
“new implements a schema superset of old” needs no kernel check, because new inherits the identical endpoint object and hold rather than a re-created one. The interface cannot silently narrow across a retarget.
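The generation binding in the first property can be illustrated with a small model. This is a sketch under assumptions: the ProcessHandle fields, the ProcessTable type, and the bump-on-reuse rule are hypothetical and only show why a stale handle cannot match a recycled pid.

```rust
/// Hypothetical model of a generation-tagged process handle.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ProcessHandle {
    pid: u32,
    generation: u64,
}

/// One slot per pid; the generation is bumped each time the pid is reused.
struct ProcessTable {
    generations: Vec<u64>,
}

impl ProcessTable {
    /// Mint a handle for the process currently occupying `pid`.
    fn handle_for(&self, pid: u32) -> ProcessHandle {
        ProcessHandle { pid, generation: self.generations[pid as usize] }
    }

    /// The old process exited and its pid was handed to a new process.
    fn recycle(&mut self, pid: u32) {
        self.generations[pid as usize] += 1;
    }

    /// A handle is usable only while its generation matches the live slot.
    fn is_current(&self, handle: ProcessHandle) -> bool {
        self.generations.get(handle.pid as usize) == Some(&handle.generation)
    }
}
```

A retarget that checks is_current on both handles before moving anything would reject a handle minted for a process that has since exited, even if its pid is live again.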

What is NOT implemented: Upgradable (Phase 2), quiesce/drain (Phase 3), and stateful migration (Phase 4). Phase 1 requires the supervisor to have quiesced old by other means – a graceful retarget fails closed rather than stranding an in-flight caller.

The kernel exposes the retarget as a capability method, not a syscall:

interface ProcessControl {
    # Atomically redirect every CapId currently served by `old` to
    # be served by `new`. Requires: `new` implements a schema
    # superset of `old` (schema-id compatibility), `new` is Ready,
    # `old` is Quiesced (graceful) or the caller has permission to
    # force.
    retargetCaps @0 (old :ProcessHandle, new :ProcessHandle,
                     mode :RetargetMode) -> ();
}

enum RetargetMode {
    graceful @0;  # old must be Quiesced; in-flight calls drain on old
    force    @1;  # caps redirect immediately; in-flight calls fail
}

Only a process holding a ProcessControl cap to both processes can perform this — typically the supervisor that spawned them. The kernel never initiates upgrades.

Atomicity is per-CapId. From a client’s perspective, the retarget is a single point in time: a CALL SQE submitted before retarget goes to v1; a CALL SQE submitted after goes to v2. A CALL already dispatched to v1 either completes there (graceful — target design, Phase 3; in Phase 1 graceful instead refuses to retarget at all while such a call exists, see “In-Flight Calls” below) or returns a DISCONNECTED CQE (force).
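The preconditions in the interface comment, and the difference between the two modes, can be sketched as a check-then-move routine. The model below is an assumption-laden illustration: the Kernel struct, the ProcState and RetargetError enums, and the use of vector indices in place of ProcessHandle and CapId are all hypothetical, and the "permission to force" check is not modeled.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum RetargetMode { Graceful, Force }

// Hypothetical process states; Ready means "up and able to serve".
#[derive(Clone, Copy, Debug, PartialEq)]
enum ProcState { Starting, Ready, Quiesced }

#[derive(Debug, PartialEq)]
enum RetargetError { NewNotReady, OldNotQuiesced }

/// Toy kernel: states[p] is process p's state, served_by[c] is the
/// process serving cap c.
struct Kernel {
    states: Vec<ProcState>,
    served_by: Vec<usize>,
}

impl Kernel {
    /// Check every precondition first, then move. A refusal moves nothing.
    /// On success, returns the number of caps moved.
    fn retarget_caps(
        &mut self,
        old: usize,
        new: usize,
        mode: RetargetMode,
    ) -> Result<u32, RetargetError> {
        if self.states[new] != ProcState::Ready {
            return Err(RetargetError::NewNotReady);
        }
        if mode == RetargetMode::Graceful && self.states[old] != ProcState::Quiesced {
            return Err(RetargetError::OldNotQuiesced);
        }
        let mut moved = 0;
        for server in self.served_by.iter_mut().filter(|s| **s == old) {
            *server = new;
            moved += 1;
        }
        Ok(moved)
    }
}
```

Because all checks run before the first write, a caller that gets an error can rely on the table being unchanged, and a return of 0 distinguishes "nothing to move" from a refusal.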

Supervisor-Level Upgrade Protocol

The primitives above compose into a protocol the supervisor runs:

1. spawn v2 from the new binary in the manifest
2. Case 1 & 2: v2.resume(EMPTY_STATE)
   Case 3:     state = v1.quiesce()
               v2.resume(state)
3. kernel.retargetCaps(v1, v2, graceful)
4. wait for v1 to drain (graceful mode)
5. v1.exit()

If any step up to and including the retarget fails, the supervisor rolls back: it kills v2, resumes v1 (if quiesced), and logs the failure. Because no caps have moved at that point, clients never observe the aborted attempt.
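The rollback rule can be sketched for the Case 3 path. This is a simplified model, not the supervisor's code: the Instance struct, its fail_resume test hook, the Upgrade result type, and the retarget closure standing in for the kernel call are all hypothetical, and drain is elided.

```rust
/// Outcome of one upgrade attempt, as seen by the supervisor.
#[derive(Debug, PartialEq)]
enum Upgrade {
    Committed,
    RolledBack(&'static str),
}

/// Hypothetical view of a service instance; only what rollback needs.
struct Instance {
    alive: bool,
    quiesced: bool,
    state: Vec<u8>,
    fail_resume: bool, // hook to simulate a failing resume()
}

impl Instance {
    fn quiesce(&mut self) -> Vec<u8> {
        self.quiesced = true;
        self.state.clone()
    }

    fn resume(&mut self, state: &[u8]) -> Result<(), &'static str> {
        if self.fail_resume {
            return Err("resume failed");
        }
        self.state = state.to_vec();
        self.quiesced = false;
        Ok(())
    }
}

/// Steps 2 to 5 for a stateful service. `retarget` stands in for the
/// kernel call and reports whether the move happened.
fn upgrade(
    v1: &mut Instance,
    v2: &mut Instance,
    retarget: impl FnOnce() -> bool,
) -> Upgrade {
    let snapshot = v1.quiesce();
    // The retarget is attempted only if v2 loaded the snapshot.
    let attempt = v2.resume(&snapshot).and_then(|()| {
        if retarget() { Ok(()) } else { Err("retarget refused") }
    });
    match attempt {
        Ok(()) => {
            v1.alive = false; // v1 exits after the move
            Upgrade::Committed
        }
        Err(reason) => {
            v2.alive = false; // kill v2
            let _ = v1.resume(&snapshot); // v1 was quiesced, so resume it
            Upgrade::RolledBack(reason)
        }
    }
}
```

Both failure paths leave v1 alive and serving with its original state, which is what lets clients stay unaware of the attempt.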

In-Flight Calls

The subtle case is a client that has already posted a CALL SQE to v1 when the retarget happens. Two options:

Graceful mode (target design; drain is Phase 3). v1 finishes the call, kernel routes the CQE back to the client on v1’s ring. v1 exits only after its ring is empty. This preserves call semantics; v1 and v2 coexist briefly.
Force mode. The in-flight CALL completes with the typed CAP_ERR_SERVER_DIED; the client’s CapId stays valid and a retry is served by v2. Appropriate when v1 is wedged and quiesce won’t return. Proven by make run-cap-retarget-force, which drives graceful (refuses) and force (succeeds) at the same non-quiesced state.

In graceful mode the client cannot distinguish “call landed on v1” from “call landed on v2” — which is the point. Capability identity survives the upgrade; process identity does not.

What Phase 1 actually ships is the fail-closed half of that contract, not the drain. Without quiesce (Phase 2) or drain (Phase 3) there is no way for the kernel to let v1 finish an in-flight call and still guarantee the RETURN lands: RETURN resolves through the returner’s own capability table, and after the move v1 no longer holds the cap. So graceful today refuses — it fails closed and moves nothing — if v1 holds a call in flight or has a RECV parked (a parked RECV is a call in flight one step later, and a concurrent CALL can be delivered into it without ever taking v1’s table lock). The supervisor is expected to have quiesced v1 by other means first, which is the order this proposal’s own protocol specifies. Calls merely queued on the endpoint are inherited by v2 and answered normally: no client observes a disconnect, which is the property the graceful path exists for and the one make run-cap-retarget proves.
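The Phase 1 gate described above can be modeled at the level of a single endpoint. The sketch is hypothetical throughout: the Endpoint fields, the Refused enum, and the use of plain integers for call ids and pids are inventions for illustration. How force treats queued calls and a parked RECV is an assumption of the model, not something this proposal specifies.

```rust
/// Why a graceful retarget refused.
#[derive(Debug, PartialEq)]
enum Refused { CallInFlight, RecvParked }

/// Hypothetical per-endpoint state relevant to the gate.
struct Endpoint {
    owner: u32,          // pid currently serving the endpoint
    queued: Vec<u32>,    // calls waiting on the endpoint, not yet delivered
    in_flight: Vec<u32>, // calls delivered to the owner, awaiting RETURN
    recv_parked: bool,   // owner is blocked in RECV on this endpoint
}

impl Endpoint {
    /// Graceful: fail closed if a call is in flight or a RECV is parked.
    /// On success, queued calls stay on the endpoint and are inherited.
    fn retarget_graceful(&mut self, new_owner: u32) -> Result<(), Refused> {
        if !self.in_flight.is_empty() {
            return Err(Refused::CallInFlight);
        }
        if self.recv_parked {
            return Err(Refused::RecvParked);
        }
        self.owner = new_owner;
        Ok(())
    }

    /// Force: move regardless. Returns the in-flight calls, which the
    /// caller completes with the typed server-died error.
    /// Assumption: queued calls are left in place for the new owner and
    /// the old owner's parked RECV is dropped.
    fn retarget_force(&mut self, new_owner: u32) -> Vec<u32> {
        self.owner = new_owner;
        self.recv_parked = false;
        std::mem::take(&mut self.in_flight)
    }
}
```

Driving both methods against the same endpoint state, with one call queued and one in flight, reproduces the shape of the proof: graceful refuses and changes nothing, force moves the owner and reports exactly the in-flight call as disconnected.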

Relationship to Fault Containment

Live upgrade and fault containment (driver panics → supervisor respawns) share machinery. The difference is one step of the protocol:

Fault containment: v1 has crashed; kernel has already marked it dead and epoch-bumped its caps. Supervisor spawns v2, issues a graceful retarget (no quiesce — v1 is gone; in-flight CALLs already delivered DISCONNECTED). Clients reconnect to v2.
Live upgrade: v1 is healthy; supervisor initiates quiesce → state transfer → retarget, and no CQE ever reports DISCONNECTED to any caller.

The epoch-based revocation work from Stage 6 is the foundation for both. CapRetarget is one additional primitive layered on top.

Security and Trust

Live upgrade does not expand the trust model. The supervisor already holds the authority to kill, restart, and reassign caps for services it spawned — upgrade is a refinement of that authority, not a new principal. Requirements:

Only a holder of ProcessControl caps to both old and new can call retargetCaps. By construction this is the supervisor that spawned them.
The new binary must be legitimately obtained — in practice, loaded from the same content-addressed store as everything else (ties to Content-Addressed Boot).
Schema compatibility (new is a superset of old) cannot be silently narrowed by a retarget. This needs no kernel check, because new inherits the identical endpoint object and hold rather than a re-created one: there is no re-declaration of the interface for the kernel to validate. What a retarget changes is which process serves the endpoint, never what the endpoint is. (An earlier draft of this proposal specified a kernel-side superset check; Phase 1 established it was unnecessary rather than skipped.)

Non-Goals

Code hot-patching. No binary-level function replacement. Upgrade is at the process boundary, not the symbol boundary.
Kernel live replacement. Covered by Reboot-Proof / process persistence (reboot with state preserved, not live replacement). The kernel is a single trust domain; replacing it in place needs a different design.
Automatic schema migration across incompatible changes. If v2’s state schema is not a capnp-evolution-compatible superset of v1’s, the service author writes the migration. The kernel does not.
System-wide registry of upgradable services. The supervisor knows what it spawned; there is no ambient discovery.

Phased Implementation

1. CapRetarget primitive. Kernel operation + ProcessControl cap. Useful immediately for stateless services (Case 1) and as the foundation of Fault Containment (respawn with a new process, point its caps to a fresh instance). Done – proof make run-cap-retarget shows one client CapId answered by v1, then by v2 after the move, with the call it left queued inherited rather than disconnected, and a graceful retarget refused while v1 still holds a call in flight. make run-cap-retarget-force proves force too: at the same non-quiesced state graceful refuses and force succeeds, the in-flight caller observes the typed CAP_ERR_SERVER_DIED disconnect, and its CapId stays valid so the retry is served by v2.

   One gap the proof surfaced, which Phase 2 should close: the kernel moves the owner slot into the successor’s capability table, but the successor has no way to discover it. A process cannot enumerate its own table (CapabilityManager is parent-side only; CapSet reflects only the boot-time bootstrap page), so today the supervisor must diff the successor’s table across the retarget and name the resulting CapId to it over a side channel – which is what the proof’s ctrl endpoint is. That is workable for a supervisor, which holds the successor’s CapabilityManager by construction, but it makes the successor’s serve loop depend on an out-of-band integer. The natural fix belongs with the Upgradable contract: resume should carry the retargeted slots, so the successor is told what it now serves as part of the protocol rather than inferring it.
2. Upgradable interface. Schema, contract documentation, and a Rust helper in capos-rt that services derive.
3. Graceful drain. Quiesce + in-flight call completion + v1 exit synchronization.
4. Stateful demo. A service maintaining session state, upgraded live with zero session loss. This is the Live Upgrade observable milestone.

Prior Art

Erlang/OTP code_change/3 is the closest prior art: processes upgrade their behavior module in place, with a callback to migrate state. capOS differs only in that state transport goes through capnp rather than Erlang term format, and that the process boundary is an OS process rather than a BEAM process.
Fuchsia component updates rebind component instances in the routing graph. Similar primitive in a different mechanism.
nginx -s reload is graceful restart for request/response servers. The design here generalizes it by exposing the state migration point explicitly rather than relying on “the session is the request.”

Cross-Links

service-architecture-proposal.md — owns the supervisor surface that drives this proposal’s protocol. The “Supervisors” and “Supervision Tree” sections describe the principal that holds ProcessControl caps to both old and new and runs spawn → quiesce → resume → retargetCaps → drain → exit. The “Service Taxonomy” entry Upgrade manager is the per-system orchestrator that consumes CapRetarget for live replacement, distinct from a per-subtree supervisor that uses the same primitive for fault containment (respawn after crash). Schema compatibility for new vs old is the same superset check the manifest executor and the boot package contract already require, not a new policy invented here.
cloud-deployment-proposal.md — owns the binary delivery story this proposal depends on. new must be obtained from the same content-addressed boot package / image-update pipeline the cloud deployment plan describes, not from an ad-hoc path. Cloud-managed services (KMS clients, metadata agents, log/metric shippers, the cloud-metadata agent itself) are exactly the Case 2 / Case 3 services where this proposal’s value shows up first: they hold long-lived caps to upstream cloud APIs, and a restart that drops those caps either re-runs IAM/JWT handshakes or, worse, drops audit/log shippers’ in-flight buffers. The bootable disk image / NVMe path defines what “update the binary” means on real hardware; until then the manifest-embedded BootPackage blobs are the only source of new.
storage-and-naming-proposal.md — owns the Case 2 holders (Store, SessionMap, Namespace) the idiomatic service factoring relies on, and the future sealed/stored capability path that lets state survive across reboot, not just across live upgrade. Case 3 state-transfer is the strictly weaker contract: same capnp wire format, but the snapshot only has to outlive a single retargetCaps call, not power loss.
system-monitoring-proposal.md — quiesce start, resume completion, retargetCaps mode (graceful vs force), drain duration, and rollback (kill new, resume old) are audit-worthy lifecycle events. The upgrade manager emits them through the audit cap so an operator can correlate a service binary change with downstream behavior. Graceful upgrades by definition emit zero DISCONNECTED CQEs; force-mode and fault-containment respawns do, and that distinction is what the audit record has to preserve.
security-and-verification-proposal.md — retargetCaps is a natural target for bounded modeling: per-CapId atomicity (no SQE submitted before retarget lands on new; no SQE submitted after lands on old) and, once Phase 3 lands, graceful-mode in-flight completion (old’s ring drains before exit). Phase 1’s fail-closed gate is the narrower property worth modeling first: that no call can be delivered to old between the quiesce check and the owner-slot move. Force-mode DISCONNECTED delivery is the same epoch-revocation path the fault-containment story already needs, not a separate kernel surface.
../design-risks-register.md — the register carries no dedicated R-entry for live upgrade. That was intentional while no implementation existed; Phase 1 has since landed, so the reasoning now rests on scope rather than absence. What landed is a fail-closed primitive: it refuses rather than stranding, moves nothing on any failure, and is proven by make run-cap-retarget. The long-horizon risks the register would track — graceful drain outliving the per-process release path, and the bounded model of retarget atomicity — attach to Phases 2-4, which are still design. The closest cross-cutting entries remain R6 (CAP_OP_RELEASE is deferred), because graceful drain has to outlive the per-process release path before v1.exit() is safe; R12 (verification coverage is partial), because the per-CapId retarget atomicity and graceful-drain invariants belong in a bounded model; and Q7 (revocation strategy), because force-mode retarget shares the epoch path the open revocation decision will pick. Open a dedicated R-entry when Phase 3 makes graceful-drain shutdown real, at which point drain and the supervisor-only authority constraint become long-horizon design surfaces in their own right.
