Proposal: Live Upgrade
Replacing a running service with a new binary, without dropping outstanding
capability references or losing in-flight work. The kernel-side primitive
(CapRetarget) is owned by this proposal; the surrounding orchestration
(supervisors, manifest sources, fault containment) is owned by
service-architecture-proposal.md and consumes the primitive defined here.
Problem
In a Linux-like system, “upgrading a service” is one of:
- Restart: stop the old process, start the new one. Clients holding file descriptors, sockets, or pipes to the old process receive ECONNRESET or EPIPE and must reconnect. Session state is lost unless clients serialize it themselves.
- Graceful restart (nginx -s reload, unicorn, systemd socket activation): new process starts alongside old, inherits the listening socket, old drains in-flight requests. Works only for request/response protocols where the session is the request. Does nothing for stateful sessions.
- Live patch (kpatch, ksplice): binary-level function replacement. Narrow, fragile, no schema for state layout changes.
None of these compose with a capability OS. A CapId held by a client
points at a specific process; if that process exits, the cap is dead.
There is no “the service” abstraction the kernel could re-bind — the
point of capabilities is that they identify a specific reference, not
a name that could be redirected after the fact.
But capOS has a kernel-side primitive the Linux model lacks: the kernel
already owns the authoritative table of every CapId and which process
serves it. Rewriting “cap X is served by process v1” → “cap X is served
by process v2” is a table update. The question is when it is safe, and
how v2 inherits enough state to answer the next call.
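To make the "table update" concrete, here is a minimal Rust sketch of such a table. The names CapTable, CapId, ProcessId, bind, server, and retarget are hypothetical stand-ins chosen for illustration, not capOS kernel types, and the sketch deliberately omits the safety checks that decide when the update is allowed.

```rust
use std::collections::HashMap;

/// Hypothetical stand-ins for kernel identifiers (assumed, not capOS's real types).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct CapId(pub u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ProcessId(pub u32);

/// Toy model of the authoritative "which process serves this cap" table.
#[derive(Default)]
pub struct CapTable {
    server_of: HashMap<CapId, ProcessId>,
}

impl CapTable {
    /// Record that `cap` is served by `server`.
    pub fn bind(&mut self, cap: CapId, server: ProcessId) {
        self.server_of.insert(cap, server);
    }

    /// Look up the process currently serving `cap`.
    pub fn server(&self, cap: CapId) -> Option<ProcessId> {
        self.server_of.get(&cap).copied()
    }

    /// Rewrite every entry served by `old` so it is served by `new`.
    /// Returns how many caps were redirected. No safety checks here.
    pub fn retarget(&mut self, old: ProcessId, new: ProcessId) -> usize {
        let mut moved = 0;
        for server in self.server_of.values_mut() {
            if *server == old {
                *server = new;
                moved += 1;
            }
        }
        moved
    }
}
```

Clients keep the same CapId values throughout; only the right-hand side of the mapping changes, and caps served by other processes are untouched.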
Three Cases
Live upgrade has three distinct cost profiles. The right design is to make each one explicit rather than pretend the hard case doesn’t exist.
Case 1: Stateless services
Each SQE is independent; the service holds no state that matters across calls. A request router, a pure codec, a logger that flushes to an external sink.
Upgrade is trivial: start v2, retarget every CapId from v1 to v2,
exit v1. Clients may observe a small latency spike; no DISCONNECTED
CQE fires. Only the kernel primitive is needed.
Case 2: State externalized into other caps
The service’s in-memory data is a cache or dispatch table; durable state
lives behind caps the service holds (Store, SessionMap, Namespace).
Upgrade passes v1’s held caps to v2 at spawn time (via the supervisor, per
the manifest); the kernel then retargets client caps and v1 exits.
Architecturally this is the idiomatic capOS pattern: services stay thin, state is factored into dedicated holders with their own caps. The Fetch/HttpEndpoint split in the service-architecture proposal already pushes in this direction. In that world, most services fall into this bucket by construction.
Case 3: Stateful services requiring migration
The service has in-memory state that matters: a JIT’s code cache, a codec’s ring buffer, a parser’s arena, session data not yet flushed. Upgrade requires v1 to hand its state to v2.
capOS’s contribution here is that the state wire format is already capnp — the same format the service uses for IPC. v1 serializes its state as a capnp message; v2 consumes it. There is no separate serialization layer to build and no opportunity for it to drift from the IPC format.
The contract extends the service’s capnp interface:
interface Upgradable {
  # Called on v1 by the supervisor. Returns a snapshot of service
  # state and stops accepting new calls. Calls already in flight
  # complete before the snapshot returns.
  quiesce @0 () -> (state :Data);

  # Called on v2 after spawn. Loads state from the snapshot. After
  # this returns, v2 is ready to serve calls.
  resume @1 (state :Data) -> ();
}
The state schema is service-defined. Schema evolution follows capnp’s standard rules: adding fields is backward-compatible, renaming requires care, removing requires a major version bump.
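A service-side view of this contract might look like the following Rust sketch. The Upgradable trait here is a hand-written mirror of the capnp interface, and Counter is a made-up service; neither is the capos-rt helper. The snapshot is encoded as eight little-endian bytes purely as a stand-in: per the text above, a real service would emit a capnp message.

```rust
/// Hypothetical Rust mirror of the `Upgradable` capnp interface.
pub trait Upgradable {
    /// Stop accepting new calls and return a snapshot of service state.
    fn quiesce(&mut self) -> Vec<u8>;
    /// Load state from a snapshot; an empty snapshot means "start fresh".
    fn resume(&mut self, state: &[u8]) -> Result<(), String>;
}

/// Toy stateful service (assumed for illustration): counts the calls it has served.
#[derive(Default)]
pub struct Counter {
    hits: u64,
    accepting: bool,
}

impl Counter {
    /// Serve one call; refused until resumed and after being quiesced.
    pub fn call(&mut self) -> Result<u64, &'static str> {
        if !self.accepting {
            return Err("not accepting calls");
        }
        self.hits += 1;
        Ok(self.hits)
    }
}

impl Upgradable for Counter {
    fn quiesce(&mut self) -> Vec<u8> {
        self.accepting = false;
        // Stand-in encoding; a real service would serialize a capnp message.
        self.hits.to_le_bytes().to_vec()
    }

    fn resume(&mut self, state: &[u8]) -> Result<(), String> {
        self.hits = match state.len() {
            0 => 0,
            8 => {
                let mut buf = [0u8; 8];
                buf.copy_from_slice(state);
                u64::from_le_bytes(buf)
            }
            n => return Err(format!("unexpected snapshot length {}", n)),
        };
        self.accepting = true;
        Ok(())
    }
}
```

In this sketch a freshly constructed service serves nothing until resume returns, which matches the contract that v2 is ready only after resume completes.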
Kernel Primitive: CapRetarget
The kernel exposes the retarget as a capability method, not a syscall:
interface ProcessControl {
  # Atomically redirect every CapId currently served by `old` to
  # be served by `new`. Requires: `new` implements a schema
  # superset of `old` (schema-id compatibility), `new` is Ready,
  # `old` is Quiesced (graceful) or the caller has permission to
  # force.
  retargetCaps @0 (old :ProcessHandle, new :ProcessHandle,
                   mode :RetargetMode) -> ();
}

enum RetargetMode {
  graceful @0;  # old must be Quiesced; in-flight calls drain on old
  force @1;     # caps redirect immediately; in-flight calls fail
}
Only a process holding a ProcessControl cap to both processes can
perform this — typically the supervisor that spawned them. The kernel
never initiates upgrades.
Atomicity is per-CapId. From a client’s perspective, the retarget is a
single point in time: a CALL SQE submitted before retarget goes to v1;
a CALL SQE submitted after goes to v2. A CALL already dispatched to v1
either completes there (graceful) or returns a DISCONNECTED CQE
(force).
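The per-CapId behaviour described above can be modeled in a few lines of Rust. This is a sketch under assumptions: CapEntry, ProcState, Outcome, RetargetError, and the may_force flag are invented for illustration, process handles are plain integers, and the schema-superset precondition is left out to keep the example short.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RetargetMode { Graceful, Force }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ProcState { Ready, Quiesced }

/// What happens to a CALL that was already dispatched when the retarget lands.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Outcome { CompletedOn(u32), Disconnected }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RetargetError { NewNotReady, OldNotQuiesced, ForceNotPermitted }

/// Toy model of one CapId's kernel-side entry (hypothetical layout).
pub struct CapEntry {
    pub server: u32,      // process that newly submitted CALL SQEs go to
    pub in_flight: usize, // CALLs dispatched to `server` and not yet completed
}

impl CapEntry {
    /// Submit a CALL; returns the process it was dispatched to.
    pub fn submit(&mut self) -> u32 {
        self.in_flight += 1;
        self.server
    }

    /// Redirect this cap to `new`, returning the fate of each in-flight CALL.
    /// Nothing is mutated unless all preconditions hold.
    pub fn retarget(
        &mut self,
        new: u32,
        new_state: ProcState,
        old_state: ProcState,
        mode: RetargetMode,
        may_force: bool,
    ) -> Result<Vec<Outcome>, RetargetError> {
        if new_state != ProcState::Ready {
            return Err(RetargetError::NewNotReady);
        }
        let outcome = match mode {
            RetargetMode::Graceful if old_state != ProcState::Quiesced => {
                return Err(RetargetError::OldNotQuiesced)
            }
            RetargetMode::Graceful => Outcome::CompletedOn(self.server),
            RetargetMode::Force if !may_force => {
                return Err(RetargetError::ForceNotPermitted)
            }
            RetargetMode::Force => Outcome::Disconnected,
        };
        let settled = vec![outcome; self.in_flight];
        self.server = new;
        self.in_flight = 0;
        Ok(settled)
    }
}
```

A CALL submitted before the retarget is attributed to the old server, and one submitted after goes to the new server; the mode only decides what happens to the calls caught in between.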
Supervisor-Level Upgrade Protocol
The primitives above compose into a protocol the supervisor runs:
1. spawn v2 from the new binary in the manifest
2. Case 1 & 2: v2.resume(EMPTY_STATE)
   Case 3:     state = v1.quiesce()
               v2.resume(state)
3. kernel.retargetCaps(v1, v2, graceful)
4. wait for v1 to drain (graceful mode)
5. v1.exit()
If any step up to and including the retarget call fails, the supervisor rolls back: kill v2, resume v1 (if quiesced), log the failure. Because the retarget hasn’t happened yet, clients never observe the aborted attempt.
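The protocol and its rollback path can be sketched as one Rust function. Everything here is an assumption made for illustration: the Service trait, the Fake recorder, the UpgradeError type, and passing the retarget as a closure. In particular, "resume v1" is modeled as feeding v1 its own snapshot back through resume, which the proposal does not specify.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum UpgradeError { Quiesce(String), Resume(String), Retarget(String) }

/// Hypothetical supervisor-side handle on a spawned service.
pub trait Service {
    fn quiesce(&mut self) -> Result<Vec<u8>, String>;
    fn resume(&mut self, state: &[u8]) -> Result<(), String>;
    fn kill(&mut self);
    fn drain_and_exit(&mut self);
}

/// Steps 2 to 5 of the protocol; v2 is assumed to be spawned already (step 1).
pub fn upgrade<S: Service>(
    v1: &mut S,
    v2: &mut S,
    stateful: bool, // true for Case 3, false for Cases 1 and 2
    retarget: impl FnOnce() -> Result<(), String>,
) -> Result<(), UpgradeError> {
    // Step 2: obtain state (Case 3) or use the empty state (Cases 1 and 2).
    let state = if stateful {
        match v1.quiesce() {
            Ok(s) => s,
            Err(e) => {
                v2.kill();
                return Err(UpgradeError::Quiesce(e));
            }
        }
    } else {
        Vec::new()
    };
    if let Err(e) = v2.resume(&state) {
        v2.kill();
        if stateful {
            let _ = v1.resume(&state); // assumed way to un-quiesce v1
        }
        return Err(UpgradeError::Resume(e));
    }
    // Step 3: the kernel retarget. Before this point clients see nothing.
    if let Err(e) = retarget() {
        v2.kill();
        if stateful {
            let _ = v1.resume(&state);
        }
        return Err(UpgradeError::Retarget(e));
    }
    // Steps 4 and 5: v1 drains in-flight calls, then exits.
    v1.drain_and_exit();
    Ok(())
}

/// Recording fake used to exercise the protocol.
#[derive(Default)]
pub struct Fake {
    pub log: Vec<&'static str>,
    pub fail_resume: bool,
}

impl Service for Fake {
    fn quiesce(&mut self) -> Result<Vec<u8>, String> {
        self.log.push("quiesce");
        Ok(vec![7])
    }
    fn resume(&mut self, _state: &[u8]) -> Result<(), String> {
        self.log.push("resume");
        if self.fail_resume { Err("resume failed".to_string()) } else { Ok(()) }
    }
    fn kill(&mut self) { self.log.push("kill"); }
    fn drain_and_exit(&mut self) { self.log.push("drain+exit"); }
}
```

Every early return happens before the retarget takes effect, so the sketch preserves the property that clients never observe an aborted attempt.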
In-Flight Calls
The subtle case is a client that has already posted a CALL SQE to v1 when the retarget happens. Two options:
- Graceful mode. v1 finishes the call, kernel routes the CQE back to the client on v1’s ring. v1 exits only after its ring is empty. This preserves call semantics; v1 and v2 coexist briefly.
- Force mode. The in-flight CALL returns DISCONNECTED. Client retries against v2. Appropriate when v1 is wedged and quiesce won’t return.
In graceful mode the client cannot distinguish “call landed on v1” from “call landed on v2” — which is the point. Capability identity survives the upgrade; process identity does not.
Relationship to Fault Containment
Live upgrade and fault containment (driver panics → supervisor respawns) share machinery. The difference is one step of the protocol:
- Fault containment: v1 has crashed; kernel has already marked it dead and epoch-bumped its caps. Supervisor spawns v2, issues a graceful retarget (no quiesce, since v1 is gone; in-flight CALLs already delivered DISCONNECTED). Clients reconnect to v2.
- Live upgrade: v1 is healthy; supervisor initiates quiesce → state transfer → retarget, and no CQE ever reports DISCONNECTED to any caller.
The epoch-based revocation work from Stage 6 is the foundation for both. CapRetarget is one additional primitive layered on top.
Security and Trust
Live upgrade does not expand the trust model. The supervisor already holds the authority to kill, restart, and reassign caps for services it spawned — upgrade is a refinement of that authority, not a new principal. Requirements:
- Only a holder of ProcessControl caps to both old and new can call retargetCaps. By construction this is the supervisor that spawned them.
- The new binary must be legitimately obtained: in practice, loaded from the same content-addressed store as everything else (ties to Content-Addressed Boot).
- Schema compatibility (new is a superset of old) is checked by the kernel before retarget. This prevents an upgrade from silently narrowing the interface clients depend on.
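The authority and schema requirements above can be illustrated with a small Rust check. This is only a sketch: the proposal does not say how schema compatibility is represented, so modeling a schema as a set of (interface id, method ordinal) pairs is an assumption, as are the names Schema, Denied, and check_retarget and the use of integers for process handles.

```rust
use std::collections::BTreeSet;

/// Assumed model: a schema is the set of (interface id, method ordinal) pairs served.
pub type Schema = BTreeSet<(u64, u16)>;

#[derive(Debug, PartialEq, Eq)]
pub enum Denied { MissingProcessControl, SchemaNarrowed }

/// Check the two kernel-enforced requirements before a retarget.
/// `held` models the processes the caller holds ProcessControl caps for.
pub fn check_retarget(
    held: &BTreeSet<u32>,
    old: u32,
    new: u32,
    old_schema: &Schema,
    new_schema: &Schema,
) -> Result<(), Denied> {
    // Authority: the caller must control both ends of the retarget.
    if !(held.contains(&old) && held.contains(&new)) {
        return Err(Denied::MissingProcessControl);
    }
    // Compatibility: everything `old` served must still be served by `new`.
    if !old_schema.is_subset(new_schema) {
        return Err(Denied::SchemaNarrowed);
    }
    Ok(())
}
```

Adding a method to the new binary passes the check, while retargeting in the opposite direction (from the wider schema to the narrower one) is rejected, which is the "silent narrowing" case the requirement guards against.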
Non-Goals
- Code hot-patching. No binary-level function replacement. Upgrade is at the process boundary, not the symbol boundary.
- Kernel live replacement. Covered by Reboot-Proof / process persistence (reboot with state preserved, not live replacement). The kernel is a single trust domain; replacing it in place needs a different design.
- Automatic schema migration across incompatible changes. If v2’s state schema is not a capnp-evolution-compatible superset of v1’s, the service author writes the migration. The kernel does not.
- System-wide registry of upgradable services. The supervisor knows what it spawned; there is no ambient discovery.
Phased Implementation
- CapRetarget primitive. Kernel operation + ProcessControl cap. Useful immediately for stateless services (Case 1) and as the foundation of Fault Containment (respawn with a new process, point its caps to a fresh instance).
- Upgradable interface. Schema, contract documentation, and a Rust helper in capos-rt that services derive.
- Graceful drain. Quiesce + in-flight call completion + v1 exit synchronization.
- Stateful demo. A service maintaining session state, upgraded live with zero session loss. This is the Live Upgrade observable milestone.
Related Work
- Erlang/OTP code_change/3 is the closest prior art: processes upgrade their behavior module in place, with a callback to migrate state. capOS differs only in that state transport goes through capnp rather than Erlang term format, and that the process boundary is an OS process rather than a BEAM process.
- Fuchsia component updates rebind component instances in the routing graph. Similar primitive in a different mechanism.
- nginx -s reload is graceful restart for request/response servers. The design here generalizes it by exposing the state migration point explicitly rather than relying on “the session is the request.”
Cross-Links
- service-architecture-proposal.md: owns the supervisor surface that drives this proposal’s protocol. The “Supervisors” and “Supervision Tree” sections describe the principal that holds ProcessControl caps to both old and new and runs spawn → quiesce → resume → retargetCaps → drain → exit. The “Service Taxonomy” entry Upgrade manager is the per-system orchestrator that consumes CapRetarget for live replacement, distinct from a per-subtree supervisor that uses the same primitive for fault containment (respawn after crash). Schema compatibility for new vs old is the same superset check the manifest executor and the boot package contract already require, not a new policy invented here.
- cloud-deployment-proposal.md: owns the binary delivery story this proposal depends on. new must be obtained from the same content-addressed boot package / image-update pipeline the cloud deployment plan describes, not from an ad-hoc path. Cloud-managed services (KMS clients, metadata agents, log/metric shippers, the cloud-metadata agent itself) are exactly the Case 2 / Case 3 services where this proposal’s value shows up first: they hold long-lived caps to upstream cloud APIs, and a restart that drops those caps either re-runs IAM/JWT handshakes or, worse, drops audit/log shippers’ in-flight buffers. The bootable disk image / NVMe path defines what “update the binary” means on real hardware; until then the manifest-embedded BootPackage blobs are the only source of new.
- storage-and-naming-proposal.md: owns the Case 2 holders (Store, SessionMap, Namespace) the idiomatic service factoring relies on, and the future sealed/stored capability path that lets state survive across reboot, not just across live upgrade. Case 3 state-transfer is the strictly weaker contract: same capnp wire format, but the snapshot only has to outlive a single retargetCaps call, not power loss.
- system-monitoring-proposal.md: quiesce start, resume completion, retargetCaps mode (graceful vs force), drain duration, and rollback (kill new, resume old) are audit-worthy lifecycle events. The upgrade manager emits them through the audit cap so an operator can correlate a service binary change with downstream behavior. Graceful upgrades by definition emit zero DISCONNECTED CQEs; force-mode and fault-containment respawns do, and that distinction is what the audit record has to preserve.
- security-and-verification-proposal.md: retargetCaps is a natural target for bounded modeling: per-CapId atomicity (no SQE submitted before retarget lands on new; no SQE submitted after lands on old), graceful-mode in-flight completion (old’s ring drains before exit), and schema-superset enforcement at the kernel before retarget. Force-mode DISCONNECTED delivery is the same epoch-revocation path the fault-containment story already needs, not a separate kernel surface.
- ../design-risks-register.md: the register currently carries no dedicated R-entry for live upgrade, which is intentional: no implementation exists yet. The closest cross-cutting entries are R6 (CAP_OP_RELEASE is deferred), because graceful drain has to outlive the per-process release path before v1.exit() is safe; R12 (verification coverage is partial), because the per-CapId retarget atomicity and graceful-drain invariants belong in a bounded model before this lands; and Q7 (revocation strategy), because force-mode retarget shares the epoch path the open revocation decision will pick. Open a dedicated R-entry once CapRetarget lands in code, since at that point retarget atomicity, graceful-drain shutdown, and the supervisor-only authority constraint become long-horizon design surfaces in their own right.