# Proposal: Live Upgrade

Replacing a running service with a new binary, without dropping outstanding
capability references or losing in-flight work. The kernel-side primitive
(`CapRetarget`) is owned by this proposal; the surrounding orchestration
(supervisors, manifest sources, fault containment) is owned by
`service-architecture-proposal.md` and consumes the primitive defined here.


## Problem

In a Linux-like system, "upgrading a service" is one of:

- *Restart:* stop the old process, start the new one. Clients holding
  file descriptors, sockets, or pipes to the old process receive
  `ECONNRESET` or `EPIPE` and must reconnect. Session state is lost
  unless clients serialize it themselves.
- *Graceful restart* (nginx `-s reload`, unicorn, systemd socket
  activation): new process starts alongside old, inherits the listening
  socket, old drains in-flight requests. Works only for request/response
  protocols where the session *is* the request. Does nothing for stateful
  sessions.
- *Live patch* (kpatch, ksplice): binary-level function replacement.
  Narrow, fragile, no schema for state layout changes.

None of these compose with a capability OS. A `CapId` held by a client
points at a specific process; if that process exits, the cap is dead.
There is no "the service" abstraction the kernel could re-bind — the
point of capabilities is that they identify a specific reference, not
a name that could be redirected after the fact.

But capOS has a kernel-side primitive the Linux model lacks: the kernel
already owns the authoritative table of every `CapId` and which process
serves it. Rewriting "cap X is served by process v1" → "cap X is served
by process v2" is a table update. The question is when it is safe, and
how v2 inherits enough state to answer the next call.
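
As a rough illustration of why the retarget is "just" a table update, the sketch below models the kernel's table as a map from cap to serving process. `CapId`, `ProcessId`, and `CapTable` are hypothetical stand-ins for this example; the real kernel structures are not specified in this proposal.

```rust
use std::collections::HashMap;

// Hypothetical identifiers for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct CapId(pub u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ProcessId(pub u32);

/// Toy model of the authoritative "cap X is served by process P" table.
#[derive(Default)]
pub struct CapTable {
    served_by: HashMap<CapId, ProcessId>,
}

impl CapTable {
    pub fn insert(&mut self, cap: CapId, server: ProcessId) {
        self.served_by.insert(cap, server);
    }

    pub fn server_of(&self, cap: CapId) -> Option<ProcessId> {
        self.served_by.get(&cap).copied()
    }

    /// Point every cap served by `old` at `new` and return how many
    /// entries changed. Caps served by other processes are untouched.
    pub fn retarget(&mut self, old: ProcessId, new: ProcessId) -> usize {
        let mut moved = 0;
        for server in self.served_by.values_mut() {
            if *server == old {
                *server = new;
                moved += 1;
            }
        }
        moved
    }
}
```

The client-held `CapId` values never change in this model; only the serving side of each entry does. The sketch says nothing about when the update is safe, which is the subject of the rest of this proposal.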

## Three Cases

Live upgrade has three distinct cost profiles. The right design is to
make each one explicit rather than pretend the hard case doesn't exist.

### Case 1: Stateless services

Each SQE is independent; the service holds no state that matters across
calls. A request router, a pure codec, a logger that flushes to an
external sink.

Upgrade is trivial: start v2, retarget every `CapId` from v1 to v2,
exit v1. Clients may observe a small latency spike; no `DISCONNECTED`
CQE fires. Only the kernel primitive is needed.

### Case 2: State externalized into other caps

The service's in-memory data is a cache or dispatch table; durable state
lives behind caps the service holds (`Store`, `SessionMap`, `Namespace`).
v1's held caps are passed to v2 at spawn time (via the supervisor, per
the manifest), kernel retargets client caps, v1 exits.

Architecturally this is the idiomatic capOS pattern: services stay thin,
state is factored into dedicated holders with their own caps. The
Fetch/HttpEndpoint split in the service-architecture proposal already
pushes in this direction. In that world, most services fall into this
bucket by construction.

### Case 3: Stateful services requiring migration

The service has in-memory state that matters: a JIT's code cache, a
codec's ring buffer, a parser's arena, session data not yet flushed.
Upgrade requires v1 to hand its state to v2.

capOS's contribution here is that *the state wire format is already
capnp* — the same format the service uses for IPC. v1 serializes its
state as a capnp message; v2 consumes it. There is no separate
serialization layer to build and no opportunity for it to drift from
the IPC format.

The contract extends the service's capnp interface:

```capnp
interface Upgradable {
    # Called on v1 by the supervisor. Returns a snapshot of service
    # state and stops accepting new calls. Calls already in flight
    # complete before the snapshot returns.
    quiesce @0 () -> (state :Data);

    # Called on v2 after spawn. Loads state from the snapshot. After
    # this returns, v2 is ready to serve calls.
    resume @1 (state :Data) -> ();
}
```

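The state schema is service-defined. Schema evolution follows capnp's
standard rules: adding fields is backward-compatible; renaming a field is
wire-compatible (fields are identified by ordinal, not by name) but breaks
source that uses the old name; removing a field requires a major version
bump.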
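
A minimal sketch of what the contract could look like from a service author's side. The `Upgradable` trait below is a hypothetical Rust mirror of the capnp interface, not the `capos-rt` helper, and `Counter` is an invented example service. The byte encoding is a deliberate stand-in (little-endian `u64`s) so the example stays dependency-free; a real service would emit a capnp message here.

```rust
/// Hypothetical Rust mirror of the `Upgradable` capnp interface.
pub trait Upgradable {
    /// Stop accepting calls and return a snapshot of service state.
    fn quiesce(&mut self) -> Vec<u8>;
    /// Load a snapshot. An empty snapshot means "start fresh".
    fn resume(&mut self, state: &[u8]) -> Result<(), String>;
}

/// Invented example service: counts calls per session slot.
#[derive(Default)]
pub struct Counter {
    pub accepting: bool,
    pub hits: Vec<u64>,
}

impl Counter {
    pub fn call(&mut self, slot: usize) -> Result<u64, &'static str> {
        if !self.accepting {
            return Err("quiesced");
        }
        if slot >= self.hits.len() {
            self.hits.resize(slot + 1, 0);
        }
        self.hits[slot] += 1;
        Ok(self.hits[slot])
    }
}

impl Upgradable for Counter {
    fn quiesce(&mut self) -> Vec<u8> {
        self.accepting = false;
        // Stand-in encoding, NOT capnp: one little-endian u64 per slot.
        self.hits.iter().flat_map(|h| h.to_le_bytes()).collect()
    }

    fn resume(&mut self, state: &[u8]) -> Result<(), String> {
        if state.len() % 8 != 0 {
            return Err(format!("snapshot length {} is not a multiple of 8", state.len()));
        }
        self.hits = state
            .chunks_exact(8)
            .map(|chunk| {
                let mut buf = [0u8; 8];
                buf.copy_from_slice(chunk);
                u64::from_le_bytes(buf)
            })
            .collect();
        self.accepting = true;
        Ok(())
    }
}
```

In this sketch v1 refuses calls once `quiesce` has run, and v2 continues the per-slot counts from wherever v1 left off after `resume`.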

## Kernel Primitive: CapRetarget

The kernel exposes the retarget as a capability method, not a syscall:

```capnp
interface ProcessControl {
    # Atomically redirect every CapId currently served by `old` to
    # be served by `new`. Requires: `new` implements a schema
    # superset of `old` (schema-id compatibility), `new` is Ready,
    # `old` is Quiesced (graceful) or the caller has permission to
    # force.
    retargetCaps @0 (old :ProcessHandle, new :ProcessHandle,
                     mode :RetargetMode) -> ();
}

enum RetargetMode {
    graceful @0;  # old must be Quiesced; in-flight calls drain on old
    force    @1;  # caps redirect immediately; in-flight calls fail
}
```

Only a process holding a `ProcessControl` cap to both processes can
perform this — typically the supervisor that spawned them. The kernel
never initiates upgrades.

Atomicity is per-CapId. From a client's perspective, the retarget is a
single point in time: a CALL SQE submitted before retarget goes to v1;
a CALL SQE submitted after goes to v2. A CALL already dispatched to v1
either completes there (graceful) or returns a `DISCONNECTED` CQE
(force).
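
The state preconditions in the `retargetCaps` comment can be written down as a small check. This is a sketch under assumptions: `Ready` and `Quiesced` come from the schema comment, while `Starting`, `Dead`, the error names, and the `caller_may_force` flag are hypothetical. The schema-superset requirement is left out of this example.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ProcState {
    Starting, // hypothetical: spawned, `resume` not yet returned
    Ready,
    Quiesced,
    Dead, // hypothetical: crashed or exited
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RetargetMode {
    Graceful,
    Force,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RetargetError {
    NewNotReady,
    OldNotQuiesced,
    ForceNotPermitted,
}

/// Check the process-state preconditions for a retarget.
pub fn check_retarget(
    old: ProcState,
    new: ProcState,
    mode: RetargetMode,
    caller_may_force: bool,
) -> Result<(), RetargetError> {
    // `new` must be Ready in either mode.
    if new != ProcState::Ready {
        return Err(RetargetError::NewNotReady);
    }
    match mode {
        // Graceful requires that `old` has already quiesced.
        RetargetMode::Graceful if old != ProcState::Quiesced => {
            Err(RetargetError::OldNotQuiesced)
        }
        // Force ignores `old`'s state but needs explicit permission.
        RetargetMode::Force if !caller_may_force => Err(RetargetError::ForceNotPermitted),
        _ => Ok(()),
    }
}
```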

## Supervisor-Level Upgrade Protocol

The primitives above compose into a protocol the supervisor runs:

```
1. spawn v2 from the new binary in the manifest
2. Case 1 & 2: v2.resume(EMPTY_STATE)
   Case 3:     state = v1.quiesce()
               v2.resume(state)
3. kernel.retargetCaps(v1, v2, graceful)
4. wait for v1 to drain (graceful mode)
5. v1.exit()
```

If any step up to and including the `retargetCaps` call fails, the
supervisor rolls back: kill v2, resume v1 (if quiesced), log the failure.
Because the retarget hasn't happened yet, clients never observe the
aborted attempt.
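
The protocol and its rollback rule can be sketched as a single function over mock instances. Everything named here is hypothetical: the `Instance` trait, the `Mock` type, and the `retarget` closure standing in for the kernel call. The sketch also assumes v1 is un-quiesced by handing its own snapshot back through `resume`, which this proposal does not specify. Drain and exit (steps 4 and 5) are omitted.

```rust
/// Hypothetical supervisor-side view of one service instance.
pub trait Instance {
    fn quiesce(&mut self) -> Result<Vec<u8>, String>;
    fn resume(&mut self, state: &[u8]) -> Result<(), String>;
    fn kill(&mut self);
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Upgraded,
    RolledBack(String),
}

pub fn upgrade<A: Instance, B: Instance>(
    v1: &mut A,
    v2: &mut B,
    stateful: bool,
    retarget: impl FnOnce() -> Result<(), String>,
) -> Outcome {
    // Step 2a: take a snapshot (Case 3) or use an empty one (Cases 1 and 2).
    let mut quiesced = false;
    let state = if stateful {
        match v1.quiesce() {
            Ok(s) => {
                quiesced = true;
                s
            }
            Err(e) => {
                v2.kill();
                return Outcome::RolledBack(e);
            }
        }
    } else {
        Vec::new()
    };
    // Step 2b then step 3; `and_then` skips the retarget if resume fails.
    match v2.resume(&state).and_then(|()| retarget()) {
        Ok(()) => Outcome::Upgraded,
        Err(e) => {
            v2.kill();
            if quiesced {
                // Assumption: v1 resumes from its own snapshot.
                let _ = v1.resume(&state);
            }
            Outcome::RolledBack(e)
        }
    }
}

/// Mock instance used to exercise the protocol.
pub struct Mock {
    pub serving: bool,
    pub alive: bool,
    pub data: Vec<u8>,
    pub fail_resume: bool,
}

impl Mock {
    pub fn new(serving: bool, data: &[u8]) -> Self {
        Mock { serving, alive: true, data: data.to_vec(), fail_resume: false }
    }
}

impl Instance for Mock {
    fn quiesce(&mut self) -> Result<Vec<u8>, String> {
        self.serving = false;
        Ok(self.data.clone())
    }
    fn resume(&mut self, state: &[u8]) -> Result<(), String> {
        if self.fail_resume {
            return Err("resume failed".to_string());
        }
        self.data = state.to_vec();
        self.serving = true;
        Ok(())
    }
    fn kill(&mut self) {
        self.alive = false;
        self.serving = false;
    }
}
```

In this model a failed `resume` on v2 never reaches the retarget closure, so the cap table is untouched and v1 goes back to serving.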

## In-Flight Calls

The subtle case is a client that has already posted a CALL SQE to v1
when the retarget happens. Two options:

- **Graceful mode.** v1 finishes the call, kernel routes the CQE back
  to the client on v1's ring. v1 exits only after its ring is empty.
  This preserves call semantics; v1 and v2 coexist briefly.
- **Force mode.** The in-flight CALL returns `DISCONNECTED`. Client
  retries against v2. Appropriate when v1 is wedged and `quiesce`
  won't return.

In graceful mode the client cannot distinguish "call landed on v1" from
"call landed on v2" — which is the point. Capability identity survives
the upgrade; process identity does not.
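
The two modes differ only in what happens to calls already dispatched to v1. A toy model, with hypothetical names (`OldInstance`, `Cqe`, integer call ids) and no claim about the real ring layout:

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RetargetMode {
    Graceful,
    Force,
}

/// Hypothetical completion kinds seen by the client.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Cqe {
    Completed,
    Disconnected,
}

/// v1's view of calls dispatched to it before the retarget.
pub struct OldInstance {
    in_flight: Vec<u32>, // call ids, hypothetical representation
}

impl OldInstance {
    pub fn new(in_flight: Vec<u32>) -> Self {
        OldInstance { in_flight }
    }

    /// Apply the retarget and return the CQEs delivered immediately.
    pub fn on_retarget(&mut self, mode: RetargetMode) -> Vec<(u32, Cqe)> {
        match mode {
            // Graceful: nothing is delivered yet; calls keep running on v1.
            RetargetMode::Graceful => Vec::new(),
            // Force: every in-flight call fails right away.
            RetargetMode::Force => self
                .in_flight
                .drain(..)
                .map(|id| (id, Cqe::Disconnected))
                .collect(),
        }
    }

    /// v1 finishes one call; returns None if the call is no longer tracked.
    pub fn finish(&mut self, id: u32) -> Option<(u32, Cqe)> {
        let pos = self.in_flight.iter().position(|&c| c == id)?;
        self.in_flight.remove(pos);
        Some((id, Cqe::Completed))
    }

    /// v1 may exit only once nothing is in flight.
    pub fn may_exit(&self) -> bool {
        self.in_flight.is_empty()
    }
}
```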

## Relationship to Fault Containment

Live upgrade and fault containment (driver panics → supervisor
respawns) share machinery. The difference is one step of the protocol:

- *Fault containment:* v1 has crashed; kernel has already marked it
  dead and epoch-bumped its caps. Supervisor spawns v2, issues a
  *force* retarget (no quiesce, since v1 is gone and in-flight CALLs
  have already been delivered `DISCONNECTED`). Clients reconnect to v2.
- *Live upgrade:* v1 is healthy; supervisor initiates `quiesce` → state
  transfer → retarget, and no CQE ever reports `DISCONNECTED` to any
  caller.

The epoch-based revocation work from Stage 6 is the foundation for
both. CapRetarget is one additional primitive layered on top.

## Security and Trust

Live upgrade does not expand the trust model. The supervisor already
holds the authority to kill, restart, and reassign caps for services
it spawned — upgrade is a refinement of that authority, not a new
principal. Requirements:

- Only a holder of `ProcessControl` caps to both `old` and `new` can
  call `retargetCaps`. Typically this is the supervisor that spawned
  them.
- The new binary must be legitimately obtained — in practice, loaded
  from the same content-addressed store as everything else (ties to
  Content-Addressed Boot).
- Schema compatibility (`new` is a superset of `old`) is checked by
  the kernel before retarget. This prevents an upgrade from silently
  narrowing the interface clients depend on.
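
A sketch of the first and third requirements as one check. The representation is an assumption made for illustration: a schema is modeled as a set of `(interface type id, method ordinal)` pairs and `ProcessControl` authority as a list of process ids the caller holds control caps for. How the kernel actually encodes either is not defined in this proposal.

```rust
use std::collections::BTreeSet;

/// Hypothetical schema element: (interface type id, method ordinal).
pub type Method = (u64, u16);

pub fn schema(methods: &[Method]) -> BTreeSet<Method> {
    methods.iter().copied().collect()
}

/// Authority and compatibility checks performed before a retarget.
pub fn authorize(
    held_controls: &[u32], // process ids the caller holds ProcessControl for
    old: u32,
    new: u32,
    old_schema: &BTreeSet<Method>,
    new_schema: &BTreeSet<Method>,
) -> Result<(), &'static str> {
    if !(held_controls.contains(&old) && held_controls.contains(&new)) {
        return Err("caller lacks ProcessControl for both processes");
    }
    // Every method clients could call on `old` must exist on `new`.
    if !new_schema.is_superset(old_schema) {
        return Err("new schema is not a superset of old");
    }
    Ok(())
}
```

An identical schema passes the check, since a set is a superset of itself; only a narrowing upgrade is rejected.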

## Non-Goals

- **Code hot-patching.** No binary-level function replacement. Upgrade
  is at the process boundary, not the symbol boundary.
- **Kernel live replacement.** Covered by Reboot-Proof / process
  persistence (reboot with state preserved, not live replacement).
  The kernel is a single trust domain; replacing it in place needs a
  different design.
- **Automatic schema migration across incompatible changes.** If v2's
  state schema is not a capnp-evolution-compatible superset of v1's,
  the service author writes the migration. The kernel does not.
- **System-wide registry of upgradable services.** The supervisor
  knows what it spawned; there is no ambient discovery.

## Phased Implementation

1. **CapRetarget primitive.** Kernel operation + `ProcessControl` cap.
   Useful immediately for stateless services (Case 1) and as the
   foundation of Fault Containment (respawn with a new process, point
   its caps to a fresh instance).
2. **Upgradable interface.** Schema, contract documentation, and a
   Rust helper in `capos-rt` that services derive.
3. **Graceful drain.** Quiesce + in-flight call completion + v1 exit
   synchronization.
4. **Stateful demo.** A service maintaining session state, upgraded
   live with zero session loss. This is the Live Upgrade observable
   milestone.

## Related Work

- **Erlang/OTP `code_change/3`** is the closest prior art: processes
  upgrade their behavior module in place, with a callback to migrate
  state. capOS differs mainly in that state transport goes through capnp
  rather than Erlang term format, and that the process boundary is an
  OS process rather than a BEAM process.
- **Fuchsia component updates** rebind component instances in the
  routing graph. Similar primitive in a different mechanism.
- **nginx `-s reload`** is graceful restart for request/response
  servers. The design here generalizes it by exposing the state
  migration point explicitly rather than relying on "the session is
  the request."

## Cross-Links

- **`service-architecture-proposal.md`** — owns the supervisor surface
  that drives this proposal's protocol. The "Supervisors" and
  "Supervision Tree" sections describe the principal that holds
  `ProcessControl` caps to both `old` and `new` and runs spawn →
  `quiesce` → `resume` → `retargetCaps` → drain → `exit`. The "Service
  Taxonomy" entry **Upgrade manager** is the per-system orchestrator
  that consumes `CapRetarget` for live replacement, distinct from a
  per-subtree supervisor that uses the same primitive for fault
  containment (respawn after crash). Schema compatibility for `new`
  vs `old` is the same superset check the manifest executor and the
  boot package contract already require, not a new policy invented
  here.
- **`cloud-deployment-proposal.md`** — owns the binary delivery story
  this proposal depends on. `new` must be obtained from the same
  content-addressed boot package / image-update pipeline the cloud
  deployment plan describes, not from an ad-hoc path. Cloud-managed
  services (KMS clients, metadata agents, log/metric shippers, the
  cloud-metadata agent itself) are exactly the Case 2 / Case 3
  services where this proposal's value shows up first: they hold
  long-lived caps to upstream cloud APIs, and a restart that drops
  those caps either re-runs IAM/JWT handshakes or, worse, drops
  audit/log shippers' in-flight buffers. The bootable disk image
  / NVMe path defines what "update the binary" means on real
  hardware; until then the manifest-embedded `BootPackage` blobs are
  the only source of `new`.
- **`storage-and-naming-proposal.md`** — owns the Case 2 holders
  (`Store`, `SessionMap`, `Namespace`) the idiomatic service
  factoring relies on, and the future sealed/stored capability path
  that lets state survive across reboot, not just across live
  upgrade. Case 3 state-transfer is the strictly weaker contract:
  same capnp wire format, but the snapshot only has to outlive a
  single `retargetCaps` call, not power loss.
- **`system-monitoring-proposal.md`** — `quiesce` start, `resume`
  completion, `retargetCaps` mode (graceful vs force), drain
  duration, and rollback (kill `new`, resume `old`) are
  audit-worthy lifecycle events. The upgrade manager emits them
  through the audit cap so an operator can correlate a service
  binary change with downstream behavior. Graceful upgrades by
  definition emit zero `DISCONNECTED` CQEs; force-mode and
  fault-containment respawns do, and that distinction is what the
  audit record has to preserve.
- **`security-and-verification-proposal.md`** — `retargetCaps` is a
  natural target for bounded modeling: per-CapId atomicity (no SQE
  submitted before retarget lands on `new`; no SQE submitted after
  lands on `old`), graceful-mode in-flight completion (`old`'s
  ring drains before `exit`), and schema-superset enforcement at the
  kernel before retarget. Force-mode `DISCONNECTED` delivery is the
  same epoch-revocation path the fault-containment story already
  needs, not a separate kernel surface.
- **`../design-risks-register.md`** — the register currently carries
  no dedicated R-entry for live upgrade, which is intentional: no
  implementation exists yet. The closest cross-cutting entries are
  **R6** (`CAP_OP_RELEASE` is deferred), because graceful drain has
  to outlive the per-process release path before `v1.exit()` is
  safe; **R12** (verification coverage is partial), because the
  per-CapId retarget atomicity and graceful-drain invariants belong
  in a bounded model before this lands; and **Q7** (revocation
  strategy), because force-mode retarget shares the epoch path the
  open revocation decision will pick. Open a dedicated R-entry once
  `CapRetarget` lands in code, since at that point retarget
  atomicity, graceful-drain shutdown, and the supervisor-only
  authority constraint become long-horizon design surfaces in their
  own right.