# Proposal: Volume Encryption

Encrypting system and user volumes in a capability OS where storage is
already a stack of typed capabilities and keys can be first-class
capability objects.


## Problem

capOS currently has no persistent storage, no crypto, no TPM driver, and
no block-device drivers. That is the right moment to decide what
encryption-at-rest looks like, before storage interfaces and service
graphs harden around plaintext assumptions.

Traditional OSes bolt encryption on as a kernel subsystem
(`dm-crypt`/LUKS, BitLocker, FileVault, fscrypt). That choice follows
from those kernels' architecture: the kernel owns block I/O, the
filesystem, the keyring, and the trust domain between processes, so
encryption logically lives there too. capOS has made the opposite bet —
the kernel is a capability router, block I/O lives in userspace
services, filesystems are userspace services, and there is no ambient
keyring because there is no ambient anything.

Putting crypto in the kernel would contradict Design Principle 5
("the kernel is becoming a capnp-rpc router") and Principle 7
("pragmatic reuse" — let userspace crates do what they already do
well). Putting it nowhere leaves the system unable to protect data at
rest. The proposal below places encryption in userspace services
expressed as capabilities, with no new kernel mechanism.

## Threat Model

Four attackers are worth distinguishing up front, because the defenses
differ:

1. **Offline disk theft.** Attacker has the storage medium, no live
   system, no running key service, possibly no hardware attestation.
   Ciphertext must reveal nothing about plaintext beyond length and
   block boundaries.
2. **Ciphertext tampering at rest.** Attacker can write to the medium
   and hopes to flip ciphertext bits to produce attacker-chosen
   plaintext changes (classic XTS malleability). Modification must be
   *detected*, not merely scrambled.
3. **Peer userspace service holding the raw `BlockDevice` cap.** The
   virtio-blk driver, a backup agent, a telemetry exporter, or any
   service that is on the physical I/O path. They hold authority to
   read sectors but must not see plaintext for volumes whose key they
   do not hold.
4. **Compromised session with a live key cap.** Once an attacker is
   inside a user's session and holds the user's `SymmetricKey` cap, that user's
   data is lost. The goal is *lateral* containment: no cross-user
   leverage, no escalation to the system volume, no access to other
   sessions' keys.

Out of scope for a first pass:
- Cold-boot RAM attacks and side channels (mitigation: use TPM-bound
  keys when available, but physical memory reads against a running
  host are not defended).
- Evil-maid attacks on the unencrypted portion of the boot image
  (addressed separately by secure boot / measured boot — see
  [storage-and-naming-proposal.md](storage-and-naming-proposal.md)
  Open Question #5).
- Traffic analysis against encrypted backups or encrypted replication.
- Key escrow for legal recovery. capOS takes no position; a deployment
  can add an escrow `KeySource` without changing the model.

## Keys Are Capabilities

Key material never crosses cap boundaries. Callers hold
`SymmetricKey` or `PrivateKey` capabilities whose methods run inside
the service that holds the key; the holder gets encrypt/decrypt/sign
authority, not the bytes. Attenuation (decrypt-only, AAD-pinned,
purpose-bound) is built from wrapper CapObjects, the same mechanism
that builds read-only `File` caps.

This proposal does not define those interfaces. They belong to
[cryptography-and-key-management-proposal.md](cryptography-and-key-management-proposal.md),
which covers `SymmetricKey`, `PrivateKey`/`PublicKey`, `KeySource`,
`KeyVault`, algorithm and purpose enums, seal policies, and the set
of concrete key sources (manifest-embedded, passphrase, passkey PRF,
TPM 2.0, cloud KMS, attestation, network, software-stored). Volume
encryption is one consumer among many.

## Layer Placement

Two layers exist, and a first-class design uses both.

### Layer A — `EncryptedBlockDevice` (LUKS analog)

A userspace service holds two caps — `BlockDevice` (raw) and
`SymmetricKey` — and exports a new `BlockDevice` cap that looks
identical to its input but encrypts writes and decrypts reads
transparently. Everything above the wrapper (filesystems, the Store
service, content-addressed backends) is oblivious.

```
Raw block device
  → virtio-blk / NVMe driver → BlockDevice cap (ciphertext)
    → EncryptedBlockDevice service holds [BlockDevice + SymmetricKey]
      → BlockDevice cap (plaintext-view)
        → FAT / ext4 / Store service
          → File / Directory / Namespace caps
            → App
```

Properties:

- One key per volume (or per-range, see "Key hierarchy" below).
- Granularity is a sector/block. Metadata in the filesystem layer is
  encrypted along with data — the shape of the directory tree is
  invisible to threat #3.
- Incompatible with zero-copy device DMA into user pages (see
  "SharedBuffer" below).

Layer A defends against threats #1, #2, and #3.
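
The shape of the wrapper can be sketched in a few lines. This is a
Python illustration, not the service itself: `RamBlockDevice`,
`EncryptedBlockDevice`, and the toy keystream-plus-HMAC cipher are all
hypothetical stand-ins for the real `BlockDevice` cap and the AEAD
chosen below ("Cryptographic Construction"); the tag dict stands in
for the on-disk tag area.

```python
import hmac, hashlib, os

SECTOR = 512

def _keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    # Toy SHA-256 counter stream; a stand-in for the real AEAD cipher.
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + ctr.to_bytes(4, "little")).digest()
        ctr += 1
    return out[:n]

class RamBlockDevice:
    """Stands in for the raw BlockDevice cap (ciphertext side)."""
    def __init__(self, sectors: int):
        self.data = [bytes(SECTOR) for _ in range(sectors)]
    def read(self, lba): return self.data[lba]
    def write(self, lba, buf): self.data[lba] = bytes(buf)

class EncryptedBlockDevice:
    """Plaintext-view BlockDevice: same interface, transparent crypto.
    Tags live in a dict here; the real service would keep them in a
    reserved tag area described by VolumeFormat.tagAreaLayout."""
    def __init__(self, raw, key: bytes):
        self.raw, self.key = raw, key
        self.k_nonce = hashlib.sha256(b"nonce" + key).digest()
        self.tags = {}
    def _nonce(self, lba: int) -> bytes:
        # Deterministic, LBA-derived nonce: no per-sector nonce storage.
        return hmac.new(self.k_nonce, lba.to_bytes(8, "little"),
                        hashlib.sha256).digest()[:12]
    def write(self, lba: int, plaintext: bytes):
        n = self._nonce(lba)
        ct = bytes(a ^ b for a, b in zip(plaintext, _keystream(self.key, n, SECTOR)))
        self.tags[lba] = hmac.new(self.key, n + ct, hashlib.sha256).digest()[:16]
        self.raw.write(lba, ct)
    def read(self, lba: int) -> bytes:
        ct, n = self.raw.read(lba), self._nonce(lba)
        expect = hmac.new(self.key, n + ct, hashlib.sha256).digest()[:16]
        if not hmac.compare_digest(self.tags[lba], expect):
            raise ValueError("tag mismatch: block tampered or wrong key")
        return bytes(a ^ b for a, b in zip(ct, _keystream(self.key, n, SECTOR)))
```

A peer holding the raw cap (threat #3) reads only ciphertext; a
flipped ciphertext bit (threat #2) fails the tag check on read.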

### Layer B — per-user `Namespace` / `Directory` encryption (fscrypt analog)

Layered above a filesystem or Store, Layer B encrypts object contents
and, optionally, object names, using a per-user key. The underlying
block device may or may not also be encrypted.

```
BlockDevice (ciphertext or plaintext)
  → Store service → Store/Namespace caps (ciphertext objects)
    → EncryptedNamespace service holds [Namespace + UserKey]
      → Namespace cap (plaintext-view)
        → User's session services
```

Properties:

- One key per user (or per session, per device, per tenant).
- Metadata at the filesystem/Store layer is visible to threat #3
  unless Layer A is also in place.
- Cap boundaries are naturally per-user — revocation is "drop the
  cap," no filesystem rekeying.
- Compatible with shared filesystems across users (per-entry
  encryption).

Layer B defends primarily against #4-lateral (a compromise of user
Bob's session does not reveal user Alice's data) and against a
compromised *shared* filesystem service when the underlying block
layer is unencrypted.

### Recommendation

Use both. Layer A for the system volume and for the per-tenant block
substrate in multi-tenant deployments; Layer B for per-user data on
top of a shared filesystem or store. Users who run single-tenant
desktops can skip B. Cloud VMs that rely on provider-side encryption
of block storage (see "Cloud integration") can skip A and keep B.
The proposal does not mandate either layer; it standardizes the
interface so both compose.

## Volume-Specific Schemas

`SymmetricKey`, `KeySource`, `KeyAlgorithm`, `KeyPurpose`, and
`SealPolicy` are defined in
[cryptography-and-key-management-proposal.md](cryptography-and-key-management-proposal.md).
This proposal adds only the wrapper-factory and on-disk-format
schemas.

### `EncryptedBlockDevice`

Exposes nothing new — it implements the existing `BlockDevice`
interface. The distinction is where it sits in the cap graph. A
factory cap creates it:

```capnp
interface EncryptedBlockDeviceFactory {
    open @0 (raw :BlockDevice, key :SymmetricKey, format :VolumeFormat)
         -> (plain :BlockDevice);
    format @1 (raw :BlockDevice, key :SymmetricKey, params :FormatParams)
           -> (plain :BlockDevice);
}

struct VolumeFormat {
    superblock     @0 :Data;  # read from raw device during open()
    algorithm      @1 :SymmetricAlgorithm;  # defined in key-management proposal
    sectorSize     @2 :UInt32;
    tagAreaLayout  @3 :TagAreaLayout;
}
```

## Cryptographic Construction

Two separate questions — block layer and object layer — with different
answers.

### Block layer (Layer A)

*Requirement:* authenticate every block. XTS alone is not enough; it
defends against #1 but not #2.

*Shortlist:*

- **AES-256-GCM-SIV with LBA-derived nonce + separate tag area.** The
  nonce is `HMAC(K_nonce, LBA)` (deterministic, no extra storage). The
  tag (128 bits) is stored in a reserved tag area, either a sidecar
  journal (dm-integrity style) or a reserved footer per block group.
  Cost: ~3% storage overhead for the tag, one extra read/write to the
  tag area per I/O (usually absorbed by sector grouping). Defends
  against #1 and #2.
- **XChaCha20-Poly1305 with random nonce + tag.** Same tag-storage
  problem as GCM-SIV; XChaCha's 192-bit nonce removes nonce-reuse
  concerns entirely. Slower than AES on hardware that has AES-NI,
  faster on hardware that doesn't (e.g. low-end ARM).
- **AES-256-XTS alone.** The LUKS1/LUKS2 default. Reject this as the
  sole defense; it fails #2. May still be useful as a building block
  under an external MAC (dm-integrity + dm-crypt in Linux).
- **Wide-block constructions (HCTR2, Adiantum).** Length-preserving,
  no MAC. Better diffusion than XTS but still fail #2. Useful only
  when storage overhead for tags is unacceptable and tamper-detection
  is being provided elsewhere.

**Recommendation:** AES-256-GCM-SIV with LBA-derived nonce and a
dedicated tag area, fallback to XChaCha20-Poly1305 on hardware without
AES-NI. Document the tag-area layout in `VolumeFormat`; don't invent a
scheme per deployment.
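
The nonce derivation and the tag-area arithmetic are small enough to
show concretely. A hedged sketch; the group size and the
footer-per-block-group layout are illustrative choices, not the
normative `TagAreaLayout`:

```python
import hmac, hashlib

SECTOR_SIZE = 512   # data bytes per sector
TAG_SIZE = 16       # 128-bit GCM-SIV tag
GROUP = 256         # data sectors per block group (illustrative)

def sector_nonce(k_nonce: bytes, lba: int) -> bytes:
    # Deterministic 96-bit nonce: HMAC(K_nonce, LBA), truncated.
    # Each LBA always maps to the same nonce, and GCM-SIV tolerates
    # nonce reuse anyway, so no per-sector nonce storage is needed.
    return hmac.new(k_nonce, lba.to_bytes(8, "little"),
                    hashlib.sha256).digest()[:12]

def tag_offset(lba: int) -> tuple[int, int]:
    """(raw LBA where this group's tag footer starts,
    byte offset of this sector's tag within that footer)."""
    footer_sectors = (GROUP * TAG_SIZE + SECTOR_SIZE - 1) // SECTOR_SIZE
    group = lba // GROUP
    footer_start = group * (GROUP + footer_sectors) + GROUP
    return footer_start, (lba % GROUP) * TAG_SIZE

# The "~3%" storage overhead above: one 16-byte tag per 512-byte sector.
overhead = TAG_SIZE / SECTOR_SIZE   # 16/512 = 3.125%
```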

### Object layer (Layer B)

*Requirement:* per-object authentication; compatibility with
content-addressed storage where possible.

Options, with the honest tradeoffs:

- **Per-tenant keys, `hash(ciphertext)` as address.** Each user's Store
  encrypts with their key. Dedup works within a volume, not across.
  Metadata (object size, access patterns) is visible to a peer holding
  the backing `BlockDevice`. This is the **recommended default**.
- **Per-tenant keys, `HMAC(K, plaintext)` as address.** Address derived
  deterministically from plaintext allows a user to look up their own
  objects by plaintext hash without scanning. Same cross-tenant
  properties as above.
- **Convergent encryption (key = `hash(plaintext)`).** Global dedup
  across users, but leaks equality: "user X holds the same file as
  user Y." Rejected as a default; too much leakage for a
  capability-based OS that treats ambient authority as a bug.

All three use an AEAD (GCM-SIV or XChaCha20-Poly1305) per object with
a random nonce stored with the object.
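
The three addressing schemes differ only in what the address (or key)
is derived from, and a stdlib sketch makes the equality-leak tradeoff
visible. Encryption itself is elided; function names are hypothetical:

```python
import hmac, hashlib, os

def addr_ciphertext(ciphertext: bytes) -> str:
    # Recommended default: address = hash(ciphertext).
    return hashlib.sha256(ciphertext).hexdigest()

def addr_hmac_plaintext(k: bytes, plaintext: bytes) -> str:
    # Per-tenant deterministic: look up your own objects by plaintext
    # without scanning, no cross-tenant signal.
    return hmac.new(k, plaintext, hashlib.sha256).hexdigest()

def convergent_key(plaintext: bytes) -> bytes:
    # Convergent: key = hash(plaintext). Global dedup, but two users
    # holding the same plaintext produce the same key and ciphertext.
    return hashlib.sha256(plaintext).digest()

doc = b"same file on two machines"
alice_k, bob_k = os.urandom(32), os.urandom(32)

# Per-tenant addresses differ across tenants: no "X holds what Y holds".
assert addr_hmac_plaintext(alice_k, doc) != addr_hmac_plaintext(bob_k, doc)

# Convergent keys are tenant-independent: equality is observable.
assert convergent_key(doc) == convergent_key(doc)
```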

## System Volume Flow

1. Boot firmware loads Limine, which loads the kernel + init + boot
   services from an unencrypted boot partition.
2. Kernel spawns init. Init spawns a minimal service graph: block
   device driver, console service, `KeySource` service (one of
   passphrase / TPM / cloud KMS / manifest-embedded), and the
   `EncryptedBlockDeviceFactory` service.
3. Init obtains the unlock context. For interactive boot: read a
   passphrase via the console login flow in
   [boot-to-shell-proposal.md](boot-to-shell-proposal.md). For
   unattended boot: invoke TPM unseal, KMS decrypt, or an attestation
   protocol. Contexts that require networking (cloud KMS, Tang) come
   up after the network stack.
4. Init hands `(BlockDevice, SymmetricKey)` to `EncryptedBlockDeviceFactory.open`
   and receives a plaintext-view `BlockDevice`.
5. Init hands that `BlockDevice` to the filesystem or Store service,
   which becomes the *system* storage root.
6. Init pivots to the services graph baked in the now-readable system
   volume. Services that do not need direct I/O never see a raw
   `BlockDevice` and therefore never see ciphertext.

Analogous to Linux's `initramfs` pattern, but with capabilities
instead of `/dev` paths.

## User Volume Flow

1. User authenticates through the login flow in
   [boot-to-shell-proposal.md](boot-to-shell-proposal.md). Success
   yields a session and a `CredentialStore` response.
2. `SessionManager` invokes the user's `KeySource` — passkey PRF,
   password-derived, or cloud-held — yielding a user `SymmetricKey`.
3. `SessionManager` hands `(UserNamespace, UserKey)` to an
   `EncryptedNamespaceFactory.open` and receives a plaintext-view
   `Namespace`.
4. The plaintext Namespace is installed in the session's CapSet.
   Services in the session see only the user's decrypted view.
5. On logout, the session is torn down; the user `SymmetricKey` cap is
   released; the key service's in-process material is zeroized.
   `EncryptedNamespace` stops decrypting. Ciphertext remains intact
   on disk.

Revocation is a cap-drop, not a filesystem rekey.
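
The logout semantics in step 5 can be modeled directly: a key cap
exposes authority as methods, never bytes, and revoking it zeroizes
the in-process material while ciphertext on disk is untouched. A
Python sketch with hypothetical names (`mac` stands in for the
decrypt authority a real `SymmetricKey` cap would offer):

```python
import hmac, hashlib, os

class SymmetricKeyCap:
    """Models the key service's in-process handle: methods, not bytes."""
    def __init__(self, material: bytes):
        self._material = bytearray(material)
        self._live = True
    def mac(self, data: bytes) -> bytes:
        # Stands in for encrypt/decrypt authority behind the cap.
        if not self._live:
            raise PermissionError("cap revoked")
        return hmac.new(bytes(self._material), data, hashlib.sha256).digest()
    def revoke(self):
        # Logout: zeroize in-process material, then stop serving.
        for i in range(len(self._material)):
            self._material[i] = 0
        self._live = False

cap = SymmetricKeyCap(os.urandom(32))
tag = cap.mac(b"object")   # session uses the key's authority
cap.revoke()               # cap-drop, not a filesystem rekey
```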

## SharedBuffer and DMA

`SharedBuffer` (`docs/roadmap.md` Stage 6 / MemoryObject) exists so devices can
DMA directly into app pages. Software block encryption is inherently
incompatible with that: the device writes ciphertext; the app expects
plaintext.

Three honest answers:

1. **Extra copy.** Driver DMAs into a scratch page held by the
   `EncryptedBlockDevice` service, which decrypts into the app's
   `SharedBuffer`. One extra copy per I/O. Simple; correct; first
   implementation. Cost is dominated by the crypto itself, not the
   copy, for typical I/O sizes.
2. **Decrypt in place.** Device DMAs ciphertext into the app's
   `SharedBuffer`; the service decrypts it in-place before completion
   is posted. Saves a copy, keeps CPU crypto on the hot path, and
   complicates reuse of the buffer (the app sees ciphertext briefly,
   then plaintext). Viable once the buffer lifetime is
   well-specified.
3. **Hardware inline crypto.** NVMe OPAL, SED drives, Intel CSE,
   AES-XTS block engines on some ARM SoCs. Device sees the key; DMA
   paths see plaintext; software sees an unencrypted-looking device.
   Different trust model — the *device* is now in the TCB — and
   different key-provisioning story (IEEE 1667 / TCG Opal PSID). Note
   for future work; not a first-implementation target.

First implementation: #1. Revisit #2 when I/O performance matters.
Treat #3 as a separate capability shape (`SelfEncryptingBlockDevice`)
rather than a flag on the main interface.
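
Option 1's data path is one function. Byte strings stand in for DMA
pages and a toy XOR cipher stands in for the AEAD; everything here is
illustrative:

```python
import hashlib

def _pad(key: bytes, n: int) -> bytes:
    # Toy keystream standing in for the real AEAD.
    return (hashlib.sha256(key).digest() * (n // 32 + 1))[:n]

def xor(data: bytes, key: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, _pad(key, len(data))))

# "Device": a dict of ciphertext sectors the driver DMAs from.
ciphertext_store = {7: xor(b"hello dma", b"k")}

def read_with_extra_copy(app_buffer: bytearray, lba: int, key: bytes):
    scratch = ciphertext_store[lba]          # 1) DMA into a scratch page
    plaintext = xor(scratch, key)            # 2) service decrypts
    app_buffer[:len(plaintext)] = plaintext  # 3) the one extra copy

buf = bytearray(16)
read_with_extra_copy(buf, 7, b"k")
```

The app's `SharedBuffer` never holds ciphertext, which is exactly the
property option 2 trades away.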

## Boot Order and the Unencrypted Boot Partition

By construction there must be an unencrypted partition containing at
least: Limine, kernel, init, the block device driver, the key-source
service(s), the encrypted block device factory, and — if the key
source requires it — a minimal networking stack.

This partition is the trust root for the whole system. It does **not**
need to be encrypted, because its contents are either
integrity-protected by a measured-boot chain or considered public
anyway (the capOS binaries are open source). It **does** need to be
integrity-protected, which is secure boot / measured boot — addressed
in [storage-and-naming-proposal.md](storage-and-naming-proposal.md)
Open Question #5 and not duplicated here.

Relationship to that question: a TPM-sealed `KeySource` *requires*
measured boot to be useful. Without measurement, a tampered boot
partition can unseal the key under attacker-controlled code. A
passphrase `KeySource` does not require measured boot, only the
expectation that the user will notice if the boot UI looks wrong. A
cloud KMS `KeySource` relies on cloud-provider instance identity,
which is a parallel trust story (see below).

## Cloud Integration

Cloud environments change every part of this picture: the block device
is virtual, the key store is a network service, instance identity is
provider-signed, object storage exists as a first-class primitive, and
backups are a product, not a script. capOS should treat each of these
as a capability and reuse them.

### Cloud block storage (EBS, GCP Persistent Disk, Azure Disk)

These volumes are already encrypted at rest by the provider. The
question is *whose key* performs the encryption:

| Model                      | Provider sees plaintext?  | Customer controls key? | Customer does crypto? |
|---------------------------|---------------------------|------------------------|-----------------------|
| Provider-managed (default)| Yes (plaintext in volume) | No                     | No                    |
| Customer-managed (CMEK)   | Yes (plaintext in volume) | Yes (via KMS)          | No                    |
| Customer-supplied (CSEK)  | Briefly, during request   | Yes                    | No                    |
| Client-side (Layer A)     | No                        | Yes                    | Yes                   |

capOS's `BlockDevice` cap is indifferent to which of the first three
the provider is doing. For the fourth — client-side encryption — capOS
wraps the provider's `BlockDevice` cap in its own
`EncryptedBlockDevice`. The provider sees only ciphertext and cannot
read the volume even with a compelled-disclosure order.

Deployment guidance:

- **Untrusted provider / compliance-driven:** Layer A over cloud
  block storage. Provider-side encryption becomes a belt-and-braces
  redundancy.
- **Trusted provider / operational simplicity:** rely on CMEK, skip
  Layer A. Capability model still contains peer services — a
  compromised capOS service does not get raw block I/O unless it
  holds the cap.
- **Confidential-computing VMs (SEV-SNP / TDX / Nitro):** use Layer A
  with an attestation-gated `KeySource`. The attestation report
  proves the VM is genuine and running approved code; KMS releases
  the DEK only against a valid report.

### Cloud KMS (AWS KMS, GCP KMS, Azure Key Vault, Vault, …)

Envelope encryption is the universal pattern: the cloud KMS holds a
*key-encrypting key* (KEK) with tight IAM-bound access; the actual
data-encrypting key (DEK) is generated by capOS, wrapped by the KEK,
stored alongside the ciphertext, and unwrapped by KMS at unlock time.

Map to capabilities:

- A `CloudKmsKeySource` service implements `KeySource`. `unlock(blob)`
  sends the wrapped DEK to KMS for `Decrypt`, receives the plaintext
  DEK, constructs a local `SymmetricKey` cap around it, and returns it.
- The service authenticates to KMS using the VM's instance identity,
  obtained from a `CloudMetadata`-derived `InstanceIdentity` cap (see
  [cloud-metadata-proposal.md](cloud-metadata-proposal.md)). No
  long-lived credentials are baked into the image.
- `seal(key, KmsPolicy{kmsKeyId, grant})` calls KMS `Encrypt` to wrap
  the key under the named KEK and returns the opaque blob.
- KMS audit logs record every unwrap. This is a free observability
  win capOS inherits by delegation; nothing in the OS needs to log
  key usage separately.

Benefits of envelope encryption that capOS gets by following the
pattern:

- **Free KEK rotation.** Rotating the KEK requires only re-wrapping
  the DEK (fast, metadata-only). The DEK itself stays; the volume is
  not rewritten. A `rewrap` method on `KeySource` makes this explicit.
- **Revocation.** Disable the KMS key or revoke the IAM grant; the
  next `unlock` fails. Running instances with a cached DEK continue
  until reboot — matches Linux behavior.
- **Cross-region / cross-account access.** KMS grants move
  ciphertext-readable capability between accounts without handing
  over the key material. capOS reads that as "the receiving account
  holds a `KeySource` cap whose policy the grant satisfies."
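
The rewrap property is worth seeing in code: rotating the KEK touches
only the wrapped-DEK blob, never the volume. A sketch with a toy
stream wrap standing in for KMS `Encrypt`/`Decrypt` (not real
key-wrapping; the point is what moves):

```python
import hashlib, os

def _stream(kek: bytes, nonce: bytes, n: int) -> bytes:
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(kek + nonce + ctr.to_bytes(4, "little")).digest()
        ctr += 1
    return out[:n]

def wrap(kek: bytes, dek: bytes) -> bytes:     # stands in for KMS Encrypt
    nonce = os.urandom(16)
    return nonce + bytes(a ^ b for a, b in zip(dek, _stream(kek, nonce, len(dek))))

def unwrap(kek: bytes, blob: bytes) -> bytes:  # stands in for KMS Decrypt
    nonce, ct = blob[:16], blob[16:]
    return bytes(a ^ b for a, b in zip(ct, _stream(kek, nonce, len(ct))))

dek = os.urandom(32)      # generated by capOS; encrypts the volume
kek_v1 = os.urandom(32)   # held by the KMS, never by capOS
blob = wrap(kek_v1, dek)  # stored alongside the volume ciphertext

# KEK rotation: re-wrap the DEK under the new KEK. Metadata-only;
# the volume ciphertext (encrypted under the DEK) is untouched.
kek_v2 = os.urandom(32)
blob = wrap(kek_v2, unwrap(kek_v1, blob))
```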

Non-AWS KMS providers (Vault, HSM clusters, KMIP devices) fit the
same interface. The `CloudKmsKeySource` service name is a placeholder;
production likely wants one service per provider, or one generic
service with a provider-selection parameter.

### Instance identity and attestation

Cloud VMs authenticate to KMS without baked-in credentials because the
hypervisor signs identity tokens. AWS IMDSv2's instance-identity
document, GCP metadata identity tokens, and Azure IMDS all produce
short-lived signed identity evidence (JWTs on GCP and Azure, a
signed identity document on AWS).
Confidential-computing platforms extend this with hardware attestation
reports (SEV-SNP, TDX, Nitro).

An `InstanceIdentity` capability — carved out of
[cloud-metadata-proposal.md](cloud-metadata-proposal.md) — exposes
these token and attestation paths. Key-source services consume that
cap instead of pulling from an ambient metadata endpoint. Revoking a
service's access to the metadata service becomes a cap-graph edit:
no firewall rules, no iptables on `169.254.169.254`.

### OIDC-gated volume unlock (workload identity federation)

`InstanceIdentity` is the raw material. Modern clouds consume it
through OIDC token exchange (RFC 8693) rather than a
provider-specific identity API. That pattern is defined in
[oidc-and-oauth2-proposal.md](oidc-and-oauth2-proposal.md) as
`WorkloadIdentityFederation`; volume encryption consumes it through
`OidcFederatedKeySource` (see
[cryptography-and-key-management-proposal.md](cryptography-and-key-management-proposal.md)).

System-volume flow:

1. Boot the key-less image. `init` starts the block driver, the
   metadata service, and the OAuth service, but never holds raw
   cloud credentials.
2. `CloudMetadata` returns an `InstanceIdentity` cap (a signed JWT
   from the hypervisor).
3. `WorkloadIdentityFederation.exchange` posts that JWT to the cloud
   STS with `grant_type = urn:ietf:params:oauth:grant-type:token-exchange`
   and `subject_token_type = urn:ietf:params:oauth:token-type:jwt`.
   It receives a short-lived cloud access token bound to the
   instance's identity.
4. `OidcFederatedKeySource` uses that access token to authenticate a
   `Decrypt` call on the wrapped DEK at the cloud KMS. The plaintext
   DEK returns as a `SymmetricKey` cap.
5. `EncryptedBlockDeviceFactory.open` composes that key with the raw
   `BlockDevice` and returns a plaintext-view `BlockDevice`.
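
The step-3 request is a plain RFC 8693 form post; nothing
provider-specific is needed to build it. A sketch (endpoint, token
value, and audience are placeholders):

```python
from urllib.parse import urlencode

instance_jwt = "eyJ...instance-identity..."  # from the InstanceIdentity cap

# RFC 8693 token-exchange request body, to be POSTed to the cloud STS
# token endpoint. The response's access token feeds the KMS Decrypt
# call in step 4.
body = urlencode({
    "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
    "subject_token": instance_jwt,
    "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
    "requested_token_type": "urn:ietf:params:oauth:token-type:access_token",
    "audience": "https://kms.example.cloud",  # placeholder audience
})
```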

Per-user volume flow (Layer B):

1. Alice authenticates through console or web shell OIDC; the IdP
   issues an ID token and an access token.
2. `SessionManager` mints her `UserSession`; her `AccessToken` cap
   is handed to `OidcFederatedKeySource` wrapped inside the
   broker-returned session bundle — never as a bearer string.
3. The key service enforces
   `SealPolicy.tokenExchange { issuer, audience, subjectPattern,
   requiredClaims, minAuthStrength }`. It verifies the access token
   (or an ID token it exchanges for) against its pinned IdP
   trust record and only then releases Alice's DEK.
4. `EncryptedNamespaceFactory.open` yields Alice's plaintext
   namespace. Logout drops the cap; the in-process key material
   zeroizes.

Properties this adds on top of plain `CloudKmsKeySource`:

- **No long-lived IAM credentials anywhere in the image.** The
  historical instance-role access-key pair is gone; what remains
  is a short-lived access token tied to the live workload.
- **Audit keyed on principal.** Cloud KMS logs the OIDC `sub` of
  every Decrypt, so "Alice's laptop unlocked her volume at 09:14"
  is observable without extra audit glue.
- **Step-up authentication on the unlock path.**
  `TokenExchangePolicy.minAuthStrength` maps to X.1254 LoA. A volume
  requiring `loa3` cannot be unlocked by a passwords-only session.
- **Revocation through IdP or KMS.** Disable Alice at the IdP or
  revoke the IAM grant and the next unlock fails. Cached DEKs in
  running instances survive until reboot — identical to today's
  cloud KMS semantics but explicit.

### Token TTL vs. cached DEK

OIDC access tokens typically expire in minutes; DEKs typically live
for as long as a volume is mounted. `OidcFederatedKeySource.unlock`
is called once per mount; the DEK cap is held by the encrypted
block/namespace service until mount ends. Token expiry after unlock
does not re-lock the volume. This matches every other KMS-unwrap
pattern (`CloudKmsKeySource`, `Tpm2KeySource`), but it is worth
saying aloud: short-lived tokens give short-lived *authorization
freshness*, not short-lived *key availability*. Deployments that
want stricter revocation can:

- require periodic re-unlock (re-mount) via broker policy,
- keep the volume mounted read-only by default and require a fresh
  token for each write window,
- or use a confidential-computing + attestation-gated KEK that the
  hardware refuses to re-release on policy change.

### No baked credentials policy

The capOS ISO must contain neither a long-lived cloud IAM credential
nor a long-lived bearer token. `ManifestEmbeddedKeySource` remains
dev/CI only. Production builds pass through one of:
`Tpm2KeySource`, `AttestationKeySource`, `CloudKmsKeySource`
(instance-identity flow), or `OidcFederatedKeySource`
(workload-federation flow). The manifest validator should refuse a
production-profile image that embeds a symmetric volume key or a
long-lived cloud credential.

### Object storage (S3, GCS, Azure Blob)

Object storage is a natural backend for the capability-native
Store. The Store service holds an `S3Bucket` cap, serializes capnp
messages as S3 objects keyed by their content hash, and exports
`Store` / `Namespace` caps to clients.

Encryption trust tiers mirror block storage:

| Model                           | Provider sees plaintext? | Customer key? | Customer does crypto? |
|--------------------------------|--------------------------|---------------|-----------------------|
| SSE-S3                         | Yes                      | No            | No                    |
| SSE-KMS                        | Yes                      | Yes (KMS)     | No                    |
| SSE-C                          | Briefly                  | Yes           | No                    |
| Client-side (Layer B in Store) | No                       | Yes           | Yes                   |

Client-side is the interesting case for capOS. The content-addressed
Store can encrypt each blob with a per-tenant DEK before upload,
keying objects by `hash(ciphertext)` or `HMAC(K, plaintext)`. The DEK
is wrapped by cloud KMS; the bucket can be world-readable without
leaking plaintext. This is a deployment where "the provider stores our
data" and "the provider cannot read our data" coexist.

Nonce management across objects becomes the main design question.
Either:

- random 192-bit nonce per object (XChaCha), stored as an object
  header; or
- derived nonce from object identity (`HMAC(K_n, object_id)`),
  requires that the same plaintext object is never uploaded twice
  under the same key, which is consistent with content-addressing
  semantics.
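
The derived-nonce option is a two-line KDF; 24 bytes fills
XChaCha20-Poly1305's 192-bit nonce. Key and object-id names are
illustrative:

```python
import hmac, hashlib, os

def object_nonce(k_n: bytes, object_id: bytes) -> bytes:
    # Deterministic 192-bit nonce for XChaCha20-Poly1305. Safe only
    # if the same plaintext is never uploaded twice under the same
    # key, which content-addressing already guarantees.
    return hmac.new(k_n, object_id, hashlib.sha256).digest()[:24]

k_n = os.urandom(32)
n1 = object_nonce(k_n, b"sha256:abc...")  # placeholder object id
```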

### Backups

Backups are where encryption choices pay off or hurt:

- **Block-level snapshot / cross-region replication.** The provider
  handles it. A snapshot of a Layer-A-encrypted EBS volume is
  ciphertext; restoring requires the KMS key. Cross-region
  replication requires the key to be grant-accessible in the target
  region. Free; handled by the provider.
- **Application-level backup service.** A backup service holds a
  `Store` or `Directory` cap, reads objects, writes them to an
  object-storage bucket, and records the backup manifest. If Layer B
  is in place, the backup bytes are already encrypted — no
  re-encryption needed, and the backup destination does not need the
  user's key. If only Layer A is in place, the backup service sees
  plaintext because Layer A wraps below the Directory; the backup
  service must re-encrypt for the destination.
- **Restore to a different account / region / capOS install.** The
  key must be reachable in the target environment. For KMS-wrapped
  DEKs: cross-account grants, multi-region KMS keys, or replicated
  key material. For TPM-sealed DEKs: explicit re-seal to the target
  TPM before restore. capOS does not need to implement this
  directly; it needs the `KeySource` abstraction to not hide the
  provider-specific primitives that enable it.

A backup `KeyPolicy` worth documenting: "this key is usable in
regions A, B, and C, wrapped under KMS keys `k_a`, `k_b`, `k_c`, all
granting access to the instance identity role `backup-reader`." This
is routine on AWS and routinely surprising to people who expect Linux
dm-crypt semantics.

### Keys never in the image

The capOS ISO must never contain production keys. The
`ManifestEmbeddedKeySource` (key-management proposal) exists for
development and CI only; the manifest validator should refuse to boot
from an image that embeds a non-development key on a
production-profile manifest. The production flow is always: boot from
a key-less image, obtain identity from the cloud, fetch the wrapping
policy from the cloud, unwrap a DEK via KMS, mount the volume. Same
property as AWS's "EBS with KMS requires no bootstrap secrets on the
instance."

### Confidential computing

SEV-SNP, TDX, and AWS Nitro Enclaves produce attestation reports that
include measurements of the VM image. A KMS policy can require a
matching attestation before releasing the wrapping key. In capOS:

- `AttestationService` exposes `attestation(nonce) -> report` (the
  report includes the image measurement, firmware version, and VM
  metadata signed by the hardware root of trust).
- `KeySource` of kind `attestation` collects the report and submits it
  as part of the KMS `Decrypt` request; KMS enforces the policy
  server-side.
- The trust story becomes: "this capOS image, unmodified, running on
  genuine SEV-SNP / TDX / Nitro hardware, is the only thing that can
  unlock this volume." That is materially stronger than
  instance-identity alone.

This composes cleanly with Layer A: the confidential VM reads
ciphertext from a cloud disk, unwraps the DEK via attestation-gated
KMS, and decrypts locally. The cloud provider never sees plaintext and
a stolen snapshot cannot be decrypted outside the attested VM.

## Phases

No implementation exists. Phases here cover only the volume-specific
work; the underlying key abstractions, key sources, and KMS
integration are phased in
[cryptography-and-key-management-proposal.md](cryptography-and-key-management-proposal.md).
Volume encryption tracks, but does not duplicate, that sequence.

### Phase V1 — `EncryptedBlockDevice` over RAM block device

- Add `EncryptedBlockDeviceFactory`, `VolumeFormat`, `TagAreaLayout`,
  and `FormatParams` to `schema/capos.capnp`.
- Wire the service between a RAM-backed `BlockDevice` and the Store
  or a toy FAT reader. Key source is `ManifestEmbeddedKeySource` from
  the key-management proposal's Phase 1.
- Implement AES-256-GCM-SIV with a reserved tag area; document the
  on-disk format (superblock, tag area layout, block size).
- Measurement: demonstrate a Store survives a ciphertext read of the
  raw RAM disk and fails decrypt after a flipped bit.
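
The second half of that measurement reduces to one property: any
flipped ciphertext bit must fail authentication. A stdlib
encrypt-then-MAC toy stands in for AES-256-GCM-SIV here; only the
property, not the construction, carries over:

```python
import hmac, hashlib, os

def seal(key: bytes, nonce: bytes, pt: bytes):
    ks = (hashlib.sha256(key + nonce).digest() * (len(pt) // 32 + 1))[:len(pt)]
    ct = bytes(a ^ b for a, b in zip(pt, ks))
    tag = hmac.new(key, nonce + ct, hashlib.sha256).digest()[:16]
    return ct, tag

def open_(key: bytes, nonce: bytes, ct: bytes, tag: bytes) -> bytes:
    expect = hmac.new(key, nonce + ct, hashlib.sha256).digest()[:16]
    if not hmac.compare_digest(tag, expect):
        raise ValueError("authentication failed")
    ks = (hashlib.sha256(key + nonce).digest() * (len(ct) // 32 + 1))[:len(ct)]
    return bytes(a ^ b for a, b in zip(ct, ks))

key, nonce = os.urandom(32), os.urandom(12)
ct, tag = seal(key, nonce, b"superblock bytes")
assert open_(key, nonce, ct, tag) == b"superblock bytes"

tampered = bytes([ct[0] ^ 0x01]) + ct[1:]   # flip one ciphertext bit
try:
    open_(key, nonce, tampered, tag)
    raise AssertionError("flipped bit must fail decrypt")
except ValueError:
    pass
```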

### Phase V2 — `EncryptedNamespace` and user-volume path

- Add `EncryptedNamespaceFactory` schema.
- Layer B over a RAM-backed Store. Depends on
  `PassphraseKeySource` (key-management Phase 4) and
  `PasskeyPrfKeySource` once passkey infrastructure lands.
- Revocation tests: dropping a session's key cap renders the
  namespace unreadable without rebooting.

### Phase V3 — Persistent storage integration

- Promote Phase V1 from RAM disk to virtio-blk.
- System volume unlock in the normal boot path. Default dev build
  uses a manifest-embedded key; production build requires
  passphrase/TPM/KMS.
- QEMU smoke: system volume encrypted with a passphrase, reboot
  survives, wrong passphrase fails closed.

### Phase V4 — TPM-backed system volume

- Depends on `Tpm2KeySource` from key-management Phase 5.
- Measured-boot chain: firmware, bootloader, kernel, init, key
  service. PCR composition for a sealed system volume documented.

### Phase V5 — Cloud deployment

- Depends on `CloudKmsKeySource` from key-management Phase 6.
- Client-side encrypted block volume over cloud block storage.
- Optional: client-side encrypted Store backend over object storage.

### Phase V5b — OIDC-federated unlock

- Depends on `OidcFederatedKeySource` from key-management Phase 6b
  and on `WorkloadIdentityFederation` from
  [oidc-and-oauth2-proposal.md](oidc-and-oauth2-proposal.md) Phase 5.
- System volume unlocks through token-exchange against the cloud
  STS; no long-lived IAM credentials in the image.
- Per-user `EncryptedNamespace` unlocks from a user `AccessToken`
  under `SealPolicy.tokenExchange`.
- QEMU smoke against a local fake STS (e.g. `dex`) proves the flow
  end-to-end before targeting a real cloud.

### Phase V6 — Confidential computing

- Depends on `AttestationKeySource` from key-management Phase 7.
- Attestation-gated system volume unlock on SEV-SNP / TDX / Nitro.
- QEMU SEV-SNP smoke (where toolchain supports it).

## Relationship to Other Proposals

- **`cryptography-and-key-management-proposal.md`** — primary
  dependency. Defines `SymmetricKey`, `KeySource`, `KeyVault`,
  `KeyAlgorithm`, `KeyPurpose`, `SealPolicy`, and every concrete key
  source this proposal names. This proposal adds only the volume
  wrapper factories and on-disk format.
- **`storage-and-naming-proposal.md`** — Open Question #5 (manifest
  trust and secure boot) is a prerequisite for a TPM-sealed
  `KeySource` to be meaningful. This proposal extends the storage
  stack with `EncryptedBlockDevice` and `EncryptedNamespace` as
  optional wrapper services; the `BlockDevice`, `File`, `Directory`,
  `Store`, and `Namespace` interfaces are unchanged.
- **`boot-to-shell-proposal.md`** — the passphrase / passkey unlock
  path at the console and in the web gateway feeds `KeySource`
  implementations. `CredentialStore`, `SessionManager`, and
  `AuthorityBroker` already treat a missing credential as a locked
  system rather than an open one; this proposal extends that stance to
  "a missing key source means a missing system volume, not zero-fill."

- **`user-identity-and-policy-proposal.md`** — user-volume keys are
  bound to session identity. The cap chain that yields "you are
  Alice" also yields Alice's KEK.
- **`cloud-metadata-proposal.md`** — `CloudMetadata` and the
  `InstanceIdentity` cap carved out of it are what the cloud
  `KeySource` implementations consume to authenticate to KMS without
  baked-in credentials.
- **`oidc-and-oauth2-proposal.md`** — the `WorkloadIdentityFederation`
  and token-exchange primitives behind `OidcFederatedKeySource`. Also
  the source of the `AccessToken` / `IdToken` cap shape used in
  per-user volume unlock and the policy inputs consumed by
  `SealPolicy.tokenExchange`.
- **`cloud-deployment-proposal.md`** — hardware abstraction for
  NVMe and SED drives lays the groundwork for a future
  `SelfEncryptingBlockDevice` capability (hardware inline crypto),
  distinct from this proposal's software-crypto Layer A.
- **`security-and-verification-proposal.md`** — the encrypted block
  format is a good target for the tiered tooling plan: fuzz corrupted
  ciphertext at the block boundary, proptest round-trips through the
  wrapper, Loom-model the volume unlock state machine, Kani-prove
  LBA-nonce uniqueness invariants. General crypto-side invariants are
  tracked in the key-management proposal.
- **`system-monitoring-proposal.md`** — volume unlock, decrypt
  failure, and format-params events are audit-worthy. The
  `EncryptedBlockDevice` service emits them through the audit cap.
  Generic key events are emitted by the key-management services.
- **`live-upgrade-proposal.md`** — replacing the
  `EncryptedBlockDevice` service must preserve in-flight I/O and the
  DEK. The service holds sensitive state (the key material); live
  upgrade needs a state-transfer path that does not touch the disk
  and does not leak the key through shared memory.
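
The LBA-nonce uniqueness invariant flagged for Kani above reduces to injectivity of a fixed-width encoding. A minimal sketch under one assumed layout (96-bit nonce = 32-bit volume/rekey epoch followed by 64-bit LBA; the real split is not yet decided):

```python
def lba_nonce(epoch: int, lba: int) -> bytes:
    """96-bit AEAD nonce from a volume/rekey epoch and a block address."""
    assert 0 <= epoch < 2**32 and 0 <= lba < 2**64
    return epoch.to_bytes(4, "big") + lba.to_bytes(8, "big")

# Distinct (epoch, lba) pairs map to distinct nonces because both
# fields are fixed-width; Kani would prove this over the whole input
# space, here we only spot-check the construction.
nonces = {lba_nonce(e, l) for e in (0, 1) for l in (0, 1, 2**40)}
assert len(nonces) == 6
```

Uniqueness then rests on the epoch actually incrementing on every format and rekey, which is the stateful half of the invariant.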

## Open Questions

1. **Tag area layout.** Sidecar journal (dm-integrity style, on a
   separate device or partition) vs. a reserved footer per block group
   vs. derived nonces with a separate MAC area. Affects write
   amplification, recovery, and fsync semantics. A small measurement
   study under QEMU would settle it.
2. **Key rotation at scale.** Rewrap-only (KEK rotation) is cheap.
   Rekeying a DEK on a live volume means re-encrypting every block.
   Online rekey is a research problem; for capOS, a controlled offline
   rekey service that reads under the old key and writes under the new
   key is the honest first answer.
3. **Metadata leakage in Layer B.** fscrypt-style filename encryption
   is fiddly (deterministic encryption to preserve directory lookups
   vs. randomized encryption that breaks them). Decide whether
   Layer B encrypts names as well as contents, and how lookups work
   if names are randomized.
4. **Backup re-encryption.** A backup crossing trust boundaries needs
   either shared key material at both ends or an explicit re-encrypt
   step. Who does the re-encryption — the backup service, a dedicated
   re-encryption service, or a KMS-side primitive? Policy question,
   not a mechanism question, but worth documenting defaults.
5. **Hardware inline crypto as a separate capability.** NVMe OPAL and
   SED drives do not fit the software-AEAD model. Define
   `SelfEncryptingBlockDevice` with its own `open`/`lock`/`unlock`
   methods and a separate trust story (the device is in the TCB).
6. **Swap / paging.** No swap yet. When added, encrypted swap with a
   per-boot ephemeral key is standard. The memory-pressure policy,
   page-eligibility rules, and swap lifecycle now live in
   [oom-and-swap-proposal.md](oom-and-swap-proposal.md).
7. **Firmware and boot-partition integrity.** This proposal assumes
   secure boot / measured boot is available when TPM-sealed keys are
   in use. The actual secure-boot work is owned by
   `storage-and-naming-proposal.md` Open Question #5 and is
   *prerequisite*, not in scope here.
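
A back-of-envelope for Open Question 1's measurement study, under assumed (not decided) format parameters of 4096-byte blocks and 16-byte AEAD tags:

```python
BLOCK, TAG = 4096, 16  # assumed sizes, not decided format parameters

space_overhead = TAG / BLOCK              # ~0.39% of the medium for tags
tags_per_tag_block = BLOCK // TAG         # 256 tags fit in one 4 KiB block
data_per_tag_block = tags_per_tag_block * BLOCK  # 1 MiB of data covered

assert round(space_overhead * 100, 2) == 0.39
assert data_per_tag_block == 1 << 20

# Naive sidecar layout: every data-block write also dirties one tag
# block, i.e. 2x write amplification before batching; a footer per
# block group instead turns the tag update into a read-modify-write of
# the group footer, trading amplification for recovery complexity.
```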

Algorithm enum scope, side-channel hardening, post-quantum migration,
GOST support, and audit granularity are answered in
[cryptography-and-key-management-proposal.md](cryptography-and-key-management-proposal.md)'s
open-questions section rather than duplicated here.
