# Proposal: Durable Hardware Audit Log Persistence

How the `HardwareAuditLog` capability moves from a bounded volatile in-kernel
ring to durable, tamper-evident audit storage without claiming authority it
does not have.


## Problem

`HardwareAuditLog` today is a read-only observer over the four hardware
authority caps (`DeviceMmio`, `Interrupt`, `DMAPool`, `DMABuffer`). The kernel
emits one `cap-audit:` line per lifecycle event and appends a copy into a
fixed-size volatile ring (capacity 64, drop-oldest). A manifest-granted snapshot
method exposes a bounded window (cursor, truncation labels, drop counter) to a
single subscriber. The snapshot result already advertises its own limits:
`persistence_status=volatile-only`, `signature_status=unsigned`,
`subscriber_admission_status=production-admission-policy-not-implemented`.

That is honest groundwork, but it has three gaps that block any claim of
durable audit evidence:

1. **No durability.** The ring lives in kernel RAM. A reboot, crash, or QEMU
   exit loses every record. Audit evidence that vanishes on restart cannot
   support post-incident review.
2. **No tamper-evidence.** Records are unsigned and unchained. A future
   consumer reading persisted bytes cannot tell whether history was edited,
   reordered, or truncated.
3. **No production admission policy.** Exactly one manifest-granted reader gets
   a volatile snapshot. There is no model for multiple subscribers, scoped
   read windows, or revocation.

This proposal selects a concrete design for all three so the blocked task
`docs/tasks/blocked/ddf-audit-cap-durable-persistence.md` can proceed to
implementation. It is a design decision, not a kernel change.


## Scope and Non-Claims

This proposal is deliberately narrow. It is **observer-evidence design only**.

- Audit persistence records authority events. It does **not** grant, gate, or
  imply authority. The authority checks stay in the device-manager and
  cap-object paths exactly where they are now.
- Durable audit is **not** IOMMU isolation. It does not bound DMA, validate
  MMIO ranges, or constrain interrupt routes. It records that those events
  happened.
- Durable audit is **not** provider-driver readiness. A persisted audit trail
  does not make a userspace driver production-ready; it makes the driver's
  hardware-cap lifecycle reviewable.
- Tamper-evidence is **detection**, not prevention. A signed, hash-chained log
  proves history was *not* edited if verification passes; it cannot stop a
  privileged writer from refusing to append. Availability of the audit path is
  a separate concern.
- The durable path must **not** depend on volatile QEMU-only state, the
  `qemu` cargo feature proof rings, or local run telemetry. Those remain
  harness scaffolding.


## Design Grounding

- `docs/tasks/blocked/ddf-audit-cap-durable-persistence.md` — acceptance
  criteria and hazard preflight this proposal answers.
- `docs/proposals/cryptography-and-key-management-proposal.md` — `SymmetricKey`
  (`mac`/`verify`), `PrivateKey` (`sign`), `KeySource`, and `KeyVault`
  primitives consumed for tamper-evidence and key lifecycle.
- `docs/proposals/storage-and-naming-proposal.md` — capability-native `Store`,
  append-only `File`/ledger semantics, content hashing, previous-record hash
  chaining, and stale-write rules consumed for the durable ring.
- `docs/proposals/system-monitoring-proposal.md` — audit as a distinct
  append-only record type with its own readers and retention, X.740 audit
  field model, and "observation is authority" principle.
- `docs/dma-isolation-design.md` and
  `docs/plans/device-driver-foundation.md` — the device-driver foundation
  context the hardware authority caps live in.
- `kernel/src/cap/hardware_audit.rs` — the current volatile-ring behavior this
  design preserves and extends.


## Design

### 1. Durable Audit-Record Ring

The durable audit path is a **two-tier** structure: the existing bounded
in-kernel volatile ring stays as a fast-path staging buffer, and a userspace
**audit log service** owns durable persistence behind the capability-native
`Store` interface.

```mermaid
flowchart LR
    DM[Device manager and<br/>hardware cap objects] -->|emit_cap_audit| KR[Kernel volatile ring<br/>capacity 64, drop-oldest]
    KR -->|snapshot cursor poll| ALS[Audit log service<br/>userspace]
    ALS -->|append-only records| ST[(Store / append-only<br/>ledger segment)]
    ALS -->|sealed segment digest| KV[KeyVault / KeySource]
    ALS -->|scoped read window| SUB[Admitted subscribers]
```

**Why a userspace service, not kernel-side disk I/O.** Durable storage means a
block device, a filesystem-like layout, segment rotation, and signing. None of
that belongs in the kernel: the kernel's job is dispatch and isolation. The
kernel keeps doing exactly what it does today — bounded, alloc-free,
lock-light ring emission — and a userspace audit log service drains it through
the existing snapshot cursor. This also keeps the durable path off any
QEMU-only state: the service persists through `Store`, which is backed by a
real `BlockDevice` (or a cloud bridge adapter) per the storage proposal.

**Drain protocol.** The audit log service polls `HardwareAuditLog.snapshot`
with a monotonic `start_sequence` cursor. Each poll returns the window since
the last durably-committed sequence. The service:

1. Reads the snapshot window and the `dropped_records` counter.
2. Appends each record to the current segment (see rotation below).
3. Advances its cursor to `next_sequence` only after the segment write is
   durably committed (`Store` sync).

If the kernel ring drops records between polls (`dropped_records` advanced by
more than the records the service consumed), the service writes a **gap
marker** record into the durable log: `{ kind: gap, lost_count, observed_at }`.
A gap is itself audit evidence — it is recorded, not hidden. The drop-oldest
behavior of the kernel ring is therefore preserved and made *visible* in the
durable log rather than silently lost.
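
The drain steps and the gap marker can be sketched as follows. This is a minimal Python illustration, not the service's implementation: `Snapshot`, `DrainState`, and `drain_once` are hypothetical names, the durable segment is modeled as an in-memory list, and the gap is inferred from a jump in `sequence` rather than from the `dropped_records` delta, which is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical stand-in for one HardwareAuditLog.snapshot result."""
    records: list          # (sequence, payload) pairs in ascending sequence order
    next_sequence: int
    dropped_records: int   # carried for completeness; not used by this sketch


@dataclass
class DrainState:
    cursor: int = 0                                # next sequence the service expects
    durable: list = field(default_factory=list)    # stand-in for the active segment


def drain_once(state: DrainState, snap: Snapshot, observed_at: int) -> None:
    """Append one snapshot window, recording a gap marker if records were lost."""
    pending = []
    if snap.records and snap.records[0][0] > state.cursor:
        # Assumption: a sequence jump means the ring dropped undrained records.
        lost = snap.records[0][0] - state.cursor
        pending.append({"kind": "gap", "lost_count": lost, "observed_at": observed_at})
    for sequence, payload in snap.records:
        pending.append({"kind": "record", "sequence": sequence, "payload": payload})
    state.durable.extend(pending)        # stands in for append plus Store sync
    state.cursor = snap.next_sequence    # advance only after the commit above
```

The ordering matters more than the data model: the cursor moves only after the append that stands in for the durable commit, so a crash between the two repeats a poll instead of skipping records.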

**Retention and rotation.** The durable log is a sequence of fixed-size
**segments** (proposed 1 MiB each; an implementation tuning parameter, not an
ABI). When a segment fills:

1. The service computes the segment digest (see tamper-evidence below).
2. It seals the segment (digest + chain link recorded).
3. It opens the next segment, whose first record carries the previous
   segment's digest as `prev_segment_digest`.

Retention is **count-bounded and age-bounded**: keep at most `N` sealed
segments (proposed default 64) and only segments newer than `T` (proposed
default 30 days); whichever bound retains fewer segments wins. The bounds are a
manifest-configurable policy on the audit log service, not a kernel constant.
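
The rotation steps can be sketched as a small writer that seals a segment when the next record would overflow it. This is a Python illustration under stated assumptions: `SegmentWriter` is a hypothetical name, SHA-256 is an assumed digest, records are joined with newlines as a stand-in for a real on-disk layout, and signing is omitted.

```python
import hashlib


class SegmentWriter:
    """Hypothetical writer that seals a segment once it would exceed max_bytes."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.sealed = []                 # (segment_index, digest_hex) per sealed segment
        self._open(prev_digest="")

    def _open(self, prev_digest: str) -> None:
        # The first record of every segment carries the previous segment's digest.
        self.records = [f"prev_segment_digest={prev_digest}".encode()]
        self.size = len(self.records[0])

    def append(self, record: bytes) -> None:
        if self.size + len(record) > self.max_bytes:
            self._seal()
        self.records.append(record)
        self.size += len(record)

    def _seal(self) -> None:
        # Assumed digest: SHA-256 over the segment's records in order.
        digest = hashlib.sha256(b"\n".join(self.records)).hexdigest()
        self.sealed.append((len(self.sealed), digest))
        self._open(prev_digest=digest)
```

Because each new segment starts with the previous digest, a verifier can walk segment to segment without any index outside the log itself. The sketch does not handle a single record larger than `max_bytes`.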

**Overflow policy.** Two distinct overflow points, two distinct policies:

- **Kernel ring → service drain lag.** Drop-oldest, as today, with a recorded
  gap marker. Rationale: the kernel ring must never block a hardware cap
  lifecycle path on a slow or absent consumer. Audit emission is best-effort by
  construction; the gap marker makes the loss auditable.
- **Durable segment retention limit.** Drop-oldest **sealed segment**, with a
  retention-eviction record appended to the active segment naming the evicted
  segment's digest and sequence range. Rationale: an operator querying "what
  did we lose to retention" gets a definite answer, and the hash chain stays
  intact across the eviction (the eviction record links forward; the evicted
  segment's digest is permanently recorded before deletion).
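
The retention-limit policy can be sketched as a pure function that decides which sealed segments to drop and what to record about them. This is a Python illustration only: `SealedSegment`, `evict`, and the `retention_eviction` record kind are hypothetical names, and the sketch assumes that a segment is kept only if it satisfies both the age bound and the count bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SealedSegment:
    index: int
    digest: str
    first_sequence: int
    last_sequence: int
    sealed_at: float      # seconds since an arbitrary epoch


def evict(segments, max_count, max_age_s, now):
    """Return (kept, eviction_records) for a list ordered oldest-first."""
    kept = [s for s in segments if now - s.sealed_at <= max_age_s]
    if len(kept) > max_count:
        kept = kept[len(kept) - max_count:]      # keep only the newest max_count
    kept_indexes = {s.index for s in kept}
    # One eviction record per dropped segment, to be appended to the active
    # segment before the evicted segment's bytes are deleted.
    eviction_records = [
        {"kind": "retention_eviction", "segment_digest": s.digest,
         "sequence_range": (s.first_sequence, s.last_sequence)}
        for s in segments if s.index not in kept_indexes
    ]
    return kept, eviction_records
```

Returning the eviction records separately from the kept list keeps the ordering explicit: the caller appends and commits the records first and deletes segments second.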

Backpressure is explicitly **rejected** for both points. Backpressuring a
hardware authority cap on audit-storage latency would let a stalled disk wedge
device lifecycle — an availability and correctness hazard far worse than a
recorded gap. Audit is evidence over authority, never a gate on it.

**Crash-recovery semantics.** On audit log service restart:

1. The service scans sealed segments oldest-to-newest, verifying each
   segment digest and the `prev_segment_digest` chain link.
2. If the newest segment is unsealed, the service replays its records,
   recomputing the running digest; a torn final record (incomplete write) is
   truncated at the last valid record boundary and a `recovery_truncation`
   marker is appended.
3. It re-derives the drain cursor from the highest durably-committed
   `sequence` and resumes polling the kernel ring from there.

Records lost in the window between the last durable commit and the crash are
**not recoverable** — the kernel ring is volatile and a crash loses it. This is
an explicit, accepted limitation: see Assumptions. The recovery markers make
the boundary of trustworthy history explicit to any consumer.
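
Finding "the last valid record boundary" needs some record framing, which this proposal does not specify. The Python sketch below assumes a hypothetical framing of a 4-byte big-endian length, the payload, and a CRC32 of the payload; `frame` and `recover` are illustrative names. The CRC here only detects torn writes. It is not part of the tamper-evidence scheme.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Hypothetical on-disk framing: length, payload, CRC32 of the payload."""
    return struct.pack(">I", len(payload)) + payload + struct.pack(">I", zlib.crc32(payload))


def recover(segment: bytes):
    """Return (records, valid_length, truncated) for an unsealed segment."""
    records, offset = [], 0
    while offset + 4 <= len(segment):
        (length,) = struct.unpack_from(">I", segment, offset)
        end = offset + 4 + length + 4
        if end > len(segment):
            break                                    # torn write: record is incomplete
        payload = segment[offset + 4 : offset + 4 + length]
        (crc,) = struct.unpack_from(">I", segment, end - 4)
        if crc != zlib.crc32(payload):
            break                                    # corrupt tail: stop at last good record
        records.append(payload)
        offset = end
    return records, offset, offset < len(segment)
```

When `truncated` is true, the service would cut the segment back to `valid_length` and append the `recovery_truncation` marker before resuming the drain.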

### 2. Tamper-Evidence and Signing

Tamper-evidence is a **hash chain plus segment signing**, consuming the
cryptography/key-management proposal's primitives. No new crypto is invented
here.

**Per-record chaining.** Each durable audit record carries
`prev_record_hash` — a hash over the previous record's canonical bytes. This is
exactly the append-only-ledger pattern the storage proposal already
prescribes ("append new records with previous-record hashes rather than
rewriting history"). Editing or reordering any record breaks every subsequent
`prev_record_hash`, so a verifier walking the chain detects the first
divergence.
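
A minimal Python sketch of the chain and of a verifier that reports the first divergence. The canonical byte encoding (previous hash followed by payload), the all-zero genesis value, and SHA-256 are assumptions made for illustration; the real encoding belongs to the storage proposal's ledger format.

```python
import hashlib

GENESIS = b"\x00" * 32   # assumed prev_record_hash for the very first record


def record_hash(record: dict) -> bytes:
    # Assumed canonical bytes: previous hash followed by the payload.
    return hashlib.sha256(record["prev_record_hash"] + record["payload"]).digest()


def append_record(chain: list, payload: bytes) -> None:
    prev = record_hash(chain[-1]) if chain else GENESIS
    chain.append({"prev_record_hash": prev, "payload": payload})


def first_divergence(chain: list):
    """Return the index of the first record whose link is broken, or None."""
    expected = GENESIS
    for index, record in enumerate(chain):
        if record["prev_record_hash"] != expected:
            return index
        expected = record_hash(record)
    return None
```

In this sketch an edit to record `i` surfaces at record `i + 1`, whose stored `prev_record_hash` no longer matches. An edit to the final record has no successor to expose it, which is one reason the sealed segment digest is needed in addition to the per-record links.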

**Per-segment signing.** When a segment is sealed, the audit log service
computes the segment digest (a hash over the sealed record range, anchored on
the running chain hash) and produces a signature over
`{ segment_index, sequence_range, record_count, segment_digest,
prev_segment_digest }`. Two signing modes, selected by manifest policy:

- **MAC mode (default).** A `SymmetricKey` with `KeyPurpose.integrity` produces
  an HMAC tag over the segment header via `SymmetricKey.mac`. Cheaper, no
  asymmetric key handling, sufficient when the verifier is trusted to hold the
  same key. Verification is `SymmetricKey.verify`.
- **Asymmetric mode.** A sign-only `PrivateKey` produces a signature via
  `PrivateKey.sign`. Used when audit evidence must be verifiable by a consumer
  that should *not* be able to forge records (e.g. an external reviewer holding
  only the public key). Verification uses the corresponding `PublicKey.verify`.

The audit log service receives a signing-capable key cap (a `SymmetricKey`
restricted to `mac`, or a `PrivateKey` restricted to `sign`) at manifest grant
time. It never holds raw key material — the key is a capability object per the
key-management design.
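
MAC mode can be sketched with Python's standard `hmac` module. This is an illustration under loud assumptions: the raw `key` bytes stand in for a `SymmetricKey` capability (the real service never sees key material), HMAC-SHA-256 is an assumed algorithm, and the JSON canonicalization of the header is hypothetical.

```python
import hashlib
import hmac
import json


def canonical_header(header: dict) -> bytes:
    # Assumed canonical form: JSON with sorted keys and no extra whitespace.
    return json.dumps(header, sort_keys=True, separators=(",", ":")).encode()


def seal(key: bytes, header: dict) -> bytes:
    """Stand-in for SymmetricKey.mac over the segment header."""
    return hmac.new(key, canonical_header(header), hashlib.sha256).digest()


def verify(key: bytes, header: dict, tag: bytes) -> bool:
    """Stand-in for SymmetricKey.verify, using a constant-time comparison."""
    return hmac.compare_digest(seal(key, header), tag)
```

A fixed canonical encoding matters because the tag covers bytes, not fields: two encodings of the same header would otherwise produce different tags and spurious verification failures.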

**What signs what.** The chain hash protects record *order and content* within
and across segments. The segment signature protects the *segment header*,
binding the digest, sequence range, and previous-segment digest under a key.
Together: a verifier with the verification key can confirm that the sealed
segments form an unbroken, unedited chain back to the first segment, and that
each seal was produced by the holder of the signing key.

**Key lifecycle.**

- **Provenance.** The signing key is produced by a `KeySource` and stored
  sealed in a `KeyVault` (per the key-management proposal). The manifest grants
  the audit log service a *use* capability for the key, not the vault.
- **Rotation.** Keys rotate on a policy interval (proposed default 90 days) or
  on demand. Rotation is segment-aligned: a segment is always signed by exactly
  one key. The first segment after rotation records a `key_rotation` marker
  carrying the new key's identifier (`KeySource.info` identifier — a label,
  not a secret) and the previous key's identifier. A verifier follows the
  identifier sequence to know which key verifies which segment range.
- **Revocation.** If a signing key is suspected compromised, it is revoked in
  the `KeyVault`. Revocation does **not** invalidate already-sealed segments —
  those remain verifiable against the (now-revoked) key, and the revocation
  itself is recorded as a `key_revocation` marker. What revocation prevents is
  *future* seals with that key. A consumer treats segments signed by a revoked
  key as "authentic at seal time, key later revoked" — still evidence, with a
  documented caveat.
- **What is NOT protected.** Tamper-evidence cannot protect records the kernel
  ring dropped before the service drained them, cannot protect the
  crash-window records, and cannot prevent an attacker who holds the live
  signing key from forging *new* well-formed history going forward. It detects
  edits to *already-sealed* history. These limits are stated in Assumptions.
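
How a verifier might follow the identifier sequence from `key_rotation` markers can be sketched as below. This Python illustration assumes a hypothetical in-memory shape for segments and assumes that the first segment also carries a marker (with no previous key); neither is specified by this proposal.

```python
def keys_by_segment(segments):
    """Map segment index -> key identifier by following key_rotation markers.

    `segments` is an ordered list of dicts. A segment that opens a new key
    period carries {"key_rotation": {"previous": ..., "new": ...}}.
    """
    mapping, current = {}, None
    for segment in segments:
        marker = segment.get("key_rotation")
        if marker is not None:
            # The marker must name the key that signed the preceding segments.
            if current is not None and marker["previous"] != current:
                raise ValueError(f"rotation chain broken at segment {segment['index']}")
            current = marker["new"]
        if current is None:
            raise ValueError("first segment must name its signing key")
        mapping[segment["index"]] = current
    return mapping
```

Checking `previous` against the key currently in force is what makes the mapping deterministic: a marker that skips or rewrites an identifier is rejected instead of silently reassigning a segment range.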

### 3. Production Subscriber Admission Policy

Today exactly one manifest-granted reader gets a volatile snapshot. The
production model keeps "observation is authority" but adds structure.

**Reader caps are typed and scoped.** The audit log service exposes readers as
distinct capability objects, not a single shared snapshot method:

- `HardwareAuditReader` — a read-only cap over a **scoped window**: a
  subscriber may be granted the full history, a single hardware-cap-tag slice
  (e.g. `DMAPool` events only), or a bounded recent window. Narrowing is
  structural — a narrower reader is a wrapper cap exposing less, per the capOS
  capability-model principle, not a rights bitmask.
- The cap exposes `snapshot` (cursor-based, preserving the existing
  field model) and `verify` (returns segment-chain verification status so a
  subscriber can confirm tamper-evidence without holding the signing key, when
  the deployment uses asymmetric mode and grants the public verification key).
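
Structural narrowing can be illustrated with a wrapper object that exposes less than the reader it wraps. This is a Python sketch with hypothetical class names and record shape. Python does not enforce the isolation a real capability object would, so the sketch shows only the shape of the interface, not the enforcement.

```python
class FullReader:
    """Hypothetical reader over the whole durable history."""

    def __init__(self, records):
        self._records = records          # list of {"sequence": int, "cap_tag": str}

    def snapshot(self, start_sequence: int, limit: int):
        window = [r for r in self._records if r["sequence"] >= start_sequence]
        return window[:limit]


class TagScopedReader:
    """Wrapper exposing only one hardware-cap-tag slice of an inner reader."""

    def __init__(self, inner, cap_tag: str):
        self._inner = inner
        self._cap_tag = cap_tag

    def snapshot(self, start_sequence: int, limit: int):
        # Over-fetch from the inner reader, then filter; this wrapper offers
        # no method that returns records carrying any other tag.
        everything = self._inner.snapshot(start_sequence, limit=2**31)
        return [r for r in everything if r["cap_tag"] == self._cap_tag][:limit]
```

The wrapper carries no rights field to check or forget to check: the narrower view is the only interface the subscriber holds.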

**Admission is manifest-declared, with a runtime broker path.** Two tiers:

- **Manifest-declared subscribers.** The boot manifest declares which services
  receive which scoped reader caps, exactly like every other capability grant.
  This is the baseline and covers the monitoring/audit service itself.
- **Runtime-admitted subscribers.** A later phase may route audit-reader
  requests through the userspace authority broker
  (`docs/proposals/userspace-authority-broker-proposal.md`), so an operator
  session can be granted a scoped, time-bounded reader without a reboot. This
  is explicitly future work, gated on the broker; it is named here so the
  admission model has a forward path, not so it ships in the first slice.

**Revocation.** Reader caps are ordinary caps and are revoked the ordinary way
(cap-table teardown). Revoking a reader does not touch the durable log.

### 4. Preservation of Existing Volatile-Snapshot Behavior

The kernel-side volatile ring and its snapshot ABI are **preserved unchanged**
as the staging tier:

- The bounded ring (capacity 64), `head`/`len`/`next_sequence`/`dropped_records`
  bookkeeping, and drop-oldest admission stay exactly as in
  `kernel/src/cap/hardware_audit.rs`.
- The snapshot cursor (`start_sequence`), truncation labels
  (`no-records-requested`, `request-limited`, `snapshot-limit-limited`,
  `available-records-exhausted`), and the `dropped_records` counter stay as the
  drain protocol between kernel and audit log service.
- The QEMU-only proof rings and `prove_qemu_snapshot_truncation_contract`
  remain harness scaffolding and are not on the durable path.
- The snapshot result's self-describing status fields stay, and their values
  advance as the durable path lands: `persistence_status` moves from
  `volatile-only` to a value naming the durable tier, `signature_status` from
  `unsigned` to the active signing mode, and `subscriber_admission_status` from
  `production-admission-policy-not-implemented` to the active policy. Changing
  those field *values* is an ABI-adjacent change and must land with schema,
  generated bindings, runtime decode, demos, and smoke assertions in one
  branch, per the task hazard preflight.

No existing hardware-audit smoke is invalidated by this design: the kernel-side
behavior those smokes assert is unchanged. New durable-path behavior gets new
smokes (see Evidence Expectations in the task file).

### 5. Assumptions

The durable evidence is trustworthy only under stated assumptions. A consumer
must know these before trusting the log.

- **Crash window is lossy.** Records in the kernel volatile ring that were not
  yet durably committed by the audit log service are lost on a crash or power
  loss. The durable log's recovery markers bound trustworthy history; they do
  not recover the lost window. Audit is best-effort at the volatile staging
  tier by design — it must never block hardware cap lifecycle.
- **Rollback below the audit log is out of scope.** This design assumes the
  `Store`/`BlockDevice` beneath the audit log service does not silently roll
  back committed segments. If the underlying storage can roll back (e.g. a
  snapshot-restore of the whole volume), the rolled-back log is still an
  internally consistent chain, so the rollback is visible only to a verifier
  that holds a later segment digest or sequence number from outside the
  rolled-back volume; the design does not *prevent* it. Volume-level rollback
  protection is the volume-encryption/storage proposals' concern.
- **Rotation is segment-aligned and monotonic.** A segment is signed by exactly
  one key. Key identifiers in `key_rotation` markers are assumed monotonic and
  unique so a verifier can deterministically map segment ranges to keys.
- **Key lifecycle is delegated.** Key generation, sealing, rotation scheduling,
  and revocation are the `KeySource`/`KeyVault` services' responsibility. This
  proposal assumes those primitives behave as the key-management proposal
  specifies; it does not re-implement them.
- **Signing key compromise forges the future, not the past.** An attacker
  holding the live signing key can produce well-formed *new* records. The hash
  chain plus revocation marker make the compromise *boundary* detectable once
  revocation is recorded, but records sealed during the compromise window are
  only as trustworthy as the key was. Asymmetric mode narrows this: a verifier
  holding only the public key cannot itself forge, but a compromised private
  key still can until revoked.
- **The audit log service is trusted to append.** Tamper-evidence detects
  edits to sealed history. It does not prevent the audit log service from
  refusing to append, stalling, or being killed. Availability of the audit
  path — restart policy, health checks — is the service-architecture and
  monitoring proposals' concern, not this one.


## Relationship to Other Proposals

- **Cryptography and Key Management** — this proposal *consumes*
  `SymmetricKey.mac`/`verify`, `PrivateKey.sign`, `KeySource`, and
  `KeyVault`. It adds no cryptographic primitive.
- **Storage and Naming** — the durable ring is an append-only ledger on the
  capability-native `Store`, using the previous-record-hash chaining the
  storage proposal already prescribes.
- **System Monitoring** — the audit log service is the hardware-cap-specific
  producer feeding the broader audit-record model in the monitoring proposal;
  scoped `HardwareAuditReader` caps follow the monitoring proposal's
  "observation is authority" and per-record-type retention principles.
- **Device Driver Foundation** — this design records hardware authority cap
  lifecycle events. It does not change where authority is checked, and does not
  claim provider-driver readiness or IOMMU isolation.


## Open Questions

- Segment size, retention counts, and rotation interval are proposed defaults,
  not ABI. They want a tuning pass once a real `BlockDevice` backend exists.
- Whether the `verify` method on `HardwareAuditReader` should return a full
  chain proof or a bounded status summary depends on the first real consumer's
  needs and is deferred to implementation.
- Cloud-bridge-backed `Store` for the durable log inherits the storage
  proposal's stale-write and size-bound rules; whether audit segments should
  also be content-addressed objects in that backend is left to the storage
  track.