# DMA Isolation Design

Security Verification Track S.11 gates PCI, virtio, and later userspace
device-driver work on an explicit DMA authority model. The immediate goal is
narrow: let the kernel bring up a QEMU virtio-net smoke without creating a
user-visible raw physical-memory escape hatch.


## Short-Term Decision

Use **kernel-owned bounce buffers** for the first in-kernel QEMU virtio-net
smoke.

The first virtio-net smoke stays on this conservative path:

- kernel-owned DMA pages
- kernel-owned virtqueue descriptor tables
- kernel-owned packet buffers
- kernel-programmed physical addresses
- copied packet bytes delivered to the network stack
- no DMA buffer capability exposed to userspace
- no physical address exposed to userspace
- no virtqueue pointer exposed to userspace
- no BAR mapping exposed to userspace

The kernel allocates DMA-capable pages from its own frame allocator, owns the
virtqueue descriptor tables and packet buffers, programs the device with the
corresponding physical addresses, and copies packet payloads between those
buffers and the networking stack.

This is deliberately conservative:

- It works before ACPI/DMAR or AMD-Vi parsing, IOMMU page-table management,
  MSI/MSI-X routing, and userspace driver lifecycle supervision exist.
- It keeps all physical-address programming inside the kernel, where the same
  code that allocates the frames also bounds the descriptors that reference
  them.
- It does not make the current `FrameAllocator` or `MemoryObject` capability
  part of the DMA path. `FrameAllocator` no longer exposes raw physical
  addresses, but DMA still needs device-owned buffer objects with IOVA and
  reset/revoke semantics rather than repurposed general memory caps.
- It gives the smoke a disposable implementation path. When NIC or block
  drivers move to userspace, bounce-buffer authority becomes a typed
  `DMAPool` object instead of an ad hoc physical-address grant.

An IOMMU-backed DMA-domain model remains the target for direct device access
from mutually untrusted userspace drivers, but it is not a prerequisite for
the first QEMU smoke. Without an IOMMU, a malicious bus-mastering device can
still DMA to arbitrary RAM at the hardware level; the short-term smoke assumes
QEMU-provided virtio hardware and protects against confused or untrusted
userspace, not hostile hardware.

## IOMMU Staging

IOMMU support is a deferred-with-known-dependency prerequisite for production
hardware claims and for moving direct DMA-capable NIC or block drivers into
userspace. capOS now discovers bounded ACPI IOMMU table summaries for Intel
DMAR and AMD-Vi/IVRS and records static DMAR DRHD include-all or single-hop
PCI endpoint device-scope coverage for retained DMA-capable PCI diagnostics
functions. Bridge and multi-hop scopes are retained for diagnostics but do not
prove endpoint attachment until PCI topology traversal exists, and include-all
fallback fails closed when retained DRHD units or scopes are capped.

The selected QEMU Intel remapping path now programs VT-d root/context and
second-level tables for manager-owned `DMAPool` pages, reports bounded fault
state, exports only domain-scoped IOVAs, and proves two claimed DMA-capable
functions receive distinct per-device domains and second-level roots. It also
asserts the production-path S.11.2 hostile-smoke matrix over the active
`DMAPool` / `DMABuffer` ledger. The decomposed integration umbrella for this
path closed `2026-05-23 23:35 UTC`
([`ddf-iommu-remapping-production-closeout`](tasks/done/2026-05-23/ddf-iommu-remapping-production-closeout.md)).
This is still QEMU-only evidence for the
selected path, not a general production
hardware-isolation claim: trusted sharing groups, AMD-Vi programming, and
production NIC/block userspace driver authority remain future work, and VM
shapes without usable remapping hardware remain on the explicit bounce-buffer
fallback.

The discovery parser is intentionally shallow and follows the static-table
formats documented by the Intel VT-d architecture specification, the AMD IOMMU
specification, and QEMU's q35-only `-device intel-iommu` emulation:

Future real remapping work is grounded by the primary-source
[IOMMU remapping research note](research/iommu-remapping.md), which records
Intel VT-d, AMD-Vi, and QEMU sections relevant to table programming,
invalidation, fault/status diagnostics, and QEMU-only smoke tests. That note is
source grounding only; it does not make the current diagnostics path a real
remapping implementation.

- Intel VT-d architecture specification:
  <https://www.intel.com/content/www/us/en/content-details/671081/intel-virtualization-technology-for-directed-i-o-architecture-specification.html>
- AMD IOMMU specification:
  <https://docs.amd.com/v/u/en-US/48882_IOMMU>
- QEMU manpage:
  <https://www.qemu.org/docs/master/system/qemu-manpage.html?highlight=numa>

The staged implementation order is:

1. Discover firmware IOMMU topology from ACPI static tables and fail closed if
   the tables are malformed, unsupported, or inconsistent with the PCI root
   complex being used. This first bounded table-discovery step is implemented
   for DMAR/IVRS summaries only; domain attachment is still planned.
2. Record each DMA-capable PCI function's attachment to an IOMMU unit, or
   explicitly keep the function on the prototype bounce-buffer-required policy
   when no trusted IOMMU domain can be created. This reporting step is
   implemented for retained PCI
   diagnostics functions when DMAR DRHD include-all or single-hop PCI endpoint
   device-scope metadata proves PCI segment/BDF coverage. Bridge and
   multi-hop scopes are not treated as attachment proof until PCI topology
   traversal exists, and include-all fallback fails closed when retained DMAR
   coverage metadata is capped; trusted domain creation is still planned.
3. Define and prove the claimed-device domain policy: one
   device-manager-owned DMA domain per claimed device or trusted sharing group,
   with all exported device addresses represented as IOVAs scoped to that
   domain rather than host physical addresses. The selected QEMU Intel path now
   implements the per-device form for two claimed DMA-capable functions;
   trusted sharing groups remain disabled and out of scope.
4. Attach `DMAPool` allocation, descriptor validation, MMIO ownership,
   interrupt ownership, and revocation state to the same device-manager ledger
   before any doorbell write can make a descriptor visible to hardware.
5. On revoke, reset, or driver death, stop new submissions, remove or
   invalidate IOMMU mappings before page reuse, and flush the relevant IOTLB
   state where the hardware model requires it.

Until those gates exist, direct DMA and userspace driver handoff remain
blocked. Devices that cannot be placed in a trusted IOMMU domain must stay on
kernel-owned bounce buffers or remain unsupported for production claims. This
also affects the hostile-smoke gate: S.11.2 smokes must prove that stale DMA
handles, stale completions, reset races, and teardown ordering fail closed for
IOMMU-backed IOVA mappings, while the process-exit / exit-under-DMA rows remain
covered by the selected backend evidence before a cloud or hardware driver can
be treated as isolated from the rest of memory.

### Fallback Policy For No Usable IOMMU Exposure

Some providers or VM shapes may not expose remapping hardware that capOS can
trust. That includes absent, malformed, unsupported, capped, or incomplete
DMAR/IVRS metadata; scopes that require PCI topology traversal capOS has not
implemented yet; and platforms where remapping hardware is unavailable or
cannot be programmed safely. Those shapes use a fail-closed fallback policy:

- Direct device DMA remains blocked. `direct_dma_trusted_domains` stays zero
  and `remapping_tables` stays `not-programmed`.
- Prototype devices that remain enabled use kernel-owned bounce buffers only.
  The kernel or device manager owns the pages, descriptor validation,
  physical-address programming, and packet or block-data copies between
  device-visible memory and non-device memory. General `FrameAllocator` and
  `MemoryObject` capabilities are not DMA authorities.
- capOS does not expose direct hardware authority for userspace `DMAPool`,
  `DMABuffer`, `DeviceMmio`, or `Interrupt` in the fallback shape. Result-only
  `.info` skeletons and bounded manifest grants may report conservative status.
  The current `DMAPool` manifest grant may allocate and free eight fixed
  manager-attached, kernel-owned, single-page bounce-buffer `DMABuffer` result
  caps, with backing pages scrubbed before frame release and no host physical
  address or IOVA exposed. That narrow fixed-slot allocation/free authority
  does not map DMA, program device-visible addresses, publish arbitrary CQ
  entries, program IOMMU/remapping tables, access arbitrary BAR registers or
  doorbells, or own hardware interrupt acknowledgement, mask, or unmask. The
  selected provider-TX proof is the current bounded exception: after the same
  manager-owned `DMABuffer` authority and bounce-scrub gates, queue `1` may
  publish the full selected TX queue-depth descriptor/avail window into the
  existing kernel-owned virtio-net TX ring before the first completion, ring one
  selected notify doorbell per accepted provider descriptor through the live
  no-write `notify_mmio` policy, and hand those bounded completions back through
  descriptor/generation-matched `DMABuffer.completeDescriptor` plus live
  `tx_interrupt.wait` completion events. The same selected path can also use
  `tx_interrupt.mask`/`unmask` to toggle only the selected TX MSI-X table
  vector-control bit and matching route state after live issue-id and route
  validation, and can retire one deferred LAPIC EOI for each delivered selected
  TX used-ring completion event, with `Interrupt.acknowledge` returning
  ABI-visible provider CQ/ack ledger fields plus hardware dispatch ack count,
  delta, token, and mutation flag for that bounded pairing. Full-queue QEMU
  bursts that coalesce selected TX MSI-X delivery use a bounded
  `INT $vector` proof hook only while the virtio TX completion path has an
  active full-window coalescing budget, so the selected IDT handler and
  deferred-EOI path remain observable without claiming full production IRQ
  ownership. Successful selected queue `1`
  `DMABuffer.completeDescriptor`, `tx_interrupt.wait`, and
  `tx_interrupt.acknowledge` results also carry bounded CQ event identity:
  sequence, queue, descriptor id, slot, slot generation, software descriptor
  generation, completion length, provider issue id, source id/generation, and
  route generation. Pre-event, duplicate ack, masked-route ack, wrong-order
  completion, teardown-drain, stale issue after release/regrant, reset, and
  stale-after-release paths keep that identity empty and do not mutate the
  bounded identity queue. Provider TX release also retires delivered but
  unacknowledged bounded CQ events for the live issue before clearing that issue:
  the stale post-release ack path is revoked, and the release proof records seven
  pending provider completion acks and their deferred EOIs as release-retired.
  The same selected path also has
  a bounded teardown-only drain for seven incomplete provider-published TX
  descriptors while one completed descriptor remains live: release may
  explicitly drain only the incomplete matching used-ring entries, retire those
  allocation-backed device-DMA TX queue ledgers, and free only after manager
  in-flight state is drained, without publishing provider CQ/IRQ events or
  issuing `DMABuffer.completeDescriptor` results. The paired provider RX
  bootstrap grant can now validate the live RX issue and selected virtio-net RX
  route before toggling only the selected RX MSI-X table vector-control bit and
  route state, and it can complete one selected-route RX `Interrupt.wait` after
  a delivered RX MSI-X/LAPIC dispatch. The paired `Interrupt.acknowledge`
  accounts exactly one RX dispatch token and retires one deferred LAPIC EOI for
  that delivered zero-CQ RX event; pre-event, masked-route, duplicate, and
  stale-after-release paths fail closed without mutating delivery or
  acknowledgement state. RX descriptor accounting and RX CQ ownership remain
  bounded to the synthetic proof path, and full hardware IRQ ownership remains
  blocked. These
  exceptions do not transfer full virtio-net ownership, direct DMA, IOMMU
  authority, arbitrary doorbells, production NIC/storage authority, or cloud
  readiness.
- capOS does not claim hostile-hardware isolation for those shapes. A malicious
  or compromised bus-mastering device without a trusted remapping domain can
  still write arbitrary RAM at the hardware level. The fallback is acceptable
  only for prototype devices and trusted emulator or provider shapes where that
  hardware threat is outside the claim; otherwise the device remains
  unsupported.
- Before any userspace driver path can rely on DMA or IRQ authority, S.11.2
  hostile smokes must pass for the selected backend. That includes stale DMA
  handles, stale completions, descriptor abuse, revoke/reset races, stale IRQs,
  teardown-under-DMA for IOMMU-backed IOVA mappings, and exit-under-DMA for
  the fallback bounce-buffer path when the fallback is used.

This fallback policy is separate from current diagnostics-only IOMMU metadata
coverage and from future real remapping-domain integration. Diagnostics can
report static firmware-table coverage for a PCI function, but unless capOS
creates a device-manager-owned remapping domain and programs mappings, the
active direct-DMA policy remains blocked. Future real integration must attach
`DMAPool`, `DeviceMmio`, `Interrupt`, ledger teardown, mapping removal or
invalidation, and required IOTLB flushes to the same ownership transaction
before a direct-DMA trusted-domain count can become nonzero.

## DMA Assurance Model Checked Evidence And Cloud Backend Inputs

The DMA assurance model records the claim boundary and checked bounded evidence
for DMA authority; the cloud backend contract it feeds is authoritative and
lives in the "Cloud DMA Backend" section below:
[`docs/proposals/dma-assurance-model-proposal.md`](proposals/dma-assurance-model-proposal.md)
and `models/dma/`. It is a design/evidence scaffold, not a new production
hardware gate by itself. The checked gates are `make model-dma-tla`,
`make model-dma-alloy`, `make kani-dma-authority`, and
`make model-dma-deferred-completion-loom`; `make dma-assurance-model-check`
aggregates them locally, while GitHub CI runs the Alloy/TLA+/Loom gates in
`dma-assurance-models` and the Kani gate in `kani-proofs`. The
operationalization track that reconciled the skeletons against landed DMA code
is tracked in
[`docs/backlog/security-verification.md`](backlog/security-verification.md)
("DMA Assurance Model Operationalization").

Cloud NIC/storage work must use the model as the checklist for backend
selection. Backend selection is a runtime, fail-closed decision the kernel
makes on each boot, with an optional operator override declared in the system
manifest; it is not a per-VM-shape safety assertion that a person signs off.
The authoritative selection rule and the manifest override contract are defined
in the "Cloud DMA Backend" section below.

Cloud backend evidence must separate provider-side DMA isolation from
guest-controlled remapping authority. SR-IOV, virtual NIC, GPU, accelerator,
or local NVMe support can identify a DMA-capable surface, but it is not enough
to claim direct-DMA isolation. A direct-remapping backend needs guest-visible
IOMMU or equivalent translation authority that capOS can discover and program.
The cloud evidence matrix must record provider API or documentation sources,
retrieval date, region or zone, instance type, image and kernel, live guest
PCI/device probes, IOMMU table/group observations, and maintenance or device
revocation behavior as the support-policy record for advertised targets. The
runtime probe, not this matrix, makes the binding per-boot selection.

The matrix does not replace runtime selection. capOS must choose the safest
backend on each boot from what it can actually observe and validate. Direct
remapping is enabled only when guest-programmable remapping authority is
present and passes the selected self-tests. A provider-remapped or bounce path
is selected only when direct DMA remains blocked and device-visible memory can
stay manager-owned. Ambiguous, contradictory, or unvalidated observations select
`Unsupported`.

The backend candidates are:

- **Direct remapping domain.** The provider shape must expose guest-programmable
  remapping hardware; capOS must discover and program a device-manager-owned
  domain for the target device; descriptor publication must be ordered after
  mapping; and teardown must remove mappings, observe required invalidation
  completion, and scrub before page reuse. The selected path must carry
  stale-handle, stale-completion, descriptor-abuse, revoke/reset-race,
  teardown-under-DMA, no-host-physical-exposure, and cross-domain alias
  evidence.
- **Labeled bounce-buffer fallback.** Direct DMA stays blocked, device-visible
  memory remains manager-owned bounce pages, host physical addresses and
  generic `MemoryObject` authority stay hidden from the driver, and stale
  handle/completion/teardown evidence covers the selected fallback. This path
  must keep `hostile_hardware_isolation=not-claimed` unless separate per-domain
  remapping evidence justifies a stronger provider-specific claim.
- **Unsupported.** Devices whose DMA behavior cannot satisfy either candidate
  stay unbound or disabled. A serial boot result or PCI enumeration line is not
  enough to claim cloud NIC/storage readiness.

Downstream cloud driver preflights must declare the candidate backend and map
their evidence to the assurance model's invariants: no host-physical exposure,
mapping before publication, no page reuse before teardown, stale-handle and
stale-completion fail-closed behavior, domain-scoped aliasing only, bounded
fail-closed holds, and explicit backend evidence. The evidence matrix is a
support-policy record of advertised targets; the runtime probe, not the matrix,
selects the backend on each boot.

## Cloud DMA Backend

This section is the authoritative contract for how capOS selects a DMA backend
for cloud NIC/storage devices. Selection is a runtime, fail-closed decision the
kernel makes on each boot from what it can actually probe and validate, with an
optional declarative override in the system manifest. There is no human
sign-off in the selection path: the runtime probe decides by default, and the
manifest override is config that an operator sets for a deployment, not a
doc-signing ritual gated on any specific person. Downstream cloud NIC/storage
driver slices consume this contract directly as their DMA-backend authority.

The preceding "DMA Assurance Model Checked Evidence And Cloud Backend Inputs"
section defines the three backend candidates; this section adds the
per-candidate trade-off
analysis, the runtime selection rule, the manifest override field, and the
downstream-contract scaffolding that a cloud NIC/storage driver declares. The
research substrate is the provider evidence inventory
[`docs/research/cloud-dma-provider-evidence.md`](research/cloud-dma-provider-evidence.md),
and the invariants and tool mapping are in
[`docs/proposals/dma-assurance-model-proposal.md`](proposals/dma-assurance-model-proposal.md).

### Provider-Written Addresses And No-IOMMU Brokered Bounce

Two DMA-address ownership models can be valid, but they do not apply to the same
backend.

- **Provider-written, kernel-validated addresses** (the NVMe Model B validator)
  are valid only when the provider's device-visible address is not a host
  physical address: a verified direct-remapping/vIOMMU domain-scoped IOVA, or a
  future synthetic software address namespace that the manager translates before
  hardware sees it.
- **Brokered address publication** is the no-IOMMU bounce-buffer model. The
  provider may own protocol state and buffer capabilities, but the kernel or
  device manager writes device-visible queue-base, PRP/SGL, or virtqueue address
  fields because those values are host physical or bus addresses on current
  no-IOMMU hardware.

Correction recorded 2026-05-27: the earlier reconciliation that treated a
no-IOMMU bounce window as a provider-visible, non-host-physical device address
space is not valid for the current implementation. On the `run-pci-nvme`
no-IOMMU shape, `DeviceDmaAllocation` carries host physical pages and the
reviewed IOVA export discipline keeps userspace IOVA/host-physical export
disabled. Therefore a provider-written NVMe queue base or PRP on that gate would
export a host physical address, violating the no-host-physical-exposure
invariant. A bounce buffer protects data ownership and copy discipline; it does
not create an untrusted-driver-safe IOVA namespace by itself.

The kernel on-notify DMA validator
(`kernel/src/cap/nvme_doorbell_validator.rs`, `validate_doorbell_scan`) remains
useful evidence for the provider-written model. On a queue-arm/`CC.EN` write and
on an SQ tail doorbell it scans the device-visible addresses the provider
published (queue bases; PRP1/PRP2 and one level of PRP-list indirection) and
fails closed before the doorbell takes effect on any address that is not wholly
within a window granted to the doorbell claim's owner at the live generation:
out-of-window, host-physical, cross-owner-alias, region-overrun, unaligned,
deeper-than-one-level PRP chain, or stale generation (`ScanReject`). Owner
identity and live generation come from the grant ledger, never from
provider-supplied metadata. A completion whose submission scan was never
validated, or was validated under a now-retired generation, does not wake a
waiter (`completion_wakes_waiter`), matching the stale-completion gate on the
virtio-net path. That mechanism is the right fit for the QEMU direct-remapping
lane and any future cloud shape that exposes guest-programmable remapping.

For the current GCP/no-IOMMU target, the storage path must use brokered bounce:
userspace supplies typed commands, queue ownership intent, and live
`DMABuffer`/buffer-cap handles; the manager materializes the actual
device-visible queue-base and PRP/SGL fields and orders publication, teardown,
copy, and scrub. That still leaves protocol-specific NVMe logic in userspace,
but it does not let userspace author raw device addresses.

The brokered admin-queue enable landed 2026-05-27
(`nvme-no-iommu-brokered-controller-enable`,
`device_manager::nvme_brokered_admin_queue_enable`). The provider allocates the
admin submission/completion queue pages through its `DMAPool` cap and requests
enable through the `CC` selected-write claim (`CC.EN` set); the manager resolves
those pages from the live ledger (`proof_buffers` slots 0/1), validates the
authored bases through `validate_doorbell_scan` (`ScanKind::QueueArm`), and
authors `AQA`/`ASQ`/`ACQ` plus the `CC.EN` write itself. The provider never
receives the host physical / device-visible queue-base address; `CSTS.RDY=1` is
observed only through brokered reads. This is the brokered model applied to the
admin queue-arm; the steady-state SQ-tail doorbell over provider-written PRPs
still needs the direct-remapping/synthetic-address lane above. Proof
`make run-pci-nvme`; provenance `docs/devices/nvme.md` §6.

### Provider-Side Isolation Versus Guest-Programmable Remapping

The decisive distinction for backend selection is between a *DMA-capable
surface* and *guest-programmable remapping authority*:

- SR-IOV (AWS ENA, Azure Accelerated Networking VF), a virtual NIC (gVNIC,
  virtio-net), a GPU/accelerator, or local NVMe identifies a device that does
  or could bus-master. This is a DMA-capable surface, not a safety property.
- A **direct-remapping** classification requires a usable Intel VT-d, AMD-Vi,
  or Arm SMMU unit that the guest can discover, program, and validate, with
  translation/fault/invalidation behavior matching
  [`docs/research/iommu-remapping.md`](research/iommu-remapping.md). A
  DMA-capable surface alone never implies this.
- Provider-side isolation facts (host-enforced VPC isolation, Nitro/host
  data-path bypass, hypervisor-side IOMMU) are **support-policy assumptions** a
  guest cannot prove from inside, not evidence that capOS can safely program
  direct DMA.

Runtime probing is authoritative for selecting the safe backend on a particular
boot: capOS chooses from the device inventory, the remapping authority it can
actually program, driver self-tests, and fail-closed probe results, and unknown
or contradictory observations select the labeled bounce-buffer path or
`Unsupported`. The cloud VM evidence matrix is the separate support-policy
record for advertised targets and provider assumptions a guest cannot fully
prove by itself; it does not override the boot-time runtime selection.

### Candidate Trade-Off Analysis

| Dimension | Direct remapping domain | Labeled bounce-buffer fallback | Unsupported |
| --- | --- | --- | --- |
| IOMMU coverage requirement | Requires a guest-programmable VT-d/AMD-Vi/SMMU unit capOS can program and validate per device. | None: used precisely when no usable guest IOMMU is exposed. | N/A: device stays unbound. |
| Cloud VM shape coverage (per inventory) | No probed GCE shape exposes a guest-programmable IOMMU; AWS/Azure shapes not yet probed. So no probed shape currently qualifies. | Indicated for shapes with a DMA-capable surface but no guest IOMMU (the probed GCE rows); fail-closed default for unproven shapes with a manager-ownable surface. | Ambiguous, contradictory, or unvalidated observations. |
| Per-operation cost | Translation only; no data copy. IOTLB/context-cache invalidation on teardown. | Copy between device-visible bounce pages and non-device memory on every transfer; Confidential VMs force this in hardware regardless. | None. |
| Hostile-smoke coverage today | Bounded QEMU Intel path only (`make run-iommu-remapping`, `ddf-iommu-remapping-production-closeout`); no cloud guest-IOMMU evidence. | S.11.2.7/8/9 rows enforced by the `make run-net` gate (`tools/qemu-net-smoke.sh`), with bounce-buffer virtio-net provider evidence in `ddf-provider-virtio-net-driver-closeout`; bounce-buffer `DMAPool` lifecycle by `make run-dmapool-grant`. The GCP-shape local binding precursor (`cloud-gcp-virtio-net-local-qemu-binding`) asserts, in both `make run-net` and `make run-ddf-provider-consumer`, that the enumerated/bound function matches the documented GCP 1st/2nd-gen virtio-net device surface (standard virtio-net, vendor `0x1af4`) and that the resolved backend is this labeled bounce-buffer path; it does not claim live GCP enumeration. | N/A. |
| Hostile-hardware isolation claim | Claimable only with per-domain remapping evidence and the IOMMU hostile smokes; not yet established for any cloud shape. | `not-claimed`: a malicious bus-mastering device without a trusted remapping domain can still write arbitrary RAM. | N/A. |

The GCE live-probe rows in the evidence inventory record that every probed GCE
shape (1st-gen `n1`, 2nd-gen `e2`, 3rd-gen Intel `c3`, and AMD-SEV Confidential
`n2d`) boots with `intel_iommu=off`, `DMAR: IOMMU disabled`, SWIOTLB software
bounce buffering, empty `/sys/kernel/iommu_groups`, and no DMAR/IVRS/IORT
table; the Confidential shape forces bounce buffering as a memory-encryption
invariant. These rows are support-policy expectations, not a hardcoded
selection table: they describe what capOS's runtime probe should expect to find
on those shapes today, and they confirm that the fail-closed default lands on
the labeled bounce-buffer path there. The boot-time probe, not this matrix,
makes the binding selection on each boot, so a shape whose IOMMU exposure
changes is handled by the probe re-evaluating rather than by editing this text.
AWS and Azure shapes carry no live-probe evidence yet; the probe treats them
the same as any other unproven platform and defaults to the bounce-buffer
path.

### Runtime Selection Rule (Fail-Closed Default)

On each boot, capOS probes the platform for guest-programmable remapping
authority — IOMMU presence, programmability, and coverage for the
DMA-capable functions it intends to bind — and selects the backend
fail-closed:

1. Probe the platform. Discover DMA-capable functions, then test whether a
   usable Intel VT-d, AMD-Vi, or Arm SMMU unit is present, discoverable, and
   programmable, and whether its translation/fault/invalidation behavior passes
   the self-tests in [`docs/research/iommu-remapping.md`](research/iommu-remapping.md).
2. Select fail-closed. Select the **direct remapping domain** backend for a
   device only when the probe positively verifies a usable+safe IOMMU for that
   device. If the probe cannot verify it — IOMMU absent, not programmable,
   coverage unproven, self-test failed, or observations ambiguous — select the
   **labeled bounce-buffer fallback** for any DMA-capable surface the manager
   can keep manager-owned, or **Unsupported** when even that cannot hold. This
   probe-gated rule is the default: on an unproven platform the probe cannot
   verify, so direct DMA is not used and the bounce-buffer path is chosen.
3. There is no human in this loop. The machine decides per boot; the only
   external authority is the optional manifest override below.

This is the boot-time authority. The cloud VM evidence matrix above is the
support-policy expectation of what the probe should find, not the decision
itself.

### Manifest Override Field (Operator Authority Lever)

An operator can override the runtime default for a deployment through one
declarative, auditable enum field in the system manifest's kernel parameters:
the `dmaBackendPolicy` field of the `SystemConfig` struct in
`schema/capos.capnp`. It is config, not a doc-signing ritual, and is not gated
on any specific person. The field is absent by default, and the absent default
applies the probe-gated runtime selection rule above. The enum values and their
interaction with the probe result are:

| Value | Probe verifies usable+safe IOMMU | Probe cannot verify | Notes |
| --- | --- | --- | --- |
| *(field absent)* | direct remapping domain | bounce-buffer fallback | Default: the probe-gated runtime selection rule. Direct-DMA when the probe verifies a usable+safe IOMMU, bounce-buffer fallback otherwise (fail-closed). Identical to `enable-if-verified`. |
| `enable-if-verified` | direct remapping domain | bounce-buffer fallback | The explicit, auditable form of the default. Probe-gated direct-DMA with fail-closed bounce-buffer fallback. Redundant with the absent default but kept for explicit configuration. |
| `enable-unsafe` | direct remapping domain | direct remapping domain | Force direct-DMA even when the probe cannot verify it. The operator takes responsibility for the platform's DMA isolation; the value name carries the warning. Use only on a platform whose isolation is known-good out of band. |
| `bounce-buffer` | bounce-buffer fallback | bounce-buffer fallback | Pin the labeled bounce-buffer path and disable direct-DMA entirely, even where the probe would verify a usable IOMMU. The most conservative value. |

Selection rules that hold for every value:

- The absent default and `enable-if-verified` select direct DMA only when the
  probe verifies a usable+safe IOMMU, and otherwise fall back to bounce-buffer.
- `enable-unsafe` is the sole value that can pick direct DMA without probe
  verification. The value name is the acknowledgement; there is no separate
  per-shape "I-accept-unverified" ceremony.
- `bounce-buffer` never selects direct DMA, even where the probe would verify a
  usable IOMMU.
- When the selected backend is `Unsupported` for a device (no manager-ownable
  DMA-capable surface at all), the device stays unbound regardless of the
  override value. The override governs direct-vs-bounce, not whether an
  unbindable device is forced online.

This selection mechanism is implemented. The `dmaBackendPolicy` capnp enum
encodes an absent field as ordinal 0 (`unspecified`), which decodes identically
to `enable-if-verified`; an unrecognized ordinal decodes fail-closed to the
bounce-buffer path (never direct DMA) rather than failing the manifest parse or
honoring the probe-gated default. The kernel resolves the backend on each boot
from the IOMMU probe verdict and this override and emits a boot proof line of
the form `dma: backend selection dma_backend=<direct-remapping|bounce-buffer>
dma_backend_override=<absent|enable-if-verified|enable-unsafe|bounce-buffer>
probe_verified_usable_iommu=<bool>`. The bounded QEMU shapes prove the
probe-gated default end-to-end: `make run-iommu-remapping` (verifiable Intel
VT-d shape) records `dma_backend=direct-remapping dma_backend_override=absent`,
and `make run-dmapool-grant` (no usable IOMMU) records
`dma_backend=bounce-buffer dma_backend_override=absent`. The override values and
the unknown-ordinal fail-closed decode are covered by `cargo test-config` over
the shared selection rule, capnp round-trip, and CUE decode.

The production cloud (non-`qemu`) build emits the same backend selection on
the cloudboot harness's serial-port-1 path as
`cloudboot-evidence: dma-backend bounce_buffer` (in the harness token
namespace), and on the bounce-buffer verdict drives the production-path
bounce-buffer `DMAPool` + `DMABuffer` grant proof
(`kernel/src/cap/dmapool_bounce_buffer_grant_proof.rs`, called from
`kernel::run_init`): stage a parked manager-attached `DMAPool` over one
DMA-capable PCI function from the inventory through
`device_manager::stage_bounce_buffer_dmapool_record`
(`kernel/src/device_manager/stub.rs`), allocate one bounded bounce-buffer
`DMABuffer` through
`device_manager::issue_manager_attached_dmabuffer_handle_with_request`
(which calls
`device_dma::allocate_manager_attached_dmapool_bounce_buffer_page`),
assert the cap-info labels (`userspace_dmapool=manager-issued-bounce-buffer`,
`allocation=single-bounce-buffer-page`, `real_dma=not-attempted`,
`direct_dma=blocked`, `host_physical_user_visible=0`,
`iova_export=disabled-future-only`), quiesce-before-release
(`release_dmapool_record_for_cap_release` returns `pending-buffer-release`
while the buffer is live), scrub-before-reuse (the released bounce-buffer
frame is zeroed in place between scrub and frame-free), and
stale-handle-after-detach, then emit
`cloudboot-evidence: dma-pool-grant <token>` with shape
`<seg>.<bus>.<dev>.<fn>-pool.<slot>.gen.<gen>-phys.<hex>` (every character
inside the harness grammar `[A-Za-z0-9._-]+`). The proof stages a pool,
allocates one bounce-buffer page, asserts the invariants, and emits a marker:
it does **not** program real DMA, attach a queue, program interrupts, claim a
device for sustained ownership beyond the grant, or emit
`provider-nic-bound` / `storage-bound`. A future direct-remapping verdict
skips the proof rather than aliasing direct-DMA onto the bounce-buffer
assertion shape.

Implementation note, 2026-05-29 11:50 UTC: the production-cloud bounce-buffer stub
now implements the cap-side `DMABuffer.map(R+W)` / `DMABuffer.unmap`
admission and state-machine entry points
(`validate_dmabuffer_map_admission`, `record_dmabuffer_user_mapping`,
`begin_dmabuffer_user_mapping_unmap`, `restore_dmabuffer_user_mapping`,
`clear_dmabuffer_user_mapping` in `kernel/src/device_manager/stub.rs`) so a
userspace holder of a manager-issued `DMABuffer` cap can map the single
bounce-buffer page read/write, write payload bytes through that mapping,
unmap it, and observe `DMAPool.info.mapped_vmas` reflect the live
mapping (`make run-cloud-dmapool-grant`). The same `Absent -> Mapped ->
Unmapping -> Absent` state machine the QEMU side enforces governs the
parked slot: a duplicate map fails closed with the `DmaPoolLive` shape,
a `clear` before the cap-side aspace unmap returns
`DmaPoolTeardownEvidenceInvalid`, an identity-mismatched `begin`/`clear`
fails closed, and a post-`freeBuffer` `map` fails closed with the
`DmaBufferStaleHandle` shape. The manager remains the single owner of the
bounce-buffer page's device-visible host-physical address and IOVA: the
mapping does not expose either to userspace, real DMA stays
`not-attempted`, direct DMA stays `blocked`, and IOVA export stays
`disabled-future-only`. This is a local-QEMU proof of the userspace mapping
path only; it does not unlock a live cloud NIC bind, IOMMU programming, or
production direct-DMA authority.

Implementation note, 2026-06-03: Phase C slice 2
(`cloud_virtio_net_userspace_ownable_vring_proof`,
`make run-cloud-prod-nic-driver-userspace-ownable-vring`) wires this landed
bounce-buffer authority to a userspace virtio-net driver's **own vring** without
adding a new isolation backend. The driver allocates its descriptor / available /
used ring pages through a granted `DMAPool`, and the kernel programs the
virtio queue-address registers (`queue_desc` / `queue_driver` / `queue_device`)
with the manager-owned bounce host-physical address. The brokered-address-
publication model (above) holds: the driver never authors a device address. It
writes an opaque per-buffer device-usable handle (exported through
`DMABuffer.info.deviceIova` with scope `bounce-handle`, a deterministic
non-address encoding of the buffer's manager identity), and the kernel resolves
that handle against the live grant ledger
(`device_manager::stub::resolve_virtio_vring_device_address`) to the real
host-physical address before any MMIO write. The no-host-physical-exposure
invariant is preserved end-to-end: `host_physical_user_visible=0`,
`iova_export=disabled-future-only`, and reads of the queue-address base
registers are refused so the resolved address is never read back into userspace.
Out-of-grant, host-physical-looking, and stale-generation handle writes fail
closed, mirroring the NVMe doorbell-scan reject classes. `queue_enable` /
`DRIVER_OK` stay fail-closed (slice 3); this is a local-QEMU proof of the
userspace-ownable-vring path only, not a live cloud NIC bind.

### Downstream-Contract Scaffolding

A cloud NIC/storage driver declares its chosen backend through the
device-manager policy fields that already label the local paths in this design.
The contract below scaffolds the values each candidate requires; it is the
shape a driver preflight declares, not an authorization to enable any value.

| Policy field | Direct remapping domain | Labeled bounce-buffer fallback | Unsupported |
| --- | --- | --- | --- |
| `direct_dma` | `enabled` for the programmed per-device domain | `blocked` | `blocked` |
| `trusted_domain` | manager-owned domain id for the device | `none` | `none` |
| `bounce_buffer` | not required (mapped IOVA path) | `required` | N/A (device unbound) |
| `remapping_tables` | `programmed` | `not-programmed` | `not-programmed` |
| `hostile_hardware_isolation` | claimed only with per-domain remapping evidence + IOMMU hostile smokes | `not-claimed` | N/A |
| `exported_device_addresses` | `iova-only`, domain-labeled | none (no host physical or IOVA exposed) | none |

Each candidate must satisfy the following gates, mapped to the assurance-model
invariants in
[`docs/proposals/dma-assurance-model-proposal.md`](proposals/dma-assurance-model-proposal.md):

- **Direct remapping domain** — mapping before publication; no page reuse
  before teardown, with mapping removal and observed invalidation completion
  ordered strictly before scrub/reuse; stale-handle and stale-completion
  fail-closed; domain-scoped aliasing only (per-device context entries where a
  peer domain cannot resolve another domain's IOVA); no host-physical exposure
  (export only the domain-scoped IOVA, labeled as meaningless outside the
  domain); backend evidence explicit. The only landed remapping-domain evidence
  is the bounded QEMU Intel path in
  [`ddf-iommu-remapping-production-closeout`](tasks/done/2026-05-23/ddf-iommu-remapping-production-closeout.md),
  exercised by `make run-iommu-remapping`. A cloud shape needs its own
  guest-programmable remapping evidence before this candidate applies to it.
- **Labeled bounce-buffer fallback** — `direct_dma=blocked`; all device-visible
  memory is manager-owned bounce pages; no host physical address and no generic
  `MemoryObject`/`FrameAllocator` authority is exposed to the driver;
  stale-handle, stale-completion, and exit-under-DMA teardown fail closed;
  `hostile_hardware_isolation=not-claimed` stays explicit. The landed evidence
  is the S.11.2.7/8/9 hostile-smoke rows enforced by the `make run-net` gate
  (`tools/qemu-net-smoke.sh`), with the bounce-buffer virtio-net provider
  evidence in
  [`ddf-provider-virtio-net-driver-closeout`](tasks/done/2026-05-23/ddf-provider-virtio-net-driver-closeout.md),
  plus the bounce-buffer `DMAPool` grant lifecycle in `make run-dmapool-grant`.
  As of `ddf-real-dma-s112-hostile-smokes`, the S.11.2.7 stale-IRQ-after-reset
  and S.11.2.8 stale-DMA-completion-after-reset closure summaries are asserted
  on the dedicated `make run-dmapool-grant` gate as well, so the DMA-grant gate
  fails closed on a hostile-row regression without depending on `make run-net`;
  exit-under-DMA (in-flight drain, scrub, page free) is enforced by
  `make run-dmapool-grant-exit`. See the "Fallback Policy For No Usable IOMMU
  Exposure" and "Hostile-Smoke Acceptance Matrix" sections in this document for
  the full gate text.
- **Unsupported** — the device stays unbound or disabled; no driver-visible
  DMA, MMIO doorbell, interrupt ownership, or storage/network readiness claim
  is made. A serial boot line or PCI enumeration line is not readiness evidence.

The `models/dma/` TLA+ and Alloy files, the extracted Kani core, and the focused
Loom model are checked bounded evidence for these invariants. They supplement
the QEMU/host evidence above; they do not satisfy a candidate hardware or cloud
backend gate by the mere presence of model files. Each checked result records
its tool version, command, bounds, and output in
[`../models/dma/README.md`](../models/dma/README.md), and the CI placement is
tracked in [`backlog/security-verification.md`](backlog/security-verification.md).

### First QEMU Intel Remapping Smoke Acceptance Gate

Decision recorded `2026-05-14 09:07 UTC`. The first slice that programs real
Intel VT-d remapping state under QEMU has an explicit, bounded acceptance gate.
This unblocks the
[`ddf-iommu-qemu-intel-remapping-smoke`](tasks/done/2026-05-14/ddf-iommu-qemu-intel-remapping-smoke.md)
task, whose `## Acceptance` section carries the full gate; the summary here is
the design-level decision.

The first slice programs the minimum Intel VT-d path for exactly one selected
DMA-capable function under a pinned QEMU q35 `-device intel-iommu,aw-bits=39`
shape: one root entry, one context entry bound to a device-manager-owned domain
ID and a 39-bit second-level page-table root, and a single second-level mapping
from one device-visible IOVA page to one kernel-owned `DMAPool` page, plus the
root-table-address register write and the global-command/global-status
translation-enable handshake. Acceptance requires observable proof that the
mapped IOVA was translated and that an out-of-domain IOVA faults closed in the
fault-status/fault-recording registers.

Invalidation is part of the gate, not a follow-up. On revoke, device reset,
driver death, and `DMAPool` page release, the slice must remove the
second-level mapping and invalidate the relevant context-cache and IOTLB state
through the selected invalidation interface, observe invalidation completion,
and order page scrub/reuse strictly after that completion. This sits at the
existing "remove IOMMU mappings" and "scrub and free DMA pages" steps of the
`DeviceOwnerState` revocation order and must not be reachable before
`QueuesQuiesced` or a completed `Resetting` transition.

IOVA export discipline: host physical addresses stay hidden from userspace in
every result cap, diagnostic line, and audit record. The selected QEMU Intel
IOMMU-backed `DMABuffer.info` path may export only the domain-scoped IOVA for
the live mapped buffer generation, explicitly labeled as meaningless outside
that domain; fallback bounce-buffer paths keep IOVA export disabled.

Per-device domain granularity: the selected QEMU Intel path programs two
distinct per-device context entries and second-level roots for two claimed
DMA-capable functions under the same DRHD. Both domains may use the same IOVA,
but the peer domain's second-level walk is proven not to resolve that IOVA to
the primary page; stale and wrong-owner domain assignment fail closed, and
trusted multi-device sharing groups remain disabled.

The kernel-owned bounce-buffer fallback stays the path for VM shapes without
usable remapping hardware and must remain explicitly labeled
(`fallback_policy=kernel-owned-bounce-buffer-only`,
`remapping_tables=not-programmed`,
`hostile_hardware_isolation=not-claimed`); it must never be silently
reinterpreted as direct-DMA or hostile-hardware isolation. The IOMMU-backed
path adds stale-DMA-handle, stale-completion, descriptor-abuse,
revoke/reset-race, teardown-under-DMA, cross-domain stale-handle, and
fail-closed teardown branch hostile smokes while the existing bounce-buffer
exit-under-DMA and stale-DMA evidence (the `device-dma` stale-handle and
stale-completion proofs and the S.11.2.7/S.11.2.8 closure summaries) is
preserved unchanged.

Explicitly out of scope for this first slice: AMD-Vi table programming;
trusted multi-device sharing groups; scalable-mode context entries; interrupt
remapping and device-IOTLB options; 48-bit IOVA space / 4-level tables;
production NIC or storage driver ownership; userspace `DMAPool` direct-DMA
authority; and moving the live virtio-net path off bounce buffers. The
acceptance evidence is QEMU-only emulator evidence, not a hardware-isolation
claim. The smoke adds or selects a focused `make run-iommu-remapping` gate
asserted by `tools/qemu-iommu-remapping-smoke.sh`.

Implementation note, 2026-05-02 04:58 UTC: the ACPI discovery path recognizes DMAR and
IVRS in the root table walk, reports absent/valid/malformed/unsupported state,
records bounded table length/header facts, DMAR host address width and flags,
IVRS IVinfo/flags, and bounded remapping-structure type counts. Malformed DMAR
or IVRS structure lengths stop parsing, and unsupported shapes such as parser
scan-cap overflow leave `direct_dma=blocked` with `bounce_buffer=required`.

Implementation note, 2026-05-02 05:31 UTC: the attachment-policy slice also
retains DMAR DRHD include-all and bounded PCI endpoint device-scope metadata,
including segment, single-hop BDF, and remapping-hardware register base, and
reports each retained DMA-capable PCI function as IOMMU-attached/covered when
that static table metadata covers its segment/BDF. Bridge and multi-hop scopes
are diagnostic-only until PCI topology traversal can resolve them, and
include-all fallback fails closed when retained DRHD units or scopes are
capped. Functions without trusted static coverage are reported as uncovered;
covered functions are reported as attached/covered, but both paths keep
`dma_policy=prototype-bounce-buffer-only`, `bounce_buffer_required`, and
`blocked_direct_dma_devices` because remapping domains are unsupported. The
direct-DMA trusted-domain count remains zero, and userspace `DMAPool`,
`DeviceMmio`, and `Interrupt` authority remain unavailable.

Implementation note, 2026-05-02 07:27 UTC: the domain-policy staging slice adds
a `pci: dma-domain policy proof` line and a diagnostics mirror. The proof
reports the future domain owner as the device manager, the domain granularity
as per-device or trusted-sharing-group, exported device addresses as IOVA-only,
`host_physical_user_visible=0`, `direct_dma_trusted_domains=0`,
`claimed_device_domains_ready=0`, `remapping_tables=not-programmed`,
`remapping_domains=not-started`, userspace `DMAPool`/`DeviceMmio`/`Interrupt`
as `not-started`, and prototype devices as kernel-owned bounce-buffer-only.
Malformed, unsupported, absent, or retained-capped metadata leaves direct DMA
blocked; `proof_result=ok` is only evidence for that conservative
blocked-direct-DMA policy.

Implementation note, 2026-05-09 18:47 UTC: the blocked-direct-DMA admission
decision now lives in the pure `capos-lib::device_authority` validator next to
the DMA/MMIO/IRQ handle validators. Host tests cover the current all-prototype
bounce-buffer shape, fail-closed results if any direct trusted domain is
claimed before the policy is ready, fail-closed results if the prototype
bounce-buffer count does not cover every DMA-capable function, and the absent,
malformed, unsupported, and retained-capped metadata labels. The kernel PCI
proof line and diagnostics mirrors consume that pure decision while preserving
the existing `direct_dma=blocked`, `remapping_tables=not-programmed`,
`domain_activation=not-started`, and `policy=blocked-direct-dma` labels. This
is IOMMU/remapping groundwork only; it does not program remapping tables,
create trusted domains, expose host physical addresses, or enable production
userspace DMA authority.

Implementation note, 2026-05-02 15:29 UTC: the COM1 `devices` diagnostics
command now prints the same bounded DMA-domain policy facts without naming an
owner identity. The line explains that all current DMA-capable prototype
functions remain on `direct_dma=blocked`, `bounce_buffer=required`,
`direct_dma_trusted_domains=0`, `claimed_device_domains_ready=0`,
`remapping_tables=not-programmed`, `exported_device_addresses=iova-only`,
`host_physical_user_visible=0`, and
`prototype_devices=kernel-owned-bounce-buffer-only`. This is a diagnostics
mirror for the current conservative policy, not evidence that IOMMU remapping
domains or userspace DMA authority exist.

Implementation note, 2026-05-02 15:45 UTC: attached device-manager `DMAPool`
records now store the current explicit bounce-buffer policy and the QEMU
device-manager proofs read it back through the active device record plus the
matching `DmaPoolHandle`. The logged policy scope is
`device-manager-attached-dmapool-bounce-buffer-policy`, with
`direct_dma=blocked`, `bounce_buffer=required`, `trusted_domain=none`,
`remapping_tables=not-programmed`, `remapping_domain=not-started`,
`userspace_dmapool=not-started`, `host_physical_user_visible=0`, and
`policy_bound_to_manager=true`. This binds the conservative policy to current
manager state; it still does not program remapping domains, expose userspace
`DMAPool`, or perform real DMA mapping teardown.

Implementation note, 2026-05-11 00:00 UTC: attached device-manager `DMAPool`
policy records now also carry an explicit manager-owned remapping-domain
ledger staging record. The lifecycle and imported-live proofs report
`remapping_domain_ledger_scope=device-manager-attached-dmapool-remapping-domain-ledger`,
`static_iommu_coverage=acpi-pci-diagnostic-only`,
`remapping_domain_owner=device-manager`,
`remapping_domain_granularity=per-device-or-trusted-sharing-group`,
`remapping_domain_ledger=manager-owned-staging-record`,
`remapping_domain_ready=false`, and
`iova_export=disabled-future-only`, while preserving
`direct_dma=blocked`, `remapping_tables=not-programmed`, and
`host_physical_user_visible=0`. This is a software ledger/readiness record
only: capOS still does not program Intel VT-d/AMD-Vi/QEMU remapping tables,
create a trusted direct-DMA domain, expose host physical addresses or IOVAs,
or claim production hostile-hardware DMA isolation.

Implementation note, 2026-05-11 17:07 UTC: the same manager-owned
remapping-domain staging record is now an explicit activation gate tied to the
active `DMAPool` record and matching handle. The device manager validates that
gate before current `DMAPool` policy/accounting and buffer issue paths proceed.
The gate reports `domain_ownership=manager-owned-active-dmapool`, but keeps
`direct_dma=blocked` because `remapping_table_programming=not-programmed`,
`iova_export=disabled-future-only`,
`remapping_invalidation_policy=not-installed`,
`remapping_iotlb_flush_policy=not-installed`, and
`remapping_stale_mapping_cleanup=not-installed`; the selected fallback remains
`remapping_fallback_policy=kernel-owned-bounce-buffer-only`. The activation
result is `blocked-remapping-prerequisites-missing` with
`remapping_activation_gate=fail-closed`,
`remapping_activation_blocker=remapping-tables-not-programmed`, and
`remapping_activation_side_effect=side-effect-blocked`. This is a software
policy gate and proof surface only: capOS still does not program Intel
VT-d/AMD-Vi/QEMU remapping tables, create a trusted direct-DMA domain, export
IOVAs or host physical addresses, remove real IOMMU mappings, flush IOTLB
state, or prove IOMMU-backed hostile stale-DMA behavior.

Implementation note, 2026-05-12 18:49 UTC: the manager-owned staging record
now includes a concrete per-device remapping-domain identity for the active
`DMAPool` handle: claimed-device domain identity, staged single-device sharing
group, BDF-derived device id, pool slot, pool generation, and owner generation.
The activation preflight treats that identity binding as a prerequisite before
direct DMA could be considered, and the QEMU lifecycle/imported-live proofs emit
a `dmapool remapping domain identity proof` line. The direct-DMA blocker remains
unchanged: remapping tables are still `not-programmed`, IOVA export is still
`disabled-future-only`, invalidation, IOTLB flush, and stale mapping cleanup are
still `not-installed`, direct DMA remains `blocked`, and the fallback remains
`kernel-owned-bounce-buffer-only`.

Implementation note, 2026-05-12 21:19 UTC: the same manager-owned
remapping-domain ledger now carries a separate mapping-lifecycle preflight
record. The record is bound to the active `DMAPool` handle and claimed-device
domain identity, and the existing device-manager policy gate validates it
before accepting the current bounce-buffer attach/accounting/buffer-issue
paths. Its direct-DMA result remains fail-closed with explicit blockers:
IOVA space, mapping install, removal before page reuse, invalidation policy,
IOTLB flush policy, and stale mapping cleanup are all `not-installed`. This
is still an in-repo software preflight only; it does not program remapping
tables, expose IOVAs or host physical addresses, enable direct DMA, remove
real IOMMU mappings, flush IOTLB state, or prove IOMMU-backed hostile
stale-DMA behavior.

Implementation note, 2026-05-12 22:26 UTC: capOS now has an Intel/QEMU
remapping table scaffold that can represent a DRHD identity field, PCI
segment/BDF/source ID, domain ID, QEMU Intel address-width choice, disabled
root/context entries, and a second-level page-table-root placeholder. PCI
diagnostics can bind that scaffold to discovered DRHD/segment metadata when
ACPI/PCI discovery provides it. The disabled backend registry's only accepted
active state is `disabled`. The proof labels distinguish representability
from programming: root-table pointer, context-entry programming, invalidation
registers, fault registers, protected-memory registers, and invalidation
queue remain `not-written`; remapping tables remain `not-programmed`;
hardware programming remains `not-attempted`; direct DMA remains `blocked`;
IOVA export remains `disabled-future-only`; and host physical addresses remain
hidden from userspace. This is still not Intel VT-d, AMD-Vi, or QEMU IOMMU
programming.

Implementation note, 2026-05-23 18:06 UTC: the first production `DMAPool` ledger
integration for the QEMU Intel remapping path now maps the selected
virtio-rng request-buffer IOVA to an active manager-owned `DMABuffer` page.
Mapping install is admitted through the matching active `DmaPoolHandle` and
`DmaBufferHandle` generations, stale pool and buffer generations fail closed,
and wrong-owner mapping attempts are side-effect blocked. On teardown, the
target second-level leaf is removed, the context-cache and IOTLB completion
polls finish, and only then is the `DMABuffer` released through the production
device-manager ledger. The proof keeps the IOVA internal, keeps
`host_physical_user_visible=0`, keeps userspace IOVA export disabled for this
slice, and leaves the no-remapping fallback policy as
`kernel-owned-bounce-buffer-only`.

Implementation note, 2026-05-23 19:18 UTC: QEMU Intel remapping fault
reporting now decodes VT-d `FSTS` plus `FRCD[0]` into a bounded kernel record
for the faulting IOVA, reason, requester source ID, and DMA read/write type.
The unmapped-IOVA, stale-handle, and stale-completion proofs record the fault,
clear it with write-1-to-clear semantics, verify the clear-after-record state,
and report source/IOVA match status without exposing host physical addresses.
The COM1 `devices` diagnostics path now prints an IOMMU fault summary that
reserves `fault_summary=clean` for a successful clear fault-status read and
labels unavailable fault-status reads as unavailable/fail-closed. Owner
identity, DRHD register bases, and host physical addresses stay hidden. The
optional audit route is explicitly not wired for this slice
(`volatile_audit=not-routed`).

Implementation note, 2026-05-13 15:29 UTC: active manager-owned `DMAPool`
remapping preflight records now consume the same retained DMAR DRHD/requester
metadata when PCI coverage is complete. The active disabled Intel/QEMU
scaffold records the retained DRHD identity and requester segment/BDF/source
ID for the bound pool handle, but absent, malformed, capped, unsupported, or
uncovered metadata still leaves the scaffold not-bound and disabled. This is
metadata-only binding: capOS still does not program root/context tables,
install or remove mappings, invalidate remapping caches, flush IOTLB state,
export IOVAs, enable direct DMA, or expose host physical addresses.

Implementation note, 2026-05-12 23:07 UTC: PCI diagnostics now include a
separate Intel/QEMU remapping MMIO-status proof for the selected DMA-capable
function. When complete retained DMAR DRHD metadata covers that function and
the register base is page-aligned, capOS maps only the selected
remapping-register page for bounded volatile diagnostic reads of the version,
capability, global-status, root-table-address, and fault-status registers. The
mapped label describes the diagnostic access pattern, not page-table write
protection. When the default diagnostics shape has no DRHD, metadata is capped,
or the retained DRHD base is invalid, the same proof reports
`mmio_window=not-mapped`, `mmio_read=not-attempted`, unavailable
capability/status/fault reads, and a fail-closed reason. The labels preserve
`remapping_tables=not-programmed`, `direct_dma=blocked`,
`fallback_policy=kernel-owned-bounce-buffer-only`, and
`hostile_hardware_isolation=not-claimed`. This is not remapping-domain
activation: capOS still does not write VT-d, AMD-Vi, or QEMU remapping
registers, install root/context tables or invalidation queues, export IOVAs,
or claim hostile-hardware DMA isolation.

Implementation note, 2026-05-13 01:20 UTC: active manager-owned `DMAPool`
records now also carry a generic disabled IOVA ledger under the
remapping-domain record. The ledger binds a domain-scoped reservation identity,
hidden internal range metadata, and reservation generation to the active owner
and pool generation. The same device-manager accounting path validates that
ledger before current bounce-buffer allocation, descriptor submission,
completion accounting, buffer free/page release, and pool release checks
proceed. Pure tests and QEMU proof labels reject stale reservation generations
and wrong owner generations as `disabled-iova-ledger-stale` with side effects
blocked. The active state remains disabled: proof output keeps the internal
range hidden and reports `iova_base_user_visible=0`,
`host_physical_user_visible=0`, `iova_export=disabled-future-only`,
`direct_dma=blocked`, `mapping_install=not-installed`,
`mapping_remove_before_page_reuse=not-installed`,
`invalidation_policy=not-installed`, `iotlb_flush_policy=not-installed`, and
`stale_mapping_cleanup=not-installed`. This still does not program remapping
tables, export IOVAs, expose host physical addresses, install or remove real
IOMMU mappings, flush IOTLB state, or claim hostile-hardware isolation.

Implementation note, 2026-05-11 18:44 UTC: the selected userspace virtio-net
TX provider smoke now grants a runtime-visible `DeviceMmio` notify BAR cap
named `notify_mmio`, but keeps the active DMA posture unchanged. The provider
still uses manager-owned bounce buffers, `direct_dma=blocked`, and
`host_physical_user_visible=false`; the notify cap is a no-write MMIO
admission boundary over the selected virtio-net TX notify offset, not a direct
DMA or descriptor-ring ownership transfer. The selected submit path validates
descriptor authority and scrubs the bounce page before consuming the
grant-derived notify policy, and it proves wrong value, wrong offset, stale
handle, and stale generation block before any doorbell. This does not program
IOMMU/remapping tables, export IOVA or host physical addresses, mutate the
real virtio-net descriptor ring, or claim production NIC isolation.

Implementation note, 2026-05-11 19:05 UTC: the `notify_mmio` grant remains
runtime-visible but is now explicitly no-direct-MMIO as well as no-write.
`DeviceMmio.map` and `DeviceMmio.read32` for that provider notify cap return
typed blocked results before any user VMA or register read, `DeviceMmio.info`
validates the live mapping generation before accepting the provider-notify
record, and `notify_mmio` detach clears the submit-path notify policy so a
later selected submit reports stale-handle blocking with no doorbell write.
The submit path also invalidates the cached notify policy on owner-generation
transitions and stale/missing cap-release detach boundaries before accepted
no-write authority can be reported.

Implementation note, 2026-05-11 19:56 UTC: the same provider smoke now grants
a runtime-visible `Interrupt` cap named `tx_interrupt` for the selected
virtio-net TX MSI-X route snapshot. This extends the grantable authority
boundary without changing the DMA posture: the provider still uses
manager-owned bounce buffers, direct DMA and IOMMU remapping remain blocked,
and the interrupt cap only proves generation-checked admission for
info/ack/unmask/wait/mask plus waiter cancellation and stale-after-release
blocking. Later bounded follow-ups add selected TX MSI-X vector-control
mask/unmask; this first grant slice did not deliver provider IRQs to userspace,
acknowledge or mask/unmask hardware, ring a doorbell, or mutate real
virtio-net descriptor rings.

Implementation note, 2026-05-11 20:09 UTC: provider TX waiters are now
separate no-delivery waiter-table entries rather than generic delivery
waiters. A pending `tx_interrupt.wait` remains pending across TX route
delivery-count advancement and only completes through the explicit
mask/release cancellation path. The staged provider TX grant source also
tracks a live-issued cap and refuses another live `tx_interrupt` alias for the
same route snapshot.

Implementation note, 2026-05-12 09:13 UTC: the selected userspace virtio-net
TX provider path now performs one bounded real descriptor/avail publication on
queue `1` while keeping the DMA posture conservative. The descriptor points at
the manager-owned bounce buffer already governed by the `DMABuffer` record;
the submit path validates live buffer identity, scrubs before publication,
requires the live no-write `notify_mmio` policy, asserts that the descriptor
page is ledgered to the virtio-net TX queue, and blocks wrong queue, stale
notify policy, and a real stale `DMABuffer.submitDescriptor(queue=1)` attempt
at the stale capability/liveness boundary before touching the real ring. The
proof logs descriptor/avail/used ring physical addresses for kernel evidence,
but those addresses are not returned to userspace. Because this slice does not
claim real used-ring/CQ completion, the
published page remains pinned in the manager in-flight record; userspace
`completeDescriptor`, `freeBuffer`, post-publication remap, and cap-release
drain do not retire, remap, or release it. A follow-up rings one selected TX
virtqueue notify doorbell after the same descriptor authority, submit-scrub,
live `notify_mmio` policy, submit-effect write, and publication gates, while
wrong-queue and stale-notify or stale-`DMABuffer` paths remain not-written.
Readback-mismatch publication failures do not write the immediate doorbell and
are treated as possibly-published ring state that quiesces later TX
notification and keeps the manager buffer pinned rather than claiming rollback.
Pre-publication bounce-page metadata remains `doorbell_write=not-written`.
Any immediate used-ring or IRQ effect from that doorbell is recorded only as
an out-of-scope hardware side effect. Later 2026-05-12 follow-ups advanced the
selected path to a bounded used-ring completion handoff:
`DMABuffer.completeDescriptor` validates the live manager-attached buffer and
in-flight descriptor id, observes the real TX used ring for the stored
software descriptor generation, consumes that entry through the existing
descriptor tracker and DMA ledger, and only then clears the manager in-flight
record. As of commit `e248d42b` (`2026-05-23 13:36 UTC`), kernel TX helpers
stay quiesced after provider ownership starts while the provider path can
publish the full selected TX queue-depth window of eight descriptors before the
first completion; the smoke records `live_inflight_after_submits=1/2/3/4/5/6/7/8`
(the ninth allocation rejected `dmapool-budget-exceeded`), blocks map/free/reuse
while any buffer is in flight, and proves wrong-order descriptor `7` used-ring
handling preserves the observed descriptor `0` completion for its matching
generation.
The
provider-facing
`tx_interrupt` waiter is a
runtime-visible completion-event consumer for the same selected route;
delivery validates the expected TX source id, source generation, route
generation, owner, driver-unmasked state, and live issue id before completing
each bounded event. A 2026-05-13 follow-up adds the bounded
incomplete-descriptor teardown drain: when one descriptor has completed and
seven remain incomplete, release retires only the incomplete descriptors'
allocation-backed TX DMA ledgers and clears only the selected virtqueue
descriptor/used-ring tracking needed for those releases, while CQ publication
and provider IRQ delivery stay blocked and the pending waiter remains
undelivered until the smoke explicitly cancels it through the existing
mask/cancel path. Commit `e248d42b` (`2026-05-23 13:36 UTC`) extends that drain
to the full selected TX queue-depth window and keeps the completed descriptor's
buffer retained until it is explicitly freed. A later 2026-05-13 remediation
binds each provider TX
in-flight descriptor to the submission-time provider issue/source/route
generation. If that old descriptor completes after `tx_interrupt`
release/regrant, `DMABuffer.completeDescriptor` fails closed as
`dmabuffer-provider-tx-stale-issue` before consuming the used ring, publishing
provider CQ/IRQ state, or advancing provider acknowledgements; later cap
release may still drain the descriptor as teardown-only.
`tx_interrupt.wait` posting is serialized with provider release, mask, and
delivery, and stale issue ids fail closed at admission and insertion. A later
2026-05-13 follow-up lets `tx_interrupt.acknowledge` account exactly one
already observed selected TX dispatch token paired with one delivered provider
CQ event; the smoke proves pre-event, duplicate, teardown-drain, masked-route,
reset/regrant, stale-after-release, and stale issue acknowledgements fail closed
before delivery-count, route-state, CQ, ack-ledger, or hardware-dispatch-ack
mutation. This is still bounded selected-route evidence: provider IRQ
ownership, deferred EOI, LAPIC/MSI-X acknowledgement, direct DMA, IOMMU mapping,
full virtio-net ownership, production NIC/storage migration, and cloud
readiness remain open. Commit `e248d42b` (`2026-05-23 13:36 UTC`) adds
release-time retirement for delivered but unacknowledged bounded provider TX CQ
events: the release proof now records seven pending provider completion acks
retired from the ledger in one live issue while preserving the separate
stale-bound in-flight descriptor proof, with stale post-release ack revoked and
no hardware ack claimed. A 2026-05-13 follow-up adds bounded selected TX MSI-X
mask/unmask only: live provider `tx_interrupt.mask` and `unmask` toggle the
selected TX vector-control bit plus route state, preserve generations and
delivery counts, and block stale issues before side effects.

Implementation note, 2026-05-02 19:43 UTC: the bounded zero-live
device-manager `DMAPool` lifecycle proof now treats its manager-attached DMA
buffer record as teardown-blocking metadata. The pool detach path still checks
authoritative live accounting first, then rejects zero-live detach while the
proof buffer is attached as `dmapool-buffer-attached`. Before the active free
path, the proof validates stale same-slot and wrong-identity `FreeBuffer`
operations through
`capos-lib::device_authority::validate_dma_buffer_operation`. The wrong
identity cases cover wrong owner generation, wrong pool slot, wrong pool
generation, and wrong buffer slot; each records `dmabuffer-stale-handle`, the
exact validator reason (`stale-owner-generation`, `wrong-pool`,
`stale-pool-generation`, or `wrong-slot`), `side-effect-blocked`, and a
preserved manager-owned buffer record. The stale same-slot case continues to
record `stale-slot-generation` and `buffer_stale_free_preserved=true`, then
the proof observes that pool detach still fails as
`dmapool-buffer-attached`. The proof clears the gate only after validating a
proof-scoped active `FreeBuffer` operation, scrubbing and freeing the
kernel-owned proof frame, and detaching the manager-owned buffer record. This
remains lifecycle evidence only: no userspace `DMAPool` or DMA-buffer
authority is exposed, no physical address or IOVA is exposed, and S.11.2
hostile stale-DMA smokes remain open.

Implementation note, 2026-05-03 02:31 UTC: the same zero-live
device-manager `DMAPool` lifecycle proof now validates manager-record
`CompleteDescriptor` authority for the attached `DmaBufferHandle`. Active
completion validation records `buffer_active_complete_result=ok`; freed-buffer,
reused-slot generation, and stale-after-revoke completion attempts fail closed
as `dmabuffer-stale-handle` with exact pure-validator reasons and
`side-effect-blocked`. This is manager-record validation evidence only: it does
not complete a real descriptor, publish a completion queue entry, grant a
userspace `DMABuffer`, run real DMA, or clean up or reuse production
userspace DMA pages.

Implementation note, 2026-05-14 14:05 UTC (DDF IOMMU remapping Slice A1): the
first slice that programs real Intel VT-d remapping state under QEMU has
landed. The `## First QEMU Intel Remapping Smoke Acceptance Gate` above defines
the full bounded gate; that gate is being delivered as a sequenced A1/A2/B/C
split (the slice was correctly scoped as bigger than one reviewable unit). This
note records Slice A1.

Pinned QEMU shape: `qemu-system-x86_64` 8.2.2, `-machine q35`,
`-device intel-iommu,aw-bits=39` (3-level second-level page tables, 39-bit IOVA
space). The kernel `iommu` module (`kernel/src/iommu.rs`, `cfg(qemu)`-only)
selects one DMA-capable function that is **not** the live virtio-net
bounce-buffer path (virtio-rng under the default smoke shape), allocates a root
table, one context table, a 3-level second-level page table, and one mapped
DMA page; encodes and writes one root entry, one context entry (binding the
requester source id to a domain id, `aw-bits=39`, and the second-level root),
and the second-level table-pointer / leaf entries through the HHDM; writes the
Root Table Address Register; and runs the global-command/global-status
`SRTP`-then-`TE` handshake, polling the status register for each step. The
capability-register extended-capability `IRO` field (IOTLB register offset) is
decoded and reported for Slice B's benefit. MMIO ordering invariants are
enforced with `SeqCst` fences: between the last in-memory table-entry write and
the RTAR write, between the RTAR write and `GCMD.SRTP`, and between the latched
root pointer and `GCMD.TE`.

Slice A1 proves translation with **kernel-side structural evidence only**, which
the gate's IOVA-export-discipline clause explicitly permits. Hardware confirms
translation-enabled (`GSTS.TES` + `GSTS.RTPS` polled set), the written entry
words are read-back-verified through the HHDM, the pure
`capos_lib::device_authority` validator accepts the layout, and the unmapped
IOVA's 3-level walk structurally terminates at a non-present entry. The proof
labels are scrupulously honest: `mapped_iova_translated=structural` (not
`hardware-dma`), `unmapped_iova_fault=structural-not-present` (not `observed`),
`proof_evidence=kernel-side-structural`. A real hardware-DMA translation and a
real fault-status fault require driving a device virtqueue through the IOMMU;
that is **deferred to follow-on task A2** (a virtio-rng virtqueue driver as the
DMA proof vehicle). Invalidation + IOTLB flush with completion polling (Slice B)
and the IOMMU-backed stale-handle / stale-completion hostile smokes (Slice C)
are also follow-on slices; at A1 their proof lines emit
`proof_result=deferred-next-slice`. A2, B, and C have since landed — see the
implementation notes below.

The table pages are recorded in a bounded ledger modeled on the device-manager
`DMAPool` page-accounting discipline (allocate-record, scrub-before-free on the
fail-closed path); mapping removal with IOTLB-flush-ordered scrub/free is Slice
B. IOVA export stays disabled (`iova_export=disabled-this-slice`), no host
physical address is user-visible (`host_physical_user_visible=0`), and no
hostile-hardware isolation is claimed (`hostile_hardware_isolation=not-claimed`).
The kernel-owned bounce-buffer fallback is unchanged for QEMU shapes without
usable `intel-iommu` hardware and is emitted with the explicit
`fallback_policy=kernel-owned-bounce-buffer-only` /
`remapping_tables=not-programmed` labels. The new `make run-iommu-remapping`
gate is asserted by `tools/qemu-iommu-remapping-smoke.sh`; `make run-net`,
`make run-dmapool-grant`, and `make run-diagnostics` continue to prove the
fallback path unchanged.

Implementation note, 2026-05-14 15:19 UTC (DDF IOMMU remapping Slice A2): the
device-DMA proof vehicle has landed, upgrading the A1 structural proof to a
real hardware-DMA proof and closing the literal hardware-DMA text of gate part
1. After the VT-d tables are programmed and `GCMD.TE` is set, a minimal
virtio-rng virtqueue driver — split into a mapped-DMA phase
(`crate::virtio::prove_iommu_rng_mapped_dma`) and an unmapped-DMA phase
(`crate::virtio::prove_iommu_rng_unmapped_dma`) — drives the device QEMU
exposes on the `intel-iommu` shape. The second-level
table now installs four leaf entries inside one shared L1 page: the request
buffer plus the three virtqueue ring pages (descriptor table, available ring,
used ring). The driver programs the device's `QUEUE_DESC` / `QUEUE_DRIVER` /
`QUEUE_DEVICE` registers and the request descriptor's `addr` field with the
**programmed IOVAs**, never the host-physical page addresses. Because VT-d
translation is global per requester once `GCMD.TE` is set, every DMA the
device issues — every ring access and the entropy write — must walk the
second-level table. A used-ring completion plus a non-zero buffer reading
therefore proves a real hardware DMA reached the kernel page through the
programmed IOVA translation: `mapped_iova_translated=hardware-dma`,
`proof_evidence=virtio-rng-hardware-dma`.

The driver then publishes a second descriptor whose `addr` is the
deliberately-unmapped IOVA and kicks the device. The device's DMA to that IOVA
raises a real VT-d translation fault; the kernel reads it back out of the Fault
Status Register (`FSTS.PPF`), the first Fault Recording Register's fault bit
(`FRCD[0].F` at the decoded `CAP.FRO` offset), and the faulting page address
recorded in `FRCD[0]` — which must equal the unmapped IOVA the device was
pointed at — and reports `unmapped_iova_fault=observed` with the
`fault_recording_reason` code. The fault gate is purely the VT-d register
surface: whether QEMU's virtio-rng still pushes the faulting descriptor onto
the used ring afterward (it does, with the entropy write dropped) is QEMU
device behavior, reported as the `unmapped_descriptor_uncompleted` diagnostic
field but deliberately not a gate condition. The fault registers are cleared
(write-1-to-clear) before the device DMA and again after the observed-fault
read so no stale fault is mistaken for the proof and no fault is left for a
later VT-d consumer. The MMIO discipline reuses A1's `NO_CACHE` mapping and the
descriptor/available-ring writes are `SeqCst`-fenced before the notify
doorbell. The two phases are deliberately split so the kernel reads the fault
registers strictly between them — the unmapped-IOVA descriptor is never in
flight while the mapped-DMA result is judged. The virtio-rng device negotiates
`VIRTIO_F_ACCESS_PLATFORM`, which is what makes QEMU route its DMA through the
platform IOMMU rather than treating the ring registers as host-physical
addresses; the `run-iommu-remapping` make target therefore creates the
virtio-rng device with `iommu_platform=on` (a target-scoped override of the
shared `QEMU_SECOND_DEVICE`, which no other run target needs because none of
them drives virtio-rng DMA). The IOMMU-backed hostile smokes (Slice C) were a
follow-on at A2 (`proof_result=deferred-next-slice`) and have since landed —
see the Slice C implementation note below; IOVA export stays disabled and no
host-physical address is user-visible.

Implementation note, 2026-05-14 17:19 UTC (DDF IOMMU remapping Slice B): the
invalidation + IOTLB flush + invalidation-ordered scrub/free has landed,
closing gate part 2. After the A2 hardware-DMA proof, `kernel/src/iommu.rs`
runs a revocation cycle (`run_invalidation_revocation_cycle`) that models the
device-manager `DeviceOwnerState` revocation FSM at the
`QueuesQuiesced -> Resetting -> DmaMappingsRemoved -> Dead` steps. The cycle
removes the four second-level leaf entries the A2 layout installed (request
buffer + the three virtqueue ring pages), `SeqCst`-fences so the in-memory
removal is visible to the IOMMU before the flush, then issues two
**register-based** invalidations: a context-cache invalidation through the
Context Command register (`CCMD_REG` at offset `0x28`, `CCMD.ICC` set with
`CCMD.CIRG` global granularity, polling `CCMD.ICC` clear for completion) and a
domain-selective IOTLB invalidation through the IOTLB register at the
A1-decoded `CAP.IRO` offset + 8 (`IOTLB.IVT` set with `IOTLB.IIRG`
domain-selective granularity and the domain id in `IOTLB.DID`, polling
`IOTLB.IVT` clear and reading `IOTLB.IAIG` back non-zero to confirm the
request was serviced). Both completion polls are bounded by the same
`VTD_STATUS_POLL_BUDGET` the A1 status handshakes use.

The hard ordering invariant — the whole point of the slice — is that the eight
VT-d ledger-owned table/ring/used pages and the separate production
DMAPool-owned request-buffer page are scrubbed and returned to their ledgers
**strictly after** both completion polls return. A `SeqCst` fence sits between
the completion reads and the scrub/free so the ordering is explicit in program
order. A poll that exhausts its bounded budget **fails closed**:
`invalidation_completed` is `false`, the pages are deliberately **not** freed
(a page reused while hardware may still hold a cached translation through it is
a stale-DMA hole), the ledgers keep them accounted rather than
leaked-and-forgotten, and the proof line reports
`proof_result=fail-closed`. Slice B uses **register-based invalidation only**:
no `GCMD.QIE` queued-invalidation bit is set, so the A1 single-bit-`GCMD`
discipline (correct only by minimalism — no other persistent `GCMD` bit set)
still holds and the `GCMD`-reconstruct boundary is not crossed. The
production DMAPool programming-abort path follows the same rule: if VT-d
programming fails before root/translation state can expose the DMAPool page to
hardware, the prepared DMABuffer/DMAPool records are detached; if the mapping
may already be hardware-visible, the partial VT-d ledger is carried through
the same leaf-removal and invalidation teardown before any VT-d table/ring page
or production DMAPool page can be reused. The
`make run-iommu-remapping` smoke and `tools/qemu-iommu-remapping-smoke.sh` now
assert the `invalidation proof` line as `proof_result=ok` with
`mapping_removed=true context_cache_invalidated=true iotlb_flushed=true`
`iotlb_actual_granularity_nonzero=true invalidation_completed=true`
`page_reuse_ordered_after_invalidation=true table_pages_live_after=0`
`invalidation_interface=register-based-ccmd+iotlb`, and forbid both a
regression to the Slice-A deferred label and any `invalidation_interface=queued`
value. The IOMMU-backed stale-handle / stale-completion hostile smokes (Slice
C) were the deferred follow-on; they have since landed (see the note below),
and as part of that work the single-phase `run_invalidation_revocation_cycle`
was refactored into a two-phase `run_target_revocation_phase` +
`complete_revocation_teardown` so the hostile re-drive can sit between the
phases — the Slice B contract (every page freed strictly after its mappings
are invalidated) is unchanged, and the combined `invalidation proof` line
still asserts `proof_result=ok` for the complete teardown. IOVA export stays
disabled and no host-physical address is user-visible.

Implementation note, 2026-05-14 19:13 UTC (DDF IOMMU remapping Slice C): the
IOMMU-backed hostile stale-DMA smokes have landed, closing gate part 5 and the
parent IOMMU remapping task. Closing the slice required refactoring the Slice
B revocation into **two phases** so the hostile re-drive can run against a
*partially* revoked remapping — the original single-phase cycle freed every
page (including the virtio-rng descriptor table and available ring) before any
hostile re-drive could observe a fault, so the re-driven device read an
all-zero descriptor and issued no DMA at all. The `kernel/src/iommu.rs` ledger
now classes each page by revocation-phase role: `Target` (request buffer +
used ring — what the device's DMA *lands on*), `RingInfra` (descriptor table +
available ring — what the device *reads* to issue a DMA), and `Table`
(root/context/second-level tables). `run_target_revocation_phase` removes the
`Target` second-level leaves, invalidates the context-cache + IOTLB, and frees
the `Target` pages — while deliberately keeping the `RingInfra` + `Table`
pages mapped and live. `run_hostile_stale_dma_cycle` then re-drives the
**same still-live old-generation** virtio-rng device through the new
`crate::virtio::prove_iommu_rng_stale_dma` (each re-drive uses a fresh
available-ring index past the A2 phases so the device sees a genuinely new
descriptor): because the ring-infra pages are still mapped, the device reads a
*valid* descriptor whose `addr` is a *revoked* target IOVA, so the DMA faults
in the IOMMU (`FSTS.PPF` + `FRCD[0].F`, recorded faulting address is the stale
IOVA) instead of reaching memory. A stale mapping-install attempt is refused
(`attempt_stale_mapping_install` — the `RevokedRemapping` token is a
dead-domain receipt with no live table handle, not install authority), and the
freed `Target` page reads back as the scrubbed zeros. A second re-drive at the
revoked used-ring IOVA faults too, publishes no device-written used-ring CQ
entry into the freed page, exposes no memory to a would-be new owner, and
makes no freed page eligible for reuse. `complete_revocation_teardown` then
finishes the Slice B teardown by revoking + freeing the `RingInfra` + `Table`
pages; the combined gate-part-2 `invalidation proof` line
(`invalidation_phases=target-then-ringinfra`) still asserts `proof_result=ok`
for the full two-phase teardown, with the same hard ordering invariant in each
phase (pages freed strictly after that phase's invalidation completion polls
return; a poll that exhausts its budget fails closed and the phase's pages are
not freed). The load-bearing observation is the **revoked translation state**
— not device cooperation, not a software ledger drop — blocking the stale DMA,
confirmed by the VT-d fault registers plus a freed-page-stays-scrubbed
read-back through the HHDM. Existing bounce-buffer stale-DMA evidence (the
`device-dma` S.11.2.7/S.11.2.8 proofs) is preserved unchanged; the
IOMMU-backed hostile smokes are strictly additive. The
`make run-iommu-remapping` smoke and `tools/qemu-iommu-remapping-smoke.sh` now
assert both `hostile` proof lines as `proof_result=ok` and forbid regression
to the deferred, `not-reached`, or `fail-closed` labels. IOVA export stays
disabled (`iova_export=disabled-this-slice`), no host-physical address is
user-visible (`host_physical_user_visible=0`), and no hostile-hardware
isolation is claimed (`hostile_hardware_isolation=not-claimed`).

Implementation note, 2026-05-23 (domain-scoped IOVA export): the selected QEMU
Intel production `DMAPool` path now exposes the mapped request-buffer IOVA
through `DMABuffer.info` only while the matching active `DmaBufferHandle`
generation is live. The schema fields are `deviceIova`,
`deviceIovaScope=domain-scoped-iova`,
`deviceIovaMeaning=meaningless-outside-domain`, and
`iovaExport=domain-scoped-only`; the production remapping proof asserts that
`deviceIova=0x200000` matches the installed second-level mapping. After the
buffer is freed and the pool is released, an export attempt on the same handle
fails closed with `side-effect-blocked`. The bounce-buffer grant path still
reports `deviceIova=0`, `deviceIovaScope=none`,
`deviceIovaMeaning=not-exported`, and `iovaExport=disabled-future-only`.

Implementation note, 2026-05-23 21:34 UTC (production-path hostile smokes):
the selected QEMU Intel path now emits and asserts `iommu-remapping:
production dmapool hostile proof` over the active manager-owned `DMAPool` /
`DMABuffer` ledger. The proof ties the raw VT-d stale-handle and
stale-completion faults to the production mapped IOVA, synthetic stale
pool/buffer generation mismatch candidates, post-teardown stale-handle export
failure, and per-device cross-domain boundary. It covers stale IOVA after
revoke/reset, descriptor abuse, revoke/reset race ordering, stale completion
after reset, teardown-under-DMA ordering, and cross-domain stale-handle
attempts; no second-level entry is installed for stale authority, no CQ entry
is published, no new-owner memory is exposed, and page reuse stays ordered
after invalidation completion. It does not claim a process-exit trigger for
the IOMMU path; the existing `make run-net` bounce-buffer evidence remains the
exit-under-DMA source. The same smoke asserts the
`complete_iommu_dmapool_mapping_teardown` prerequisite-false return and the
`hold_iommu_dmapool_mapping_ledger_after_abort` path as fail-closed branch
evidence. Existing bounce-buffer S.11.2 evidence from `make run-net` and
`make run-dmapool-grant` is preserved unchanged.

Implementation note, 2026-05-26 05:55 UTC (direct-DMA posture transition for
the selected QEMU Intel path): the closeout slices above landed the full
mechanism — real hardware DMA over a manager-owned `DMAPool` `DMABuffer` page
mapped through the per-device IOMMU domain, domain-scoped IOVA export, per-device
domains, and the production hostile matrix — but deliberately deferred the
headline `direct_dma=enabled` claim behind `iova_export=disabled-this-slice`.
The selected QEMU Intel path now emits `iommu-remapping: direct-dma posture
real_dma=attempted direct_dma=enabled remapping_tables=programmed
trusted_domain=<domain-id> descriptors_reference=domain-scoped-iova
mapped_page_source=manager-owned-dmabuffer mapping_installed_before_doorbell=true
invalidated_before_page_reuse=true bounce_buffer=not-required
exported_device_addresses=iova-only host_physical_user_visible=0
hostile_hardware_isolation=not-claimed proof_result=ok`. Every field is computed
from the real proof facts, not asserted as a constant: `remapping_tables=programmed`
requires the root/context/second-level entries written plus the SRTP/TES
handshakes; `real_dma=attempted` requires the virtio-rng device's mapped DMA to
have completed through the programmed IOVA (`hardware_dma_translation_proven`);
`direct_dma=enabled` additionally requires the manager-owned `DMAPool` mapping to
have been installed before the device doorbell; and `invalidated_before_page_reuse`
folds in the two-phase revocation's `page_reuse_ordered_after_invalidation` and
`invalidation_completed` results. This is bounded QEMU-emulator evidence, so
`hostile_hardware_isolation` stays `not-claimed` (real hostile-hardware isolation
needs real hardware, not QEMU). The no-IOMMU `run-net` / `run-dmapool-grant`
bounce-buffer fallback is untouched: it keeps `direct_dma=blocked` with no IOVA
export, and `make run-iommu-remapping` now forbids this path from regressing to
the bounce-buffer fallback proof or to a blocked/not-attempted posture. The
contract table in "Downstream-Contract Scaffolding" (direct-remapping domain:
`direct_dma=enabled`, `remapping_tables=programmed`,
`exported_device_addresses=iova-only`) is now backed by an emitted, asserted
posture on the selected path.

## Authority Model

Device authority is split into three independent capabilities:

- `DMAPool`: authority to allocate, expose, and revoke device-visible memory
  within a kernel-owned physical range or IOMMU domain.
- `DeviceMmio`: authority to map and access one device's register windows.
- `Interrupt`: authority to wait for and acknowledge one interrupt source.

Holding one of these capabilities never implies the others. A driver needs all
three for a normal device, but the kernel and init can grant, revoke, and audit
them separately.

### Production Handle Epoch Invariants

All three object families use opaque handles whose identity is checked against
kernel-owned records before every operation. A raw object id is never enough to
authorize DMA, MMIO, interrupt waits, acknowledgements, descriptor submission,
or teardown. A handle is accepted only when all of these facts match in the
same ownership transaction:

- the object id resolves to a live record of the expected type;
- the handle's device owner generation matches the current device-manager
  owner record;
- the handle's pool, mapping, slot, source, or route generation matches the
  current reusable subrecord;
- the record state permits the requested operation.

The exact ABI shape may change when the capability surface is implemented, but
production handles must carry the equivalent identity:

```rust
struct DmaPoolHandle {
    device_id: u32,
    owner_generation: u64,
    pool_id: u32,
    pool_generation: u64,
}

struct DmaBufferHandle {
    device_id: u32,
    owner_generation: u64,
    pool_id: u32,
    pool_generation: u64,
    slot: u32,
    slot_generation: u64,
}

struct DeviceMmioHandle {
    device_id: u32,
    owner_generation: u64,
    bar: u8,
    mapping_id: u32,
    mapping_generation: u64,
}

struct InterruptHandle {
    device_id: u32,
    owner_generation: u64,
    source_id: u32,
    source_generation: u64,
    route_generation: u64,
}
```

Object identity fields have distinct jobs:

- `DMAPool` handles name the claimed device, the device owner generation, and
  the pool record generation. Buffer handles issued by the pool repeat the
  device-owner and pool identity and additionally name a buffer slot and slot
  generation. The pool identity prevents a handle from crossing devices or
  owners; the slot identity prevents a freed or reused buffer slot from
  accepting an old descriptor, free, or completion.
- `DeviceMmio` handles name the claimed device, owner generation, BAR or
  subrange mapping record, and mapping generation. The physical range, cache
  attributes, and access policy remain in the kernel record and are not
  user-editable handle fields.
- `Interrupt` handles name the claimed device, owner generation, source record,
  source generation, and route generation. Waiter records may carry their own
  waiter generation internally, but they must be invalidated whenever the
  source or route generation changes.

Owner generations and subrecord generations are intentionally separate. The
device owner generation belongs to the device-manager ownership record and
invalidates every `DMAPool`, `DeviceMmio`, and `Interrupt` handle for the old
owner when ownership is revoked, transferred, reset, or reassigned. Pool,
buffer-slot, MMIO-mapping, interrupt-source, and route generations belong to
records that may be reused below the device owner. They prevent stale buffer,
mapping, route, waiter, and completion handles from matching a newly allocated
subrecord even when the device id or pool id is reused.

Every epoch is non-wrapping for authority purposes. Implementations must use an
epoch width that cannot wrap during the object's lifetime, or permanently retire
the exhausted device, pool, slot, mapping, source, or route record. Epoch
exhaustion is a closed allocation or reassignment failure; it must never wrap
back to a value that could match an old handle.

Generation mismatch, wrong object type, wrong device owner, freed slot, detached
source, revoked mapping, and wrong device-owner state are hard closed results.
The failed operation must not program a descriptor, ring a doorbell, perform an
MMIO write, unmask or acknowledge an interrupt, wake a waiter, publish a CQE,
decrement completion accounting, free a page for reuse, or mutate the device
ledger except for bounded failure accounting or audit metadata.

Transfer, revoke, reset, and reassignment are ordered around those epoch checks:

- **Transfer:** The old owner leaves `Active`, the owner generation advances,
  and old handles become invalid before a new owner receives handles. A transfer
  may preserve hardware state only after old interrupt notifications, MMIO
  write authority, and DMA submissions are either quiesced or represented by
  old-generation ledger entries that the new owner cannot consume.
- **Revoke:** The device manager invalidates user-visible handles first, then
  follows the revocation order below: MMIO write authority removed, interrupts
  masked or detached, queues quiesced or reset, mappings removed, and pages
  scrubbed before release.
- **Reset:** Reset or disable advances the owner generation and any affected
  source, route, pool, mapping, and buffer-slot generations before new handles
  can be issued. If old DMA writes cannot be proven stopped, buffer slots stay
  unavailable until reset completion and mapping invalidation prove reuse is
  safe.
- **Reassignment:** Interrupt sources, MMIO mappings, and DMA pool records are
  detached or unmapped, their subrecord generations advance, pending waiters or
  completions are drained or marked stale, and only then can a new owner receive
  authority for the reused source, mapping, or slot.

Handle reuse rules:

- stale handles fail closed;
- freed-handle reuse fails closed;
- reallocated slots must not restore authority to old handles;
- old interrupt waiters must not observe or acknowledge a new owner's
  interrupt source;
- old DMA handles must not reference a newly allocated buffer in the same slot.

Production proof obligations are split between host tests and QEMU smokes.
Host tests must cover the pure validator and state-machine cases: stale owner
generation, stale pool or mapping generation, stale buffer slot, stale interrupt
source or route, wrong owner, wrong device, wrong object type, freed object,
wrong state, epoch exhaustion/retirement, and no side effects on failure. QEMU
smokes must prove the hardware-facing ordering: stale DMA handles after
free/reuse cannot submit descriptors, stale DMA completions after revoke/reset
cannot publish CQEs or mutate reused buffers, stale MMIO handles cannot ring
doorbells after revoke, stale interrupt waiters or acknowledgements cannot wake
or affect a new owner, and process-exit or driver-crash teardown reaches a
zero-live ledger before pages are reused. These production handles and proofs
remain open; the current QEMU scratch proofs are prerequisite evidence for this
contract, not completion of it.

Implementation note, 2026-05-02 13:18 UTC: `capos-lib::device_authority` now
implements the bounded pure host-testable validator prerequisite for these
handle epoch invariants. The module models the documented handle and record
identity fields for `DMAPool`, DMA buffer, `DeviceMmio`, and `Interrupt`,
separates device-owner generations from pool, slot, mapping, source, and route
generations, returns explicit fail-closed error labels, blocks the relevant
side-effect class on validation failure, and refuses epoch wrap or retired
epoch reuse. This does not expose production userspace handles, wire kernel
device paths, attach budget/OOM policy to real handle creation, or complete
the QEMU stale-handle or S.11.2 hostile-smoke gates.

Implementation note, 2026-05-03: the pure host-test operation matrix now
enumerates every current validator operation variant:
`DMAPool::{AllocateBuffer,IssueBufferHandle}`,
`DMABuffer::{SubmitDescriptor,CompleteDescriptor,FreeBuffer}`,
`DeviceMmio::{Map,Read,Write,RingDoorbell,Unmap}`, and
`Interrupt::{Wait,Acknowledge,Mask,Unmask}`. Each row asserts active
acceptance plus stale owner/subrecord, freed, revoked, and retired failures
with the exact blocked side-effect class for that operation. This remains
ABI-independent host-test evidence only; it does not create production
userspace handles or replace the QEMU stale-handle and S.11.2 hostile-smoke
gates.

Implementation note, 2026-05-02 13:43 UTC: the current kernel device-manager
`DMAPool` lifecycle and imported-live accounting proofs now adapt their BDF,
owner generation, pool slot, and pool generation into
`capos-lib::device_authority` records. The QEMU proof records active validator
success, stale-after-revoke failure as `dmapool-stale-handle`, the validator
reason `stale-owner-generation`, and `side-effect-blocked`. This is still a
bounded kernel-proof adapter, not production userspace handle exposure,
`DeviceMmio`/`Interrupt` handle wiring, production page cleanup, or S.11.2
hostile smoke completion.

Implementation note, 2026-05-02 17:04 UTC: the current kernel device-manager
also has a bounded manager-owned `DeviceMmio` record proof adapter. The record
carries BAR, mapping id, mapping generation, and owner generation fields, and
the QEMU `virtio-rng` device-manager path validates a `RingDoorbell` operation
through `capos-lib::device_authority`. After revoke begins, the old handle
fails through the pure validator as `stale-owner-generation`, records
`devicemmio-stale-handle` plus `side-effect-blocked`, and no doorbell write is
attempted. The lifecycle proof blocks `RevokingHandles -> MmioRevoked` while
the record is attached, then allows the transition after bounded detach. This
does not expose production userspace `DeviceMmio` authority, program real BAR
mappings, create mapping objects, or complete hostile stale-MMIO smokes.

Implementation note, 2026-05-02 17:29 UTC: the bounded `DeviceMmio` adapter now
derives the proof mapping from the first decoded PCI memory BAR on the tested
`PciDevice` through the shared BAR-region validator. The attached manager-owned
record carries that BDF/BAR/base/length metadata and validates that it is the
same BDF, a memory BAR, nonzero length, and the same BAR named by the handle
before constructing the pure `capos-lib::device_authority` record. The QEMU
smoke asserts `region_source=pci-decoded-memory-bar`,
`region_bound_to_manager=true`, `bar_present=true`, `bar_memory=true`,
`bar_base`, and `bar_length`. This is still prerequisite evidence only: it does
not create userspace `DeviceMmio` handles, program real MMIO mappings, enforce
cache attributes or write policy, or write a real doorbell.

Implementation note, 2026-05-02 17:54 UTC: the same bounded `DeviceMmio`
adapter now records fail-closed malformed-region evidence before the positive
attach. Wrong-BDF metadata, wrong BAR/handle mismatch, and zero-length region
metadata all report `devicemmio-region-invalid`, with
`region_invalid_mapping=not-created` and
`region_negative_side_effect=side-effect-blocked`; the proof still records
`real_mmio_mapping=not-programmed` and `real_doorbell=not-written`. This is
bounded manager-proof evidence only. It does not create userspace
`DeviceMmio` handles, map real BAR pages, enforce cache attributes or write
policy, complete hostile stale-MMIO smokes, or perform a real doorbell write.

Implementation note, 2026-05-02 20:14 UTC: the bounded `DeviceMmio` adapter now
stores future mapping policy metadata on the attached manager-owned record and
reads it back through the active record plus matching `DeviceMmioHandle`. The
proof line records `policy_scope=manager-attached-devicemmio-cache-write-policy`,
`cache_policy=device-uncacheable`,
`page_table_protection=capability-scoped-device-nx`,
`write_policy=claimed-registers-and-doorbells-only`, `executable=blocked`,
`userspace_devicemmio=not-started`, `host_physical_user_visible=0`,
`policy_bound_to_manager=true`, and `policy_result=ok`. A tampered cache/write
policy record fails closed with `policy_tamper_result=fail-closed`,
`policy_tamper_mapping=not-created`, and
`policy_tamper_side_effect=side-effect-blocked`. This is still metadata proof
only: no PAT/MTRR or page-table programming is performed, no userspace
`DeviceMmio` handle is created, no real BAR mapping object exists, and no
doorbell is written.

Implementation note, 2026-05-02 20:45 UTC: while the same bounded
manager-owned `DeviceMmio` record is still active, the proof now validates
hostile `RingDoorbell` handles through a proof-scoped adapter that uses the
already-attached record rather than manager lookup short-circuits. Wrong owner
generation, wrong mapping generation, wrong mapping id, wrong BAR, and wrong
BDF/device fail closed with exact pure-validator reasons
`stale-owner-generation`, `stale-mapping-generation`, `wrong-mapping`,
`wrong-bar`, and `wrong-device`. Each records `side-effect-blocked`; the proof
also records that the attached manager record is preserved, no fake mapping is
created, and no doorbell is written. This remains bounded proof evidence only:
production userspace handles, real MMIO mappings, real cache/write-policy
enforcement, and hostile stale-MMIO smokes remain open.

Implementation note, 2026-05-03 00:36 UTC: the schema and kernel now include a
result-only `DeviceMmio.info` skeleton that can wrap a manager-issued
`DeviceMmioHandle`. The object validates the live device-manager record
through `validate_devicemmio_record()` before returning status labels such as
`userspaceDeviceMmio=manager-issued-skeleton`,
`managerRecord=validated-active`, `realMmioMapping=not-programmed`,
`realDoorbell=not-written`, `hostPhysicalUserVisible=false`,
`directMmio=blocked`, `registerRead=blocked`, `registerWrite=blocked`, and
`bootstrapGrant=blocked`. The QEMU device-manager lifecycle proof constructs
that cap object while the attached record is active, records
`devicemmio_cap_info_result=ok`, exercises the serialized
`CapObject::call(0, &[])` path and decodes the returned `DeviceMmio.info`
Cap'n Proto result as `devicemmio_cap_serialized_call_result=ok`, then
verifies the same cap fails closed after revoke begins as
`devicemmio_cap_stale_after_revoke_result=devicemmio-stale-handle`; the same
stale object also fails the serialized method-0 path as
`devicemmio_cap_serialized_stale_after_revoke_result=invoke-failed`. A later
manifest-grant smoke explicitly releases the granted `DeviceMmio` cap through
`CAP_OP_RELEASE` and proves a subsequent typed `DeviceMmio.info` call fails
closed from userspace. A focused grant-cycle smoke now repeats that grant,
release, and stale-info proof twice in sequence and asserts the second
manager-grant-source acquire receives a fresh mapping generation after the
first release; the same smoke also decodes both acquire/release cycles through
the typed volatile `HardwareAuditLog.snapshot` surface. The focused
hardware-audit interrupt-waiter smoke also decodes recent boot-time
`DmaBuffer`, `DmaPool`, and `Interrupt` `driver-crash` / `reset-disable`
lifecycle records from the current volatile 16-record snapshot window. The same
smoke now uses the `startSequence` cursor to decode older retained
`DeviceMmio` lifecycle rows that the default latest 16-record tail has skipped.
A `2026-05-10 15:33 UTC` manifest-grant follow-up turns
`DeviceMmio.map` from admission-only into a read-only userspace VMA over the
boot-preseeded BAR page already used by brokered `read32`. The typed smoke
validates the active `DeviceMmioOperation::Map` authority check, rejects
writable, executable, unknown, zero-size, unaligned, out-of-BAR, and overflow
requests with typed no-side-effect results, reads the same QEMU BAR value
through the returned userspace address and `DeviceMmio.read32`, rejects a
duplicate active map, explicitly calls `DeviceMmio.unmap`, proves a second
unmap is a typed no-op, remaps, and proves stale `unmap` fails closed after cap
release. Release/drop/driver-crash/reset-disable cleanup revokes any borrowed
user VMA before detaching the manager record. This is read-only BAR VMA
evidence only: it does not add writable MMIO, volatile register writes,
doorbells, host physical/IOVA exposure, post-userspace kernel MMIO mappings,
IOMMU programming, durable/signed audit persistence, concurrent sharing
semantics, or a production provider-driver consumer. A `2026-05-10 20:06 UTC`
follow-up promotes `DeviceMmio.write32` to one bounded brokered volatile dword
write through that same boot-preseeded kernel MMIO cache after active
manager-attached handle, owner/state, policy/region, pure Write authority,
dword alignment, decoded-BAR range validation, and a single provider-scoped
claim derived from PCI MSI-X metadata, including BDF, BAR, BAR base, offset,
and value. The focused proof writes the claimed virtio-rng MSI-X entry-0
vector-control mask dword,
reads it back through both brokered `read32` and the read-only userspace VMA,
then proves an unclaimed
message-address dword write leaves the original value unchanged. Invalid range
and unclaimed calls remain typed no-write results, while stale or released
handles fail closed before any write and do not return a `write32` result
payload. This does not add writable userspace BAR mappings, arbitrary register
writes, doorbells, host physical/IOVA exposure, post-userspace arbitrary
remaps, IOMMU programming, or a production provider-driver consumer.

A `2026-05-26 07:32 UTC` follow-up
(`ddf-userspace-writable-devicemmio-interrupt`) proves the cross-authority
non-implication required by the gate ("holding one authority must not imply
either of the others"). Each grant smoke's granted `CapSet` is its sole
authority source: the `DeviceMmio` smoke holds only `console` + `device_mmio`
and the `Interrupt` smoke only `console` + `interrupt`. The smokes assert that
the other DDF grants (`interrupt`/`device_mmio` and `dmapool`) are absent from
the `CapSet`, and that the held cap cannot be reinterpreted as another
interface because the kernel-delivered interface id is fixed at grant time
(`negative-authority ... result=ok` lines, with the kernel "spawned ... 2 caps"
structural counterpart). This is a non-implication proof over the
already-landed authorities; it adds no new kernel MMIO/IRQ/DMA surface. Real
userspace `wait`/`acknowledge` over the live route with deferred LAPIC EOI is
proven separately by the provider `tx_interrupt`/`rx_interrupt` consumer
(`make run-ddf-provider-consumer`).

## `DMAPool` Invariants

`DMAPool` is the only future userspace-facing authority that may cause a
device-visible DMA address to exist.

- **Authority:** A holder may allocate buffers only from the pool object it
  was granted. It may not request arbitrary physical frames, import caller
  virtual memory by address, or derive another pool.
- **Handle identity:** A pool operation checks the claimed device id, owner
  generation, pool id, and pool generation before changing pool state. Buffer
  operations additionally check the buffer slot and slot generation before
  descriptor validation, completion accounting, free, scrub, or reuse.
- **Physical range:** Every exported device address must resolve to pages
  owned by the pool. The kernel records the allowed host-physical page set and
  validates every descriptor mapping against that set before a device can use
  it. If an IOMMU domain backs the pool, the exported address is an IOVA, not
  raw host physical memory.
- **Ownership:** Each DMA buffer has one pool owner, one device-domain owner,
  and explicit CPU mappings. Sharing a buffer with another process requires a
  later typed memory-object transfer; copying packet data is the default until
  that object exists.
- **No raw grants:** Userspace never receives an unrestricted host-physical
  address. A driver may receive an opaque DMA handle or an IOVA meaningful
  only to its `DMAPool`/device domain. It cannot turn that value into access
  to unrelated RAM.
- **Residency:** DMA pages are committed before exposure to the device,
  resident for the entire device-visible lifetime, unswappable, and scrubbed
  before reuse by another owner.
- **Bounds:** Buffer length, alignment, segment count, and queue depth are
  bounded by the pool. Descriptor chains that point outside an allocated
  buffer, wrap arithmetic, exceed device limits, or reference freed buffers
  fail closed before doorbell writes.
- **Revocation:** Revoking the pool first quiesces the device path using it,
  prevents new descriptors, waits for or cancels in-flight descriptors, then
  removes IOMMU mappings or invalidates bounce-buffer handles before freeing
  pages.
- **Reset:** If in-flight DMA cannot be proven stopped, revocation escalates
  to device reset through the owning device object before pages are reused.
- **Residual state:** Pages returned from a pool are zeroed or otherwise
  scrubbed before reuse by a different owner. Receive buffers are treated as
  device-written untrusted input until validated by the driver or stack.

Device-visible memory authority is not ordinary `MemoryObject` authority.
`FrameAllocator` and `MemoryObject` must not become raw physical-address escape
hatches. A future shared-buffer transfer may share CPU-visible packet bytes
after validation, but it does not by itself grant IOVA creation, descriptor
programming, or device write authority.

For the in-kernel QEMU smoke, the kernel is the only `DMAPool` holder. The
same invariants apply internally even though no userspace capability object is
exposed yet.

Implementation note, 2026-04-24: the initial virtio-net transport uses
kernel-owned frame-allocator pages for RX/TX split-virtqueue descriptor,
available, and used rings plus the one-shot TX descriptor proof buffer, ARP TX
buffer, ICMP TX buffer, smoltcp adapter/TCP TX buffers, and posted RX packet
buffers. The smoltcp adapter copies completed RX frame bytes out of those
device-written pages before handing them to the stack. Those pages are
programmed into the device only by kernel code after modern PCI transport
discovery and feature negotiation; no userspace process receives a DMA buffer,
physical address, or BAR mapping.

Implementation note, 2026-05-02: the current QEMU virtio-net DMA path routes
those kernel-owned pages through a bounded `device_dma` pool ledger. The net
smoke proves live pool bytes, page counts, page-rounded MMIO mapping bytes,
config/RX/TX interrupt holds, RX/TX ring depths, and RX/TX
submission/completion and in-flight descriptor accounting. This is the first
kernel-owned `DMAPool` accounting proof; it does not expose userspace DMA,
MMIO, or interrupt handles and does not complete the production S.11.2
hostile-smoke gate.

Implementation note, 2026-05-02 06:59 UTC: the kernel-owned `device_dma`
ledger now has an explicit bounded budget/OOM policy for the current
virtio-net proof path: 32 DMA pages, 131072 DMA bytes, queue depth 8,
submission depth 8, four page-rounded MMIO mapping holds, 16384 MMIO mapped
bytes, and three interrupt holds. `make run-net` emits a scratch-ledger
`device-dma: budget oom proof ... proof_result=ok` line proving page and byte
allocation over budget, overlarge queue depth, duplicate and over-budget MMIO
holds, MMIO byte over budget, duplicate and over-budget interrupt holds, and
descriptor submission beyond queue depth all fail closed without mutating the
live virtio-net ledger; the proof also revalidates the live ledger. This is a
bounded prerequisite for the production `DMAPool` contract. It still does not
expose userspace `DMAPool`, `DeviceMmio`, or `Interrupt` handles, wire real
lifecycle hooks, program IOMMU remapping domains, or close the S.11.2 hostile
smoke matrix.

Implementation note, 2026-05-02 16:31 UTC: attached device-manager `DMAPool`
records now also carry that budget profile. The lifecycle and imported
live-accounting proofs read page, byte, queue-depth, submission-depth, MMIO
mapping, MMIO byte, and interrupt-hold budgets through the active
device-manager record plus the matching `DmaPoolHandle`. Queue and submission
depth remain per-queue limits; the manager proof records `queue_count=2` plus
derived aggregate in-flight/submission budgets and checks imported virtio-net
accounting against those aggregate totals. This keeps the budget policy tied
to manager state but still does not create userspace `DMAPool` handles or
enforce budgets at userspace handle creation, transfer, or revoke.

Implementation note, 2026-05-02 21:13 UTC: the zero-live device-manager
`DMAPool` lifecycle proof now validates a proof-scoped tampered
`AttachedDmaPoolBudgetPolicyRecord` through the manager budget-policy helper
while the attached pool record is active. The tampered record uses the wrong
policy scope/source/label plus stricter page, byte, queue, in-flight, MMIO,
interrupt, and submission budgets, and it fails closed before it can be
treated as a usable policy. The QEMU proof records
`budget_policy_tamper_result=fail-closed`,
`budget_policy_tamper_allocation=not-created`,
`budget_policy_tamper_ledger=not-mutated`,
`budget_policy_tamper_teardown=not-advanced`, and
`budget_policy_tamper_side_effect=side-effect-blocked`. This is still
bounded metadata proof only: no userspace `DMAPool` handle is exposed, no
production userspace DMA page is allocated, freed, or reused, and no real DMA
teardown is claimed.

Implementation note, 2026-05-02 22:16 UTC: the manager-owned `DMAPool`
budget-accounting proof now fails closed on accounting over the attached
budget instead of only logging passive booleans. The positive zero-live and
imported-live `budget_*_within_policy=true` labels call a helper that
revalidates the active attached record, matching `DmaPoolHandle`, owner, active
state, and attached budget policy before accepting the record accounting.
While the zero-live record remains active, synthetic attached-accounting
candidates exceed buffer count, page count, byte count, and the current
aliased in-flight/submission total; each candidate fails closed before it can
be treated as usable manager state. The QEMU proof records exact overrun
reasons, no fake allocation, no ledger mutation, no teardown advancement, and
side-effect blocking. A proof-scoped over-budget attach candidate now fails
before pool generation allocation and records preserved generation state. This
remains bounded manager-record evidence only; later grant slices add the first
single-page bounce-buffer allocation/free authority, but multi-buffer
allocation, DMA mapping, descriptor execution, IOMMU programming, production
driver consumption, and S.11.2 hostile smokes remain open.

Implementation note, 2026-05-02 23:59 UTC, updated 2026-05-03: the schema and
kernel now include a result-only `DMAPool.info` skeleton that can wrap a
manager-issued `DmaPoolHandle`. The object validates the live device-manager
record through `validate_dmapool_record()` before returning status labels such as
`userspaceDmaPool=manager-issued-bounce-buffer`, `realDma=not-attempted`,
`hostPhysicalUserVisible=false`, and `directDma=blocked`. The QEMU zero-live
device-manager lifecycle proof constructs that cap object while the attached
record is active, records `dmapool_cap_info_result=ok`, exercises the
serialized `CapObject::call(0, &[])` path and decodes the returned
`DMAPool.info` Cap'n Proto result as
`dmapool_cap_serialized_call_result=ok`, then verifies the same cap fails
closed after revoke begins as
`dmapool_cap_stale_after_revoke_result=dmapool-stale-handle`; the same stale
object also fails the serialized method-0 path as
`dmapool_cap_serialized_stale_after_revoke_result=invoke-failed`. The same
proof now exercises `DMAPool.allocateBuffer` through `call_with_table()` on a
real `DmaPoolCap` in a `CapTable`: it decodes `bufferIndex=0`, verifies
`CAP_CQE_TRANSFER_RESULT_CAPS`, `cap_count=1`, the transfer-result record's
`DMABUFFER_INTERFACE_ID`, the non-transferable same-session result-cap hold,
and `DMABuffer.info` through the returned result cap. Duplicate allocation and
stale-after-revoke allocation both fail closed without adding result caps. The
duplicate-active valid-size path now reports a structured schema result with
`result=dmapool-already-attached`, `reason=active-buffer-attached`,
`sideEffect=side-effect-blocked`, and `bufferPresent=false`; the duplicate path
also preserves the manager generation counter. Invalid-size requests use the
same no-result-cap response shape with
`result=dmapool-allocation-request-invalid`, the exact request reason,
`sideEffect=side-effect-blocked`, and `bufferPresent=false`.
As of the 2026-05-08 `DMAPool` grant-source follow-up, the same bounded path is
also available through `KernelCapSource::DmaPool`: the grant source attaches a
fresh zero-live manager-owned pool record, stages matching zero-live release
evidence, and lets the child mint one `DMABuffer` result cap. A 2026-05-09
release-order follow-up has the smoke explicitly release the parent `DMAPool`
before the result `DMABuffer`; the parent detach remains pending until the
`DMABuffer` frees the page and completes the staged zero-live pool detach. A
later 2026-05-09 follow-up adds a typed `DMABuffer.freeBuffer` method for that
bounded result cap: the method reuses the same FreeBuffer authority validation
and page scrub/ledger/frame-free cleanup path as cap release, emits a
`free-buffer` audit event, invalidates later `DMABuffer.info`, and makes the
later cap release a no-op detach. A second bounded follow-up keeps the parent
`DMAPool` live after that first explicit free, reallocates the same slot with a
fresh `slotGeneration`, and then repeats the parent-first release proof on the
second buffer. The focused read-side `HardwareAuditLog.snapshot` smoke also
decodes both slot generations, both typed `free-buffer` records, the parent
`DmaPool` release, and both no-op release-after-free records through the
volatile audit cap. The `run-net`
`DMABuffer`
driver-crash and reset-disable proofs also cover the same pending-parent
completion path so successful buffer cleanup cannot orphan the staged parent
release. A later admission follow-up adds typed
`DMABuffer.submitDescriptor` to the same manifest-granted bounce-buffer path:
the method validates the active manager-attached buffer epoch through the
existing `DmaBufferOperation::SubmitDescriptor` authority validator, echoes
the queue/descriptor/length and generation identity, and proves the same call
fails closed after `freeBuffer` revokes the old cap. A later symmetric
follow-up adds typed `DMABuffer.completeDescriptor` to the same bounded path.
The 2026-05-10 request-shaping follow-up routes both typed descriptor calls
through a shared pure bounded descriptor validator: valid bounce-buffer
requests return `ok` request labels plus `queue_count=4`,
`descriptor_count=8`, and `buffer_bytes=4096`, while out-of-range
queues/descriptors, zero submit lengths, submit lengths beyond the bounce
buffer, and completion lengths beyond the bounce buffer fail closed as
`dmabuffer-descriptor-request-invalid` with `side-effect-blocked` before any
descriptor side effect. A later manager-accounting follow-up records only the
bounded manager counter: submit returns `manager-inflight-recorded` and
raises `DMAPool.info` `live_inflight` to `1`, completion returns
`manager-inflight-completed` and restores it to `0`, and a valid completion
with no outstanding submission returns `dmabuffer-no-inflight-submission`.
Too-small descriptor result buffers are rejected before accounting mutation,
and cap-table release drains bounded in-flight accounting before detaching the
bounce buffer. The 2026-05-10 06:37 UTC follow-up makes this
`allocateBuffer`/`freeBuffer` page lifecycle the first production-labeled
single-page bounce-buffer allocation/free authority. The typed surfaces report
`userspaceDmaPool=manager-issued-bounce-buffer`,
`allocation=single-bounce-buffer-page`,
`recordPool=userspace-bounce-buffer-live`,
`zero-live-dmapool-bounce-buffer`, and `freeBuffer=bounce-buffer-page`; the
underlying `device_dma` ledger uses a manager-attached bounce-buffer helper and
scrubs before frame free. A 2026-05-10 11:44 UTC follow-up extends that bounded
manifest-granted path to two fixed manager-owned slots; a 2026-05-10
12:49 UTC follow-up extends the same path to three fixed manager-owned slots:
slot 0, slot 1, and slot 2 can be live together, `DMAPool.info` reports three
live buffers/pages while all are attached, a fourth allocation fails closed as
`dmapool-already-attached`, and slot 0 can be freed and reused with a fresh
generation while slots 1 and 2 remain live. There is still no allocation
beyond those three fixed slots, real device-visible DMA mapping, host physical
address or IOVA exposure, BAR mapping, production descriptor-ring mutation, CQ
publication, IOMMU programming, or production driver consumer.
Stale allocation attempts preserve the live backing, and page allocation
failure occurs before buffer-generation allocation so it does not burn a
generation.
The 2026-05-10 13:45 UTC follow-up (`3bbeb3d4`) adds explicit typed
`DMABuffer.unmap` for the mapped bounce-buffer userspace VMA. The method
validates the live `DMABuffer` handle before reporting success or no-op,
removes only the borrowed VMA owned by that mapping for the calling process,
and publishes the mapping as absent only after the borrowed-range ownership
check, page-table unmap, TLB wait, and waiter cleanup succeed. While teardown
is in progress, concurrent map/free/release paths fail closed against an
in-progress mapping state. A second unmap returns `dmabuffer-mapping-absent` /
`no-user-mapping` with no side effect. This is userspace VMA cleanup only: it
does not free or scrub the bounce page, detach the buffer record, change
`DMAPool.info` live buffer/page/in-flight counts, program or remove real DMA or
IOMMU mappings, expose host physical/IOVA addresses, mutate descriptor rings,
publish CQ entries, or add a production driver consumer.
The 2026-05-10 14:12 UTC follow-up moves bounded descriptor accounting from a
single pool-global descriptor identity to per-slot state on each live
manager-owned `DMABuffer` record. `DMAPool.info live_inflight` remains the
aggregate sum across live slots. A valid submit on slot 0 and a valid submit
on slot 1 can coexist; duplicate submit on either same slot still fails
closed; mismatched completion preserves that slot's descriptor without
touching other slots; matching completion of slot 0 decrements the aggregate
while slot 1 remains in flight; explicit `freeBuffer` of an in-flight slot
fails closed; and cap-release/process-exit cleanup drains only the releasing
slot's descriptor before detach. This is still bounded manager accounting and
does not mutate descriptor rings, publish CQ entries, expose host physical or
IOVA addresses, attempt direct DMA, program IOMMU state, or add a production
driver consumer.
The 2026-05-10 18:11 UTC follow-up makes the single manager-owned
bounce-buffer page exclusive between userspace borrowed-VMA ownership and
manager in-flight descriptor ownership for each `DMABuffer` cap. A valid
submit while the same cap still has a live mapping fails closed as
`dmabuffer-mapping-live` / `user-mapping-live` before manager in-flight
accounting changes; explicit unmap restores submit. A valid map while the
slot has an in-flight descriptor fails closed as `dmabuffer-inflight-submission`
/ `in-flight-submission`, returns `addr=0`, and does not publish a borrowed
VMA; matching completion restores map for that slot. The lock order remains
cap mapping state before manager validation, with no address-space lock held
across manager state mutation.
The 2026-05-10 19:29 UTC follow-up adds bounded completion data on that same
manager-owned bounce-buffer page. The successful matching
`DMABuffer.completeDescriptor` path keeps the existing
`manager-inflight-completed` result labels, validates the active owner,
pool/slot generation, queue/descriptor identity, and submitted length, then
writes a deterministic byte pattern into only the accepted `completionLength`
bytes before clearing the in-flight record. Because submit is blocked while a
cap-owned borrowed VMA is live and map is blocked while the slot is in flight,
the write happens while no live user VMA exists for that slot; a later
successful map lets userspace read the pattern. Invalid requests, stale caps,
no-inflight completions, descriptor mismatches, length-exceeded completions,
mapped-live cases, and after-free calls do not write. This is
userspace-visible bounce-buffer completion data, not device DMA completion:
there is still no descriptor-ring mutation, CQ publication, direct DMA,
host physical/IOVA exposure, IOMMU programming, or production driver consumer.
Implementation note, 2026-05-10 04:40 UTC: duplicate-active valid-size and
invalid-size `DMAPool.allocateBuffer` requests now use schema result data for
domain rejection instead of an application-exception label. The no-result-cap
response reports either `result=dmapool-already-attached` /
`reason=active-buffer-attached` or
`result=dmapool-allocation-request-invalid` with the exact request reason, plus
`sideEffect=side-effect-blocked` and `bufferPresent=false`, before any
allocation side effect.
The 2026-05-10 live-accounting follow-up also carries that bounded frame into
the attached `DMAPool` record exposed by typed `DMAPool.info`: the manager
record starts as `zero-live-dmapool-bounce-buffer`, changes to
`userspace-bounce-buffer-live` with one, two, or three live 4096-byte pages while
manager-attached `DMABuffer` result caps exist, and returns to zero-live after
typed `DMABuffer.freeBuffer` or cap release scrubs/releases every live bounce
page. The same lifecycle proof validates that active manager-record accounting
against the attached budget policy before treating allocation as usable state.
This still does not expose a device-visible DMA address, IOVA, host physical
address, production descriptor side effect, DMA mapping, or production driver
consumer.

Implementation note, 2026-05-11 03:00 UTC: the manifest-granted
manager-owned fixed bounce-buffer `DMAPool` path now has its own
device-manager budget policy instead of importing the live virtio-net
`device_dma` policy. The policy covers three live buffers/pages, 12288 bytes,
four queues, eight descriptors per queue, one in-flight descriptor per live
slot, zero MMIO mappings/bytes, and zero interrupt holds.
`DMAPool.allocateBuffer` validates the next-live manager accounting against
that policy before slot selection, frame allocation, generation allocation,
result-cap minting, or manager ledger mutation. With all three fixed slots
live, a fourth valid-size allocation returns no result cap and reports
`result=dmapool-budget-exceeded`, `reason=over-buffer-budget`,
`sideEffect=side-effect-blocked`, and `bufferPresent=false`. Imported live
virtio-net proof records continue to use the `device_dma:virtio-net` budget
policy. This remains the bounded manager-owned bounce-buffer path: direct DMA
is blocked, host physical addresses and IOVAs stay hidden, descriptor rings
and completion queues are not mutated, and IOMMU/remapping plus production
driver consumption remain out of scope.

Implementation note, 2026-05-11 06:10 UTC: the same fixed-slot
manager-owned `DMAPool` family now revalidates current budget accounting
before publishing an allocated `DMABuffer` result cap, before acquire-audit
publication, before parent `DMAPool` release intent or detach, before
grant-rollback/drop/teardown detach, before pending-parent release completion,
before `DMABuffer` page release, and before descriptor-completion state
advancement. The focused grant smoke labels the full-pool budget rejection as
no leaked result cap, no generation burn, no ledger mutation, and no stale
authority publication. `DeviceMmio` and `Interrupt` budget propagation is not
changed by this slice.

Implementation note, 2026-05-03 02:05 UTC: the schema and kernel now include a
result-only `DMABuffer.info` skeleton that can wrap the manager-attached
`DmaBufferHandle` record already issued inside the zero-live `DMAPool`
lifecycle proof. The object validates the live manager-owned buffer record
through `validate_dmabuffer_record()` and the existing pure DMA-buffer
validator before returning status labels such as
`userspaceDmaBuffer=manager-issued-bounce-buffer`,
`managerRecord=validated-active`, `bufferRecord=manager-attached-buffer`,
`realDma=not-attempted`, `hostPhysicalUserVisible=false`,
`directDma=blocked`, `descriptorSubmit=manager-inflight-accounting`,
`descriptorComplete=manager-inflight-accounting`, `freeBuffer=bounce-buffer-page`, and
`bootstrapGrant=blocked`. The QEMU zero-live lifecycle proof constructs that
cap object while the buffer record is active, records
`dmabuffer_cap_info_result=ok`, exercises the serialized
`CapObject::call(0, &[])` path and decodes the returned `DmaBuffer.info`
Cap'n Proto result as `dmabuffer_cap_serialized_call_result=ok`, then verifies
the same cap fails closed after revoke begins as
`dmabuffer_cap_stale_after_revoke_result=dmabuffer-stale-handle`; the same
stale object also fails the serialized method-0 path as
`dmabuffer_cap_serialized_stale_after_revoke_result=invoke-failed`. Later
bounded grant slices expose this result cap through the `DMAPool`
manifest-grant path, add typed `DMABuffer.freeBuffer`, and add bounded
userspace bounce-buffer `DMABuffer.map` / `DMABuffer.unmap` plus
manager-accounted request-shaped `DMABuffer.submitDescriptor` /
`DMABuffer.completeDescriptor`; the path still exposes no real DMA,
descriptor-ring mutation, CQ publication, host physical address or IOVA export,
production page cleanup/reuse, or production userspace `DMABuffer` completion.

Implementation note, 2026-05-11 11:22 UTC: the provider-consumer smoke now
uses that same bounded bounce-buffer path to prove a
descriptor-ring-equivalent provider side effect. After
`DMABuffer.submitDescriptor` validates live owner/pool/slot authority,
descriptor queue/id/length bounds, no live user mapping, and no duplicate
in-flight descriptor, the manager scrubs the page and writes a provider-visible
shadow descriptor entry with magic, queue, descriptor id, submitted length, and
flags before writing the existing submit marker and committing the in-flight
record. `DMABuffer.completeDescriptor` still writes completion bytes only
inside the validated completion length, and the smoke proves the shadow entry
and marker remain visible outside that completion window. Provider-effect
submits shorter than the 24-byte shadow-descriptor-plus-marker footprint are
rejected as a typed no-side-effect boundary, even though the shared descriptor
request shape is otherwise valid. This is a bounded bounce-buffer side effect
only; no hardware descriptor ring, CQ publication, MMIO doorbell, direct DMA,
IOVA, host physical address, or remapping-domain claim is added.

Implementation note, 2026-05-11 12:01 UTC: the same submit path now replaces
the shadow-descriptor payload with a selected provider-owned queue entry plus
marker. The entry records queue magic, queue id, tail, descriptor id,
submitted length, and flags after descriptor authority validation and the
submit scrub; `make run-ddf-provider-consumer` maps the buffer after
completion and proves the queue entry and marker remain visible outside the
completion window. Provider-effect submits shorter than the current 72-byte
provider queue-entry-plus-marker footprint are rejected before in-flight
accounting or provider-visible mutation. This remains bounded bounce-buffer
evidence only:
no hardware descriptor ring, CQ publication, MMIO doorbell, direct DMA, IOVA,
host physical address, or remapping-domain claim is added.

Implementation note, 2026-05-12 20:30 UTC: the accepted
`DMABuffer.submitDescriptor` path now constructs the candidate in-flight
descriptor in manager-local state and validates the resulting
`DMAPool` budget/accounting before scrubbing or writing provider-visible
bounce-buffer bytes. The provider-consumer smoke snapshots the selected
provider queue-entry and marker bytes, drives a short provider-effect submit
rejection, and proves both those bytes and `live_inflight=0` are preserved.
The selected virtio-net TX publication gate remains separately bounded after
provider-entry write: quiesced publication still fails with no extra pin and
no doorbell, not with rollback of the already-written shadow entry.

Implementation note, 2026-05-11 14:39 UTC, branch commit `f04a14f4`: the
selected provider-owned queue entry now carries a staged claimed virtio-net
notify-offset admission record instead of only a "requires claim" gate. The
selected queue `1` path records accepted notify-offset admission plus blocked
wrong-queue and wrong-offset admissions after descriptor authority validation
and submit scrub; a separate queue `0` submit proves non-selected queues remain
neutral and blocked from selected-backend doorbell metadata. This is not a
real doorbell path: no virtio-net notify BAR handle is granted, no notify
register is written, no real virtio-net descriptor ring is mutated, and
production userspace NIC readiness is not claimed.

Current QEMU evidence: the same `make run-net` path now
exports a bounded live-pool snapshot from the kernel-owned virtio-net
`device_dma` ledger and feeds it into a device-manager `DMAPool` record proof.
The live record carries buffer/page count, live bytes, current in-flight
submissions, committed/resident/unswappable flags, and scrub-before-release
policy. The device-manager proof rejects both `DmaMappingsRemoved` and teardown
detach while that authoritative ledger snapshot remains live. The proof now
calls the `device_dma` teardown-evidence API and records the expected
`authoritative-ledger-live` block with matching imported live accounting, then
reports completion as deferred because no authoritative zero-live/scrubbed
evidence is available for the live virtio-net ledger. It does not zero the
imported record to simulate teardown, does not claim `DmaMappingsRemoved`,
terminal `Dead`, or release for the live virtio-net pages, and does not scrub
or free live virtio-net DMA pages. This is still a prerequisite
record-accounting proof:
the current pages remain kernel-owned, only bounded info-skeleton hardware cap
grants are exposed to userspace, production page release hooks for live
virtio-net DMA are not wired, IOMMU remapping is not
programmed, and S.11.2 hostile smokes remain open.

The same smoke emits a separate `device_dma` scratch proof for the positive
zero-live teardown-evidence path: `teardown_evidence()` fails closed before
quiesce and scrub markers, rejects one-marker states, and reports
`authoritative-ledger-zero-live` only after both markers are set. The scratch
proof revalidates the live virtio-net ledger but does not mutate, zero, scrub,
free, or claim teardown completion for real virtio-net DMA pages.

Implementation note, 2026-05-03 03:21 UTC: the zero-live device-manager
`DMAPool` lifecycle proof now consumes that scratch zero-live evidence before
final pool detach, and binds that evidence to the attached record source. The
manager-owned zero-live record is labeled `device-manager` /
`zero-live-dmapool-bounce-buffer`; imported live records keep the source labels from
the authoritative `device_dma` snapshot. After the proof-scoped buffer record
is actively freed and detached, a zero-live pool detach with mismatched
`virtio-net` / `kernel` scratch evidence fails closed as
`dmapool-zero-live-evidence-invalid`, and detach without authoritative
evidence fails closed as `dmapool-zero-live-evidence-absent`. Only scratch
`authoritative-ledger-zero-live` evidence carrying the same record source plus
both quiesce and scrub markers allows the manager-owned record to detach and
the revocation path to advance to `DmaMappingsRemoved`, `Dead`, and release.
This remains scratch/no-real-DMA evidence: it does not tear down live
virtio-net pages, program or remove IOMMU mappings, expose userspace
`DMAPool` mapping/descriptor authority, or claim production page cleanup
beyond the bounded manager-attached bounce page.

The current scratch proof set also covers stale DMA page handles without
touching real virtio-net pages: reusing the same synthetic physical page bumps
the DMA page generation, the old handle fails closed as `stale-dma-handle`,
wrong-queue and wrong-label frees preserve the active page record, and duplicate
free remains rejected. Production userspace `DMAPool` stale-handle smokes,
descriptor-abuse coverage, revoke/reset races, and real quiesce/scrub/release
remain open.

Implementation note, 2026-05-02 23:23 UTC: the kernel-owned virtio-net
`device_dma` page release path now validates the `DeviceDmaAllocation` against
the live ledger before any scrub or frame-allocator call, scrubs the frame
through the HHDM mapping, removes the ledger entry only after scrub succeeds,
and then returns the frame to the allocator. The frame scrub helper checks
frame alignment, HHDM/allocator initialization, range, and allocated state
before zeroing the page. `make run-net` emits a bounded
`device-dma: release scrub proof` line using a proof-only kernel-owned page:
stale generation, wrong queue, and wrong label release attempts fail before
scrub, frame free, or ledger mutation; the active release path records
`scrub_before_frame_free=true` and `ledger_removed_after_scrub=true`. This is
still no-userspace-handle/no-real-teardown evidence for the current
kernel-owned virtio-net path only; production `DMAPool` handles, real
device-manager teardown hooks, IOMMU/bounce-buffer mapping removal, hostile
stale-DMA smokes, and full page lifecycle cleanup remain open.

Implementation note, 2026-05-02 12:28 UTC: the same scratch proof family now
covers stale DMA completion ordering without touching real virtio-net pages. A
synthetic reused DMA page slot bumps generation, stale completion validation
checks the page generation before queue-completion accounting, and the old
completion fails closed as `stale-dma-handle` before completion counters can
underflow or the reused page/submission state can mutate. This is still
prerequisite evidence only: production userspace `DMAPool` hostile smokes,
reset/revoke races with outstanding descriptors, CQ notification publication,
real quiesce/scrub/release, and IOMMU or bounce-buffer teardown remain open.

Implementation note, 2026-05-02 14:03 UTC: that scratch stale-DMA-completion
proof now adapts the synthetic DMA buffer slot into
`capos-lib::device_authority` before completion accounting can mutate. The
QEMU line records current-handle validation as `ok`, stale same-slot reuse as
`stale-slot-generation`, and `side-effect-blocked`, then preserves the
existing `stale-dma-handle` completion outcome. This remains a
scratch/no-real-DMA validator adapter, not production userspace `DMAPool`
authority or S.11.2 hostile-smoke completion.

Implementation note, 2026-05-02 15:15 UTC: the same scratch proof family now
adds a paired stale-completion publication check. A synthetic reset bumps the
device owner generation so an old completion fails as
`stale-owner-generation` with `side-effect-blocked` before any CQ publication.
The same-slot reuse path then fails the old completion as `stale-dma-handle`,
preserves the new submission accounting, and records both
`cq_publication_blocked=true` and `new_owner_exposure_blocked=true`. This is
still scratch/no-real-DMA evidence; production userspace `DMAPool` completion
notification, real hardware stale-completion injection, reset/revoke races,
and IOMMU or bounce-buffer teardown remain open.

Implementation note, 2026-05-02 12:46 UTC: the device-manager interrupt
handoff proof now includes a bounded stale IRQ after-detach check. After an
attached interrupt route is detached, the proof delivers the old LAPIC vector
through the dispatch path and requires `stale_irq_delivery_after_detach` to be
`unregistered` with `stale_irq_wake_blocked=true`; the old route handle also
continues to fail as `interrupt-stale-route`. This is prerequisite evidence
for route teardown ordering only. It does not cover a pending hardware IRQ
across reset, userspace `Interrupt` waiter wakeup semantics, or reassignment
reuse by a new owner.

Implementation note, 2026-05-02 14:24 UTC: the same device-manager interrupt
handoff proof now adapts the attached source into
`capos-lib::device_authority` before exercising the active wait path. After
revoke begins, the old handle fails the pure validator as
`stale-owner-generation` with `side-effect-blocked`, then the proof preserves
the existing `interrupt-stale-route`, detached-vector `unregistered`, and
`stale_irq_wake_blocked=true` checks. This remains a proof adapter, not
production userspace `Interrupt` handles, real waiters, reset/reassignment
reuse proof, or S.11.2 hostile-smoke completion.

Implementation note, 2026-05-02 14:51 UTC: the interrupt handoff smoke now
adds two bounded stale IRQ ordering points. After revoke begins and while the
route is still attached, delivery to the old vector remains `masked`, matches
the attached route generation, reports wake blocking, and leaves the old route
delivery count unchanged. During the reset phase, a synthetic route-registry
same-vector reuse proof re-registers and claims the route with bumped source
and route generations, then shows delivery to that vector is still `masked`,
matches the new route generation, and leaves the reused route delivery count
unchanged.
This is route-manager prerequisite evidence only: it is not a true pending
hardware MSI/reset hostile smoke, does not involve userspace `Interrupt`
waiters, and does not prove DMA buffer reuse race closure.

Implementation note, 2026-05-03 18:48 UTC: the interrupt handoff smoke now
snapshots a bounded pending IRQ token from the old vector, source id, source
generation, and route generation before revocation. Checking that token after
revoke blocks as `stale-pending-irq-masked` with reason `route-masked`; after
detach it blocks as `stale-pending-irq-unregistered`; and after reset/reuse it
blocks as `stale-pending-irq-generation` with reason
`source-route-generation-mismatch`. Each check records
`side-effect-blocked`, wake blocking, and unchanged delivery counts, and the
reset/reuse check records that the new route did not receive a delivery count.
The same proof rejects a malformed pending token with a zero generation as
`stale-pending-irq-invalid-state` before any delivery-count mutation.
This was bounded token-generation evidence only and did not inject a real
pending MSI across reset; the real-`int $vector` injection added at
`2026-05-05 18:17 UTC` (see below) closes the S.11.2.7 stale IRQ
hostile-smoke gate row by exercising the production CPU exception entry path
across the same revoke, detach, and reset/reuse boundaries.

Implementation note, 2026-05-09 18:12 UTC: the pending IRQ token decision path
now has a pure `capos-lib::device_authority` validator. Host tests cover
current-route acceptance and the same fail-closed label space used by the
kernel/QEMU proof: stale source generation, stale route generation, both
generations changed after reuse, source mismatch, route masked, route
unregistered, invalid route state, invalid owner, malformed zero-generation or
unassigned source identity, and unsupported vector. The kernel still snapshots
the live dispatch slot and delivery count, but delegates the pending-token
identity/state decision to that shared helper before returning
`stale-pending-irq-*` labels. This is validator/adapter coverage only; it does
not expose production userspace `Interrupt` waiters or wait/ack/mask/unmask
authority.

Implementation note, 2026-05-05 18:17 UTC: the interrupt handoff smoke now
fires a real `INT $vector` instruction at the device MSI vector at three
points across the revoke/reset/reuse boundary, exercising the production
IDT entry, `extern "x86-interrupt"` stub, `record_lapic_delivery` dispatch
slot read, and LAPIC EOI write rather than the helper-call path the prior
proofs used. The new proof scope strings drop "no-real-msi" and read
`stale-vector-after-detach-real-int-vector-injected`,
`manager-attached-claimed-masked-after-revoke-real-int-vector-injected`,
`route-registry-vector-reuse-during-reset-real-int-vector-injected`, and
`bounded-pending-irq-token-generation-real-int-vector-cross-reset-injection`.
Each injection point requires the slot's delivery count to remain zero
before and after the real INT, and the post-INT outcome to match
`masked` (after revoke, route still attached but masked), `unregistered`
(after detach, slot cleared), and `masked` (after reset/reuse, slot now
belongs to a freshly registered+claimed route with bumped source and route
generations). The proof emits a closure summary
`s11_2_7_proof_scope=s11-2-7-stale-irq-after-reset-real-int-vector-cross-reset-injection-no-userspace-waiter`,
`s11_2_7_real_irq_injected_across_reset=ok`,
`s11_2_7_old_waiter_cannot_wake_new_owner=true`, and
`s11_2_7_stale_ack_blocked=true`. This closes the S.11.2.7 row of the
hostile-smoke acceptance matrix at `make run-net` (which invokes
`tools/qemu-net-smoke.sh`). It does not yet create a userspace `Interrupt`
waiter object; the in-flight delivery is observed via the kernel-owned
dispatch slot atomic state machine that the production path consumes.
S.11.2.9 hostile-smoke gate-wiring closed `2026-05-05 20:49 UTC` (see
the implementation note below).

Implementation note, 2026-05-05 19:37 UTC: the device-manager hostile-smoke
suite now closes the S.11.2.8 stale-DMA-completion-after-reset row.
`prove_qemu_stale_dma_completion_handoff` claims a fresh probe-then-driver
record on the virtio-net PCI BDF (separate from the live virtio-net driver
state) and walks it through the same revoke, detach, and reset/reuse
boundaries S.11.2.7 uses. At each boundary the proof allocates a real
virtio-net DMA page through the production
`device_dma::allocate_virtio_net_page` helper, frees the page through
`device_dma::free_virtio_net_page`, reallocates so the live ledger's page
generation advances, and synthesizes a stale `DeviceDmaAllocation` keyed to
the live phys with a decremented generation. The synthesized stale handle
is then fed to the production
`device_dma::record_virtio_net_completion_for_allocation` path -- the same
function the live virtio-net `Virtqueue::record_used_completion_for_allocation`
invokes after descriptor tracking validates a hardware used-ring entry.
The production validator rejects each stale injection as `stale-dma-handle`
before any queue accounting decrement, completion side effect, CQ
publication, or new-owner memory exposure. The bounded `run-net` proof
records `real_completion_inject_after_revoke_result=stale-dma-handle`,
`real_completion_inject_after_detach_result=stale-dma-handle`,
`real_completion_inject_after_reset_reuse_result=stale-dma-handle`, all
three with `side-effect-blocked`, `queue_account_preserved=true`,
`live_page_preserved=true`, `cq_publication_blocked=true`,
`new_owner_exposure_blocked=true`, `freed_buffer_unchanged=true`, and
`generation_bumped=true`, plus a closure summary
`s11_2_8_proof_scope=s11-2-8-stale-dma-completion-after-reset-real-free-realloc-cross-revoke-detach-reset-reuse-no-userspace-dmapool`,
`s11_2_8_real_completion_injected_across_reset=ok`,
`s11_2_8_old_completion_cannot_publish_to_new_owner=true`,
`s11_2_8_freed_buffer_reuse_blocked=true`, and
`s11_2_8_accounting_underflow_blocked=true`. The new shape is enforced in
`tools/qemu-net-smoke.sh` and runs from `make run-net`. This is the
production paired stale-DMA-completion proof showing old completions cannot
publish stale CQ notifications or expose new-owner memory after real
revoke, detach, and reset/reuse boundaries with real free + realloc page
generation advances on the live kernel-owned ledger; S.11.2.9 hostile-smoke
gate-wiring closed `2026-05-05 20:49 UTC` (see the implementation note
below). Userspace `DMAPool` handles and real device-manager page
quiesce/scrub/release hooks remain open as separate follow-ups.

Implementation note, 2026-05-05 20:49 UTC: the S.11.2.9 hostile-smoke
coverage row of the acceptance matrix is closed by aggregating every
matrix-row proof line into the `make run-net` -> `tools/qemu-net-smoke.sh`
gate. Every proof line referenced by the matrix has at least one
assertion in the harness today; the assertion shape varies by row.
The two driver-crash lines wired by that gate slice, the existing
S.11.2.8 `device-manager: dma completion handoff proof` closure-summary
line, and the S.11.2.7 `device-manager: interrupt handoff proof`
closure-summary line (whose trailing anchor was added by this slice
for harness-strictness consistency with S.11.2.8) all use anchored
extended-regex assertions (field-by-field match plus
`proof_result=ok[[:cntrl:]]?$` trailing anchor); other matrix-row rows
reuse the harness's pre-existing mix of unanchored extended-regex and
fixed-string `grep -Fq` assertions on the emitted proof lines. The two previously unasserted lines wired by this
slice are `device-manager: devicemmio driver crash hook proof
source=devicemmio-driver-crash-hook ...
trigger_path=trigger-driver-crash-for-devicemmio` and
`device-manager: interrupt driver crash hook proof
source=interrupt-driver-crash-hook ...
trigger_path=trigger-driver-crash-for-interrupt`. Both proofs were
already emitted by the kernel on every boot (via
`prove_qemu_devicemmio_driver_crash_hook` and
`prove_qemu_interrupt_driver_crash_hook` in `kernel/src/device_manager/proofs.rs`)
and exercise the explicit driver-crash teardown trigger path with a stale
rerun noop, validate-live `revoked` cap state, and `cap_release_after_crash`
as `noop`. The chosen gate strategy keeps S.11.2.9 inside `make run-net`
rather than splitting into a separate `make run-hostile-smokes` target,
because all six matrix rows depend on the same virtio-net device bring-up
state (probe-then-driver records on the virtio-net BDF, real IDT vector
injection, real DMA page free + reallocate). A separate target would
duplicate the bring-up cost without adding coverage. Tightening the
remaining unanchored assertions to the same anchored shape is a
follow-up harness-hardening task and is not part of S.11.2.9 closure because
each affected proof line is still uniquely identified by its
emitted prefix and the asserted field set. Production userspace
`DMAPool`/`DeviceMmio`/`Interrupt` handles, real device-manager page
quiesce/scrub/release hooks, hardware-backed provider-driver `Interrupt`
wait/ack dispatch beyond the current bounded route-dispatch waiter proof,
durable/signed production audit consumption beyond the first volatile
`HardwareAuditLog.snapshot` cap, and IOMMU domain programming all remain open
as separate follow-ups tracked in
`docs/backlog/hardware-boot-storage.md` and the `docs/tasks/README.md`
userspace-driver-transition bullet.

Implementation note, 2026-05-08 09:44 UTC: the same `make run-net` gate now also
asserts cap-specific DMA driver-crash proofs. `DmaBufferCap` routes the
explicit trigger through the bounded `FreeBuffer` cleanup path and proves page
scrub/ledger/frame-free labels before stale rerun and post-trigger cap release
both return `noop`; `DmaPoolCap` routes the explicit trigger through the
zero-live evidence-gated detach path and proves authoritative zero-live,
quiesced, and scrubbed evidence labels before stale rerun and post-trigger cap
release return `noop`.

Implementation note, 2026-05-03 01:05 UTC: the schema and kernel now include a
result-only `Interrupt.info` skeleton that can wrap a manager-issued device
handle plus the attached `DeviceInterruptRoute`. The object validates the
live manager record, owner, claimed route, and attached route record through
`validate_interrupt_record()` before returning status labels such as
`userspaceInterrupt=manager-issued-skeleton`,
`managerRecord=validated-active`, `routeRecord=manager-attached-route`,
`realInterruptDelivery=not-delivered`, `wait=admission-check-only`,
`acknowledge=admission-check-only`, `mask=route-state-control`,
`unmask=route-state-control`, and `bootstrapGrant=blocked`. The interrupt handoff
QEMU proof constructs that cap object while the route record is active,
records `interrupt_cap_info_result=ok`, exercises the serialized
`CapObject::call(0, &[])` path and decodes the returned `Interrupt.info`
Cap'n Proto result as `interrupt_cap_serialized_call_result=ok`, then verifies
the same cap fails closed after revoke begins as
`interrupt_cap_stale_after_revoke_result=interrupt-stale-handle`; the same
stale object also fails the serialized method-0 path as
`interrupt_cap_serialized_stale_after_revoke_result=invoke-failed`. A later
manifest-grant smoke explicitly releases the granted `Interrupt` cap through
`CAP_OP_RELEASE` and proves a subsequent typed `Interrupt.info` call fails
closed from userspace. A focused grant-cycle smoke now repeats that grant,
release, and stale-info proof twice in sequence and asserts the second
manager-grant-source acquire preserves the source generation and receives a
fresh route generation after the first release; the same smoke also decodes
both acquire/release cycles through the typed volatile
`HardwareAuditLog.snapshot` surface. The focused
hardware-audit interrupt-waiter smoke also decodes recent boot-time
`DmaBuffer`, `DmaPool`, and `Interrupt` `driver-crash` / `reset-disable`
lifecycle records from the current volatile 16-record snapshot window. The same
smoke now uses the `startSequence` cursor to decode older retained
`DeviceMmio` lifecycle rows that the default latest 16-record tail has skipped.
A `2026-05-09 19:18 UTC` follow-up adds a bounded `Interrupt.wait` admission
method to that skeleton. The method validates the same manager-attached route,
snapshots a pending-token candidate, delegates to the shared
`capos-lib::device_authority` pending-IRQ validator, and returns typed labels
through `capos-rt`; the focused grant smoke asserts the current masked-route
result `stale-pending-irq-masked`, reason `route-masked`,
`side-effect-blocked`, matching token/current source and route generations,
unchanged delivery counts, no waiter wake, and fail-closed behavior after cap
release. This is bounded
manager-issued skeleton evidence only: there is no blocking wait,
real hardware acknowledgement, real hardware mask/unmask side effect,
interrupt delivery authority, real waiter object, production `Interrupt` completion,
durable/signed audit persistence, or concurrent sharing claim. A
`2026-05-09 23:21 UTC` follow-up adds bounded `Interrupt.acknowledge`
admission to the same skeleton. It validates the manager-attached route through
the existing `Acknowledge` authority path, returns
`admission-check-only`, `interrupt-ack-not-attempted`, and
`side-effect-blocked`, and proves delivery counts remain unchanged with no
waiter wake or hardware acknowledgement.
A `2026-05-09 23:52 UTC` follow-up adds bounded `Interrupt.mask` and
`Interrupt.unmask` admission to the same skeleton. They validate the
manager-attached route through the existing `Mask` and `Unmask` authority
paths, return `admission-check-only`,
`interrupt-mask-not-attempted` / `interrupt-unmask-not-attempted`, and
`side-effect-blocked`, and prove route state and delivery counts remain
unchanged with no hardware mask/unmask, waiter wake, or IRQ delivery.
A `2026-05-10 04:01 UTC` follow-up promotes those methods to bounded
route-state control over the manager-attached dispatch slot. `Interrupt.unmask`
now changes `claimed-masked` to `driver-unmasked`, `Interrupt.mask` changes it
back to `claimed-masked`, and both preserve delivery counts while still
avoiding hardware MSI/MSI-X table programming, waiter wakeup, hardware
acknowledgement, or real IRQ delivery.
A `2026-05-10 22:54 UTC` follow-up wires real waiter completion to the
existing route-dispatch delivery counter from scheduler/poll context. The
poller observes matching delivered routes by vector, source generation, and
route generation without taking the waiter-table lock in the IRQ dispatch path,
then revalidates the manager-attached route under the route-post exclusion
before posting the deferred cap completion. The focused grant smoke proves the
first unmasked manifest-granted wait completes as `interrupt-delivered` /
`waiter-completed-irq` with `real_interrupt_delivery=delivered` and an
advanced delivery count, while a second unmasked wait still remains pending
until `Interrupt.mask` completes it through the existing
`interrupt-waiter-cancelled` / `waiter-completed-no-irq` path. Stale, masked,
released, reset, or reused routes do not wake as delivered IRQs. This remains a
bounded route-dispatch proof; it does not program hardware MSI/MSI-X tables,
acknowledge hardware, add provider-driver interrupt consumption, or claim
hostile hardware isolation.

Implementation note, 2026-05-03 13:49 UTC: the result-only
`DMAPool.info`, `DMABuffer.info`, `DeviceMmio.info`, and `Interrupt.info`
surfaces now return numeric identity fields alongside the conservative status
labels. The fields mirror the documented handle identity shape: `deviceId`,
BDF bus/device/function, owner generation, and the relevant pool id/generation,
buffer slot/generation, BAR/mapping id/generation, or interrupt
source/generation/route generation. The QEMU proof logs and net smoke assert
the active serialized method-0 decode for those fields, while stale method-0
calls still fail closed as `invoke-failed`. This remains a result-only
manager-issued skeleton surface; it does not add production DMA allocation,
free, map, submit, or completion authority, production MMIO mapping or
doorbell authority, production interrupt wait/ack/mask/unmask authority, real
DMA page cleanup/reuse, hostile hardware isolation, or S.11.2 completion.

Implementation note, 2026-05-03 16:37 UTC: the bounded interrupt route
identity skeleton now carries separate source and route generations end to end.
`DeviceInterruptRoute`, `LegacyIoApicInterruptRoute`, route records,
diagnostics summaries, dispatch-slot metadata, and the device-manager attached
interrupt bridge store both generations. Registration allocates both fields,
PCI MSI-X route reassignment preserves source generation while bumping only the
route generation, release/re-register allocates both generations fresh, and
`Interrupt.info` returns the independent values. The QEMU net smoke asserts the
split in PCI and legacy route logs, metadata proof logs, handoff proof logs,
serialized `Interrupt.info`, and cap-release proof logs. This closes only the
bounded identity proof gap; it does not expose production userspace
`Interrupt` authority, create real waiters, or complete the S.11.2 hostile IRQ
smoke matrix.

## `DeviceMmio` Invariants

`DeviceMmio` is register authority, not memory authority.

- **Authority:** A holder may map only BARs or subranges recorded in the
  claimed device object. It may not map PCI config space globally, another
  function's BAR, RAM, ROM, or synthetic kernel pages.
- **Handle identity:** Each call checks the claimed device id, owner
  generation, BAR or subrange mapping record, and mapping generation before
  mapping, unmapping, reading, or writing registers.
- **Physical range:** Each mapping is bounded to the BAR's decoded physical
  range, page-rounded by the kernel, and tagged as device memory with cache
  attributes appropriate for MMIO. Partial BAR grants must preserve page-level
  isolation; otherwise the grant must cover the whole page-aligned register
  window and be treated as that much authority.
- **Ownership:** At most one mutable driver owner controls a device function's
  MMIO at a time. Management capabilities may inspect topology, but register
  writes require the claimed `DeviceMmio` object.
- **No DMA implication:** Mapping registers does not grant any DMA buffer,
  frame allocation, interrupt, or config-space authority. Doorbell writes are
  accepted only as effects of register access; descriptor validity is enforced
  by `DMAPool` before queues are made visible to the device.
- **Revocation:** Revocation unmaps the driver's register pages, marks the
  device object unavailable for new calls, and invalidates outstanding MMIO
  handles. Stale mappings or calls fail closed.
- **Reset:** Revoking the final mutable `DeviceMmio` owner resets or disables
  the device unless a higher-level device manager explicitly transfers
  ownership without exposing it to an untrusted holder.

## `Interrupt` Invariants

`Interrupt` is event authority for one routed source.

- **Authority:** A holder may wait for, mask/unmask where supported, and
  acknowledge only its assigned vector, line, or MSI/MSI-X table entry. It may
  not reprogram arbitrary interrupt controllers or claim another source.
- **Handle identity:** Each wait, mask, unmask, and acknowledge checks the
  claimed device id, owner generation, source id, source generation, route
  generation, and any live waiter generation before affecting delivery state.
- **Ownership:** Each interrupt source has one delivery owner at a time.
  Shared legacy lines must be represented as a kernel-demultiplexed object
  with explicit device membership, not as ambient access to the whole line.
- **Range:** The capability records the hardware source, vector, trigger mode,
  polarity, and target CPU/routing state. User-visible operations are checked
  against that record.
- **Revocation:** Revocation masks or detaches the source, drains pending
  notifications for the old holder, invalidates waiters, and prevents stale
  acknowledgements from affecting a new owner.
- **Reset:** If the source cannot be detached cleanly, the owning device is
  reset or disabled before the interrupt is reassigned.
- **No MMIO or DMA implication:** Interrupt delivery does not grant register
  access, DMA buffers, or packet memory.

## Revocation Ordering

Device revocation must follow a fixed order:

1. Stop new submissions by invalidating the driver's user-visible handles.
2. Revoke MMIO write authority by write-blocking or unmapping BAR pages, or by
   disabling the device before any DMA teardown starts.
3. Mask or detach interrupts.
4. Quiesce virtqueues or device command queues.
5. Reset or disable the device if in-flight DMA cannot be accounted for.
6. Remove IOMMU mappings or invalidate bounce-buffer handles.
7. Scrub and free DMA pages.

This order prevents a stale driver from racing revocation with doorbell writes,
interrupt acknowledgement, or descriptor reuse. Logical handle invalidation is
not sufficient while a BAR remains mapped; register-write authority must be
removed or the device must be disabled before descriptor or DMA-buffer
ownership is reclaimed.

Implementation should represent the order as an explicit device-owner state
machine rather than as ad hoc booleans:

```rust
enum DeviceOwnerState {
    Active,
    RevokingHandles,
    MmioRevoked,
    InterruptsDetached,
    QueuesQuiesced,
    Resetting,
    DmaMappingsRemoved,
    Dead,
}
```

No path may free or reassign DMA pages until the state has reached
`QueuesQuiesced` with all in-flight descriptors accounted for, or `Resetting`
has completed and the device can no longer write old buffers. `Dead` means all
user-visible handles are invalid, interrupts are detached or masked, DMA
mappings are removed, and pages have been scrubbed or transferred to a trusted
owner.

Hard invariants:

- DMA pages cannot be freed before `QueuesQuiesced` or a completed
  `Resetting` transition proves old DMA writes are stopped.
- MMIO write authority must be revoked before DMA ownership teardown.
- Interrupt reassignment cannot happen before old pending notifications are
  drained or generation-invalidated.
- Device reset is mandatory if in-flight DMA cannot be proven stopped.

## Future Userspace-Driver Transition Criteria

Moving NIC or block drivers out of the kernel is gated by Security
Verification Track S.11.2. The gate is only open when all rows below are
implemented and demonstrated. The S.11.2.N labels are local checklist row
IDs for this gate.

The completed Device Driver Foundation selected milestone used this track as
the prerequisite for the `DMAPool`, accounting, and hostile-smoke sub-gate.
Future DDF follow-ups still use these rows as the userspace-driver transition
gate: generic MSI/MSI-X dispatch and second-device reuse may land first, but
userspace `DeviceMmio` and `Interrupt` exposure stays blocked until these rows
pass.

### Production `DMAPool` Ledger Prerequisite

Before userspace NIC or block drivers receive `DeviceMmio`, `Interrupt`, or
`DMAPool` handles, the device manager must own one ledger of record for each
claimed device. That ledger is the authoritative source for every
device-visible hold, not a diagnostic mirror of separate subsystems.

The ledger records:

- DMA pool bytes reserved and live;
- DMA buffer count, slot generation, and owner generation;
- mapped userspace DMA VMAs, quiesce state, scrub state, and release
  eligibility for each attached DMA pool;
- descriptor and ring depth limits, including live in-flight submissions and
  completions;
- page-rounded MMIO mappings and their owning `DeviceMmio` generations;
- interrupt holds, waiter generations, and routed-source generations;
- budget and OOM policy for allocation, queue growth, mapping, and interrupt
  attachment;
- teardown state in the device-owner state machine.

Every operation that creates, consumes, or releases device-visible authority
must update this ledger as part of the same ownership transaction that changes
device-manager state. That includes DMA buffer allocation/free, descriptor
submission, completion accounting, BAR mapping/unmapping, interrupt
attach/detach, reset, revoke, process exit, and capability release.

Implementation note, 2026-05-03 13:18 UTC: the QEMU `virtio-rng` metadata path
now runs a bounded teardown-trigger proof for `cap-release`, `process-exit`,
`driver-crash`, `reset-disable`, `interrupt-waiter`, `future-devicemmio`, and
`future-dmapool`. Each trigger row sequentially claims and transfers the same
PCI function, begins revocation, walks the existing device-owner state machine
to `Dead`, releases only after `Dead`, and proves generation bumps, stale
handle rejection, direct state-skip rejection, pre-`Dead` release rejection,
and per-trigger coverage without duplicates. The `cap-release` row attaches a
bounded manager-owned `DeviceMmio` record to the active driver handle, removes a
`DeviceMmioCap` from a cap table, runs the `CapOpRelease` hook, and records
cap-table removal plus detached/stale manager validation before normal
revocation. The `process-exit` row attaches the same bounded `DeviceMmio`
record shape to a real proof `Process`, runs
`Process::release_caps_for_exit()`, and records cap-table removal plus
detached/stale manager validation before normal revocation. The `driver-crash`,
`reset-disable`, and `interrupt-waiter` rows register and claim bounded PCI
MSI-X lifecycle-probe routes, attach them to the device manager, prove
`InterruptsDetached` is blocked as `interrupts-attached`, detach and release
the routes while still in `MmioRevoked`, and then advance normally.
The `future-devicemmio` row attaches a bounded manager-owned `DeviceMmio`
record from the first decoded PCI memory BAR, proves `MmioRevoked` is blocked
as `devicemmio-attached`, detaches while still in `RevokingHandles`, and then
advances normally. The `future-dmapool` row attaches a bounded zero-live
`DMAPool` record, proves `DmaMappingsRemoved` is blocked as
`dmapool-attached`, detaches while still in `Resetting`, and then advances
normally. The generic teardown-trigger summary reports no label-only rows
and seven object-backed rows, while the route-aware interrupt handoff smoke
also labels the claimed MSI-X route as bounded interrupt-waiter blocker
evidence: `interrupt_waiter_object=interrupt-route-record`,
`interrupt_waiter_block_state=InterruptsDetached`,
`interrupt_waiter_block_result=interrupts-attached`,
`interrupt_waiter_detach_result=ok`, and
`interrupt_waiter_route_generation_preserved=true`. This bounded
route-record evidence is contract proof for the shared ownership transaction
only: it does not expose production userspace authority handles, real MMIO,
real DMA, a userspace waiter, or production crash/reset observers. Separate
first `DeviceMmioCap`,
`InterruptCap`, `DmaPoolCap`, and `DmaBufferCap` release-hook proofs now
exercise both the production ring `CAP_OP_RELEASE` dispatch path and a real
`Process::release_caps_for_exit()` path for those cap objects, validating
cap-table removal plus exact manager-owned `DeviceMmio` detach,
manager-attached interrupt-route release, bounded zero-live `DMAPool` detach,
or proof-owned DMA-buffer record cleanup. The generic route-record trigger rows
and remaining DMA production work do not yet implement production observers,
production interrupt-waiter objects, userspace `DeviceMmio`, production
userspace `DMAPool`/`DMABuffer` authority, full device authority, or true
pending hardware MSI/reset-hostile route teardown.

Implementation note, 2026-05-08 10:08 UTC: the first cap-specific
reset/disable trigger entry points now exist for `DeviceMmioCap` and
`InterruptCap`. `trigger_reset_disable_for_devicemmio` and
`trigger_reset_disable_for_interrupt` route through the same idempotent
stale-safe detach helpers as cap release and driver-crash cleanup, emit one
`cap-audit: ... event=reset-disable detach=ok` line on the first successful
trigger, and keep stale reruns silent. This is still bounded trigger plumbing:
the reset/disable observer, non-proof DMA cleanup integration, userspace
MMIO/interrupt operations, and IOMMU-backed remapping work remain future
requirements.

Implementation note, 2026-05-08 10:39 UTC: the DMA caps now have the matching
cap-specific reset/disable trigger plumbing. `DmaPoolCap::on_reset_disable`
uses the same authoritative zero-live/quiesced/scrubbed evidence-gated detach
as cap release and driver-crash cleanup. `DmaBufferCap::on_reset_disable`
reuses the bounded `FreeBuffer` authority validation and page-scrub/frame-free
cleanup path, then leaves the parent pool attached until staged zero-live
cleanup. `make run-net` asserts the `dmabuffer-reset-disable-hook` and
`dmapool-reset-disable-hook` proof lines, stale rerun `noop`, revoked
cap validation, post-trigger release `noop`, and exact-one
`cap-audit: cap={dmabuffer,dmapool} event=reset-disable` lines. This is still
proof-owned no-real-DMA cleanup; production userspace `DMAPool`/`DMABuffer`
authority and non-proof page lifecycle integration remain future work.

Budget or OOM failure is closed before the driver can observe a new handle,
program a descriptor, map MMIO, attach an interrupt, or ring a doorbell. A
failed submission must leave no live descriptor hold behind, or must leave an
explicit in-flight record that teardown can drain or reset. A completed
teardown must reconcile the ledger to zero live DMA buffers, zero live MMIO
mappings, zero interrupt holds, and no in-flight descriptor submissions for
the released device generation.

Implementation note, 2026-05-02 06:59 UTC, updated 2026-05-11 06:10 UTC: the
current kernel-owned virtio-net ledger now proves the closed budget/OOM cases
above with a scratch ledger and the live ledger validation described earlier.
Imported live device-manager `DMAPool` records still preserve the
`device_dma:virtio-net` source policy and prove imported live accounting stays
within its aggregate in-flight budget while preserving that policy's per-queue
queue/submission depth limits. The manifest-granted manager-owned
bounce-buffer `DMAPool` path now attaches its own device-manager budget policy
to userspace `DMAPool.allocateBuffer` handle creation and the current
fixed-slot `DMAPool`/`DMABuffer` transfer, release, pending-release, drop,
rollback, teardown-detach, page-release, and descriptor-completion cleanup
paths. The full eight-slot pool fails as `dmapool-budget-exceeded` /
`over-buffer-budget` before allocation, cap minting, or ledger mutation, and
the selected release paths revalidate current or next accounting before
advancing manager-owned state. Production userspace `DMAPool` records must
still attach budget checks to broader provider-driver transfer/revoke/reset
transactions, IOMMU or direct-DMA mapping state, and non-fixed-slot
allocation before this row can be treated as the complete userspace-driver
transition gate.

Implementation note, 2026-05-02 08:33 UTC: the QEMU `virtio-rng` metadata path
now runs a bounded `DMAPool` record lifecycle proof on the device-manager
teardown state. The first slice keeps the record zero-live: it records a pool
slot, pool generation, and owner generation, rejects stale and
owner-mismatched attach attempts, rejects duplicate attachment, and proves that
`begin_revocation` invalidates the user-visible pool handle by bumping the
device owner generation. The ordered teardown path now fails closed with
`dmapool-attached` if it tries to enter `DmaMappingsRemoved` while the pool
record remains attached. The current continuation also proves that the revoke
handle cannot detach the zero-live pool without scratch authoritative
zero-live, quiesced, and scrubbed evidence bound to that record's source; a
mismatched scratch source is rejected before detach. With matching
proof-scoped evidence, the record detaches after queues are quiesced/reset and
before `DmaMappingsRemoved`. Later bounded manifest grants expose conservative
`DMAPool`, `DeviceMmio`, and `Interrupt` surfaces; the current `DMAPool` grant
can mint only eight fixed manager-attached proof `DMABuffer` result caps. The
remaining gap is production userspace authority, allocation beyond those eight
fixed slots, real device-visible page allocation through the device manager,
non-proof DMA page lifecycle integration, IOMMU remapping, and the S.11.2
hostile smoke matrix.

Current QEMU evidence: the QEMU virtio-net path now adds
the corresponding imported live-accounting prerequisite proof. A
device-manager `DMAPool` record is attached with accounting derived from the
live `device_dma` ledger: live buffer/page count, live bytes, current in-flight
submissions, committed/resident/unswappable residency flags, and
scrub-before-release policy. `DmaMappingsRemoved` fails closed with
`dmapool-attached` while the record remains attached, direct teardown detach
fails closed with `dmapool-live` while the authoritative ledger remains live,
and the live proof consumes the `device_dma` teardown-evidence API, observes
`authoritative-ledger-live` with matching imported live accounting, and
explicitly defers completion with no real DMA teardown attempted. The same
proof path now validates the imported `DMAPool` record through
`capos-lib::device_authority` for the active handle and stale-after-revoke
failure labels.
This does not create production userspace handles, real page-release hooks,
IOMMU mapping invalidation, scrubbed release, terminal `Dead`, or
hostile-smoke coverage for the live virtio-net record. The companion
scratch-ledger proof covers the positive zero-live teardown-evidence result
without claiming that the live virtio-net record has been torn down. The
manager-owned zero-live lifecycle proof consumes matching-source
device-manager teardown evidence for the positive detach/`DmaMappingsRemoved`
path and separately proves mismatched-source and missing-evidence detach
attempts fail closed. The manifest-granted bounded `DMAPool` path now keeps
mapped userspace VMA count, in-flight descriptor holds, residency,
quiesce/scrub state, and release eligibility in that manager record. Borrowed
or device-visible pages remain committed, resident, unswappable,
generation-bound, and unavailable for reuse until the manager record is
zero-live, unmapped, quiesced, and scrubbed. Descriptor submission is refused
while a buffer is borrowed to userspace, and release consumes manager-owned
teardown evidence instead of proof-only `device_dma` zero-live evidence.
The corresponding `DMAPool.info` ABI reports mapped VMA count, quiesce state,
scrub state, and release eligibility for QEMU proof assertions. This is still
bounded bounce-buffer lifecycle authority only: direct DMA, host physical or
IOVA exposure, IOMMU/remapping, production provider-driver consumption,
durable audit, and broader transfer/revoke policy remain future work.

| Gate item | Required state | Must-have proof |
| --- | --- | --- |
| S.11.2.0 DMA-owned buffers | `DMAPool` owns every driver-visible DMA mapping. | A driver receives opaque buffer handles or IOVA-only values; no path hands out raw host physical addresses. |
| S.11.2.1 Bound checks | Allocation, descriptor chain length, alignment, segment length, and ring depth are bounded and constant-time validated before ring submission. | Ring submissions fail closed on overflow, wrap, stale-handle, and freed-handle reuse attempts. |
| S.11.2.2 Explicit remap/ownership | `DeviceMmio` can only grant claimed BAR pages; cache attributes and write policy are enforced. | Driver cannot access unclaimed BARs, ROM, RAM pages, config-space globals, or stale mappings after revoke. |
| S.11.2.3 Interrupt correctness | `Interrupt` owns exactly one logical source at a time and drains/waits only for that source. | Reassigning an owner invalidates old waiters and masks or detaches the source first. |
| S.11.2.4 Quiesce + reset contract | Device manager can force reset/disable on failed revoke or teardown. | No in-flight descriptor may continue touching freed buffers after driver removal. |
| S.11.2.5 Process lifecycle | Capability release, process exit, and process-spawn cleanup paths cannot leak DMA pages/MMIO/intr ownership. | Crash-path teardown removes holds and invalidates user-visible handles before page free. |
| S.11.2.6 Isolation and accounting | Security Verification Track S.9 quota and authority ledgers include DMA, MMIO, and interrupt hold edges. | A malicious or buggy driver cannot consume more than its allocated authority budget. |
| S.11.2.7 Stale IRQ ordering | Stale interrupt delivery after revoke cannot wake, acknowledge, or signal a new owner. | Interrupt generation mismatch is ignored, or the source is masked/detached/reset before reassignment. Hostile smoke revokes a driver while an interrupt is pending, reassigns the source, and proves the old waiter cannot wake against the new owner. **Closed `2026-05-05 18:17 UTC`** by `make run-net`'s `device-manager: interrupt handoff proof` line: real `INT $vector` injection across revoke, detach, and reset/reuse exercises the production IDT entry/handler/EOI path, asserts `s11_2_7_real_irq_injected_across_reset=ok`, `s11_2_7_old_waiter_cannot_wake_new_owner=true`, and `s11_2_7_stale_ack_blocked=true`, and is enforced by `tools/qemu-net-smoke.sh`. Userspace `Interrupt` waiter objects remain a future requirement for a full production-driver path. |
| S.11.2.8 Stale DMA completion ordering | Stale DMA completion after revoke cannot cause freed buffer reuse, stale CQ notification, or new-owner memory exposure. | **Closed `2026-05-05 19:37 UTC`** by `make run-net`'s `device-manager: dma completion handoff proof` line: real virtio-net DMA page free + reallocate cycle bumps the live page generation, then the production `device_dma::record_virtio_net_completion_for_allocation` path (the same function the live `Virtqueue::record_used_completion_for_allocation` invokes) is fed a stale `DeviceDmaAllocation` keyed to the live phys with a decremented generation, at three boundaries (after revoke, after detach, after reset/reuse). All three reject as `stale-dma-handle` with `side-effect-blocked`, queue accounting unchanged, live new-owner page preserved, no CQ publication, no new-owner exposure, and the freed-buffer slot remaining unchanged. The closure summary asserts `s11_2_8_real_completion_injected_across_reset=ok`, `s11_2_8_old_completion_cannot_publish_to_new_owner=true`, `s11_2_8_freed_buffer_reuse_blocked=true`, and `s11_2_8_accounting_underflow_blocked=true`, and is enforced by `tools/qemu-net-smoke.sh`. Prior acceptance text: in-flight DMA is accounted for, or device reset/disable completes before buffer reuse; hostile smoke covers revoke/reset with outstanding descriptors and proves no old completion can publish new-owner memory. S.11.2.9 hostile-smoke gate-wiring also closed `2026-05-05 20:49 UTC` (see the row below). Userspace `DMAPool` handles and real device-manager page quiesce/scrub/release hooks remain open as separate follow-ups. |
| S.11.2.9 Hostile-smoke coverage | QEMU/CI smokes cover stale handles, descriptor abuse, revoke races, stale IRQ after reset, stale DMA completion after reset, and exit-under-dma. | Smoke output has explicit closed-case proof lines for each above failure mode. **Closed `2026-05-05 20:49 UTC`** by aggregating the existing per-row proof lines into the `make run-net` -> `tools/qemu-net-smoke.sh` gate. Every matrix-row proof line has at least one assertion in the harness; the original two driver-crash assertions, the existing S.11.2.8 `device-manager: dma completion handoff proof` closure-summary assertion, and the S.11.2.7 `device-manager: interrupt handoff proof` closure-summary assertion (whose trailing anchor was added by this slice for harness-strictness consistency with S.11.2.8) all use the anchored extended-regex shape (field-by-field match plus `proof_result=ok[[:cntrl:]]?$` trailing anchor), and the other matrix-row rows reuse the harness's pre-existing mix of unanchored extended-regex and fixed-string `grep -Fq` assertions. A `2026-05-08 09:44 UTC` follow-up adds anchored assertions for the cap-specific `dmabuffer-driver-crash-hook` and `dmapool-driver-crash-hook` proof lines; a `2026-05-08 10:08 UTC` follow-up adds anchored assertions and exact-one audit counts for the first cap-specific `devicemmio-reset-disable-hook` and `interrupt-reset-disable-hook` proof lines; a `2026-05-08 10:39 UTC` follow-up does the same for `dmabuffer-reset-disable-hook` and `dmapool-reset-disable-hook`; a `2026-05-08 13:42 UTC` follow-up (`aeef8b41`) adds the cap-specific `device-manager: interrupt waiter hook proof source=interrupt-waiter-hook ... trigger_path=trigger-interrupt-waiter-for-interrupt` assertion plus an exact-one `cap-audit: cap=interrupt event=interrupt-waiter` count. Per-row coverage: stale DMA handle (`device-dma: stale dma handle proof`, `device-dma: live stale dma completion accounting proof`); descriptor abuse (`virtio-net: software descriptor generation model proof`, `virtio-net: invalid used descriptor id software-token proof`, `virtio-net: descriptor generation guard proof ok`, `virtio-net: invalid used descriptor id live software-token proof ok`, plus `device-dma: budget oom proof`); revoke/reset race (`device-manager: ownership proof`, the seven `device-manager: teardown trigger proof trigger=...` variants plus the final aggregate, `device-manager: dma completion handoff proof` for S.11.2.8, `device-manager: interrupt handoff proof` for S.11.2.7, the `device-manager: devicemmio driver crash hook proof source=devicemmio-driver-crash-hook ... trigger_path=trigger-driver-crash-for-devicemmio`, `device-manager: interrupt driver crash hook proof source=interrupt-driver-crash-hook ... trigger_path=trigger-driver-crash-for-interrupt`, `device-manager: dmabuffer driver crash hook proof source=dmabuffer-driver-crash-hook ... trigger_path=trigger-driver-crash-for-dmabuffer`, `device-manager: dmapool driver crash hook proof source=dmapool-driver-crash-hook ... trigger_path=trigger-driver-crash-for-dmapool`, `device-manager: devicemmio reset disable hook proof source=devicemmio-reset-disable-hook ... trigger_path=trigger-reset-disable-for-devicemmio`, `device-manager: interrupt reset disable hook proof source=interrupt-reset-disable-hook ... trigger_path=trigger-reset-disable-for-interrupt`, `device-manager: dmabuffer reset disable hook proof source=dmabuffer-reset-disable-hook ... trigger_path=trigger-reset-disable-for-dmabuffer`, `device-manager: dmapool reset disable hook proof source=dmapool-reset-disable-hook ... trigger_path=trigger-reset-disable-for-dmapool`, and `device-manager: interrupt waiter hook proof source=interrupt-waiter-hook ... trigger_path=trigger-interrupt-waiter-for-interrupt` lines, all requiring first-trigger `ok`, stale rerun `noop`, cap `validate_live=revoked`, post-trigger release `noop`, and `proof_result=ok` with cap-specific cleanup/evidence labels); stale IRQ after reset (S.11.2.7 closure summary, see row above); stale DMA completion after reset (S.11.2.8 closure summary, see row above); exit-under-DMA (`device-manager: teardown trigger proof trigger=process-exit owner=virtio-rng`, the teardown-trigger aggregate `triggers=cap-release,process-exit,driver-crash,reset-disable,interrupt-waiter,future-devicemmio,future-dmapool` line, the four cap-release-hook proofs each containing `process_exit_path=process-release-caps-for-exit`, plus `hardware-cap-release: ... reason=process-exit` count assertions). A `2026-05-23 21:34 UTC` follow-up adds the IOMMU production `DMAPool` hostile proof over the active mapped ledger, covering stale IOVA after revoke/reset, descriptor abuse, revoke/reset race ordering, stale completion after reset, teardown-under-DMA ordering, cross-domain stale-handle attempts, and the fail-closed teardown branch proof; process-exit/exit-under-DMA remains the existing run-net bounce-buffer evidence. Production userspace `DeviceMmio`/`Interrupt` handles, broader non-proof device-manager page quiesce/scrub/release hooks outside the selected IOMMU smoke, hardware-backed provider-driver `Interrupt` wait/ack dispatch beyond the bounded route-dispatch waiter proof, and durable/signed production audit consumption beyond the first volatile `HardwareAuditLog.snapshot` cap remain open as separate follow-ups. |

For each row, the transition requires an owner, implementation notes, and a CI-backed
verification path. Until all rows pass, `Phase 4.2` NIC/block drivers remain in-kernel for
functionality, and only kernel-mapped bounce-buffer mode is allowed for prototype DMA.

### Hostile-Smoke Acceptance Matrix

These smokes are the acceptance requirements for the userspace driver
transition. The S.11.2.7, S.11.2.8, and S.11.2.9 rows are now backed by
current `make run-net` QEMU evidence enforced by `tools/qemu-net-smoke.sh`
(see the per-row "Closed" notes for closure timestamps and the proof-line
shapes). The other matrix rows remain acceptance requirements for future
implementation work; their proof lines are emitted by the kernel today
and asserted by the same harness, but the production userspace handles,
real device-manager page quiesce/scrub/release hooks, real userspace
`Interrupt` waiter objects, IOMMU domain programming, and durable/signed
production audit consumption beyond the volatile `HardwareAuditLog.snapshot`
cap that complete each row's full closure remain open as separate
follow-ups.

| Hostile case | Required setup | Closed-case proof expectation |
| --- | --- | --- |
| Stale DMA handle | Allocate a DMA buffer, revoke or free it, advance the slot or pool generation, then attempt descriptor submission or buffer reuse through the old handle. | The operation fails closed on generation mismatch; no descriptor is made visible to the device, no DMA byte or buffer hold is restored, and any reused slot remains owned only by the new generation. |
| Descriptor abuse | Submit chains with out-of-pool addresses, stale or freed buffer slots, arithmetic wrap, misalignment, overlong segments, excessive chain length, or ring-depth overflow. | Validation rejects the chain before any doorbell write; the ledger shows no leaked descriptor hold, no in-flight increment without an owning buffer, and no access outside the pool range. |
| Revoke/reset race | Race revoke, reset, or process teardown against a driver that is submitting descriptors or ringing the device doorbell. | Revocation first invalidates handles and MMIO write authority; later submissions fail closed, existing in-flight records are either completed under the old generation or reset/disabled before page reuse, and teardown cannot skip to `DmaMappingsRemoved` while the ledger has live submissions. |
| Stale IRQ after reset | Leave an interrupt pending or a waiter blocked, reset or reassign the device/source, then deliver or acknowledge using the old generation. | The old waiter cannot wake against the new owner, stale acknowledgements do not affect the reassigned source, and the source is masked, detached, or generation-invalidated before reassignment. **Closed `2026-05-05 18:17 UTC`**: `make run-net` injects a real `INT $vector` through the IDT/handler/EOI path at three points across revoke, detach, and reset/reuse and records `s11_2_7_real_irq_injected_across_reset=ok`, `s11_2_7_old_waiter_cannot_wake_new_owner=true`, `s11_2_7_stale_ack_blocked=true`, plus matching `real_irq_inject_after_revoke_result=masked`, `real_irq_inject_after_detach_result=unregistered`, `real_irq_inject_after_reset_reuse_result=masked` on the kernel proof line. |
| Stale DMA completion after reset | Reset with outstanding descriptors, reuse or prepare to reuse pool slots, then inject or observe a completion from the old device generation. | The stale completion cannot publish a CQE to a new owner, cannot expose new-owner memory, cannot underflow accounting, and cannot make a freed buffer eligible for reuse unless reset/disable has proven old DMA stopped. **Closed `2026-05-05 19:37 UTC`**: `make run-net` walks a fresh device-manager record on the virtio-net BDF through the `Active>RevokingHandles>MmioRevoked>InterruptsDetached>QueuesQuiesced>Resetting>DmaMappingsRemoved>Dead` revocation path, exercises a real virtio-net DMA page free + reallocate cycle at three boundaries (after revoke, after detach, after reset/reuse), and feeds a synthesized stale `DeviceDmaAllocation` (live phys, decremented generation) to the production `device_dma::record_virtio_net_completion_for_allocation` path. Each boundary records `real_completion_inject_after_*_result=stale-dma-handle`, `_side_effect=side-effect-blocked`, `_queue_account_preserved=true`, `_live_page_preserved=true`, `_cq_publication_blocked=true`, `_new_owner_exposure_blocked=true`, `_freed_buffer_unchanged=true`, and `_generation_bumped=true`, plus a closure summary `s11_2_8_real_completion_injected_across_reset=ok`, `s11_2_8_old_completion_cannot_publish_to_new_owner=true`, `s11_2_8_freed_buffer_reuse_blocked=true`, `s11_2_8_accounting_underflow_blocked=true`. |
| Exit-under-DMA | Terminate or crash a driver process while it holds DMA buffers, MMIO mappings, interrupt waiters, and in-flight descriptors. | Process exit enters the device-manager teardown path, invalidates all user-visible handles, revokes MMIO, detaches interrupts, quiesces or resets queues, scrubs DMA pages before release, and reports a terminal ledger with no live holds for the old owner generation. |

## Security Verification Track S.11.2 Decision Record

Security Verification Track S.11.2 is backend-scoped. The current
brokered-bounce userspace-provider path has enough reviewed evidence to close
the retained DDF production-authority finding, but that closeout is not a
general direct-DMA, hostile-hardware, or device-autonomous interrupt claim.

Current status: the brokered-bounce transition path is represented by done task
evidence for `DMAPool`, `DeviceMmio`, and `Interrupt` lifecycle ownership,
provider virtio-net/NVMe chains, and hardware-audit consumption of abort-held
DMA mappings. The broader S.11.2 matrix remains the canonical gate for future
direct-remapping/vIOMMU, trusted-sharing-group, hostile-hardware-isolation, or
provider-written-address work. This document fixes the production handle epoch
invariants, `DMAPool` ledger of record, and hostile-smoke acceptance criteria
used by the completed Device Driver Foundation documentation gate. The
current QEMU virtio-net path has a kernel-owned DMA pool ledger for page,
descriptor, MMIO mapping, and interrupt-hold accounting proof coverage plus
static IOMMU attachment-policy reporting for retained DMA-capable PCI functions
and the bounded teardown trigger contract proof, bounded kernel-owned
budget/OOM proof, manager-bound `DMAPool` budget-profile proof plus bounded
budget-policy tamper and accounting-over-budget fail-closed proofs,
bounded manager-owned `DeviceMmio` proof adapter bound to decoded PCI
memory-BAR metadata plus future cache/write-policy metadata, bounded zero-live
device-manager `DMAPool` record lifecycle proof, and imported live-accounting
block/defer proof plus zero-live teardown-evidence scratch proof, stale DMA
handle scratch proof, stale DMA completion scratch proof, paired scratch
CQ-publication/new-owner-exposure proof, live software descriptor-generation
guard proof, bounded invalid used-descriptor-id proof, and bounded stale IRQ
after-detach, counter-backed after-revoke, counter-backed route-registry
reset-reuse, and pending IRQ token checks described above. The
bounded pure `capos-lib::device_authority`
validator and host tests cover the documented identity, state,
side-effect-blocking, non-wrapping epoch cases, and every current operation
variant's exact blocked side-effect label for stale owner/subrecord, freed,
revoked, and retired failures. The zero-live
device-manager `DMAPool` lifecycle proof now validates a proof-scoped
tampered budget-policy record through the manager policy helper and records
fail-closed, no fake allocation, no ledger mutation, no teardown advancement,
and side-effect blocking while preserving the positive
`budget_policy_result=ok` path. The positive zero-live and imported-live
budget-accounting labels now go through the manager-owned active-record helper,
and synthetic over-budget attached-accounting candidates fail closed with exact
reasons while preserving the active manager record and blocking allocation,
ledger, teardown, and side effects; an over-budget attach candidate fails
before pool generation allocation. It also records a bounded
manager-attached DMA buffer handle under the attached pool, validates active
`SubmitDescriptor` and manager-record `CompleteDescriptor` through the pure
DMA-buffer validator, and records stale-after-revoke, freed-buffer, and
reused-slot rejection with exact reasons and `side-effect-blocked`; it now
also blocks pool teardown as
`dmapool-buffer-attached`, rejects a stale same-slot proof-scoped `FreeBuffer`
as `dmabuffer-stale-handle` with `stale-slot-generation` and
`side-effect-blocked`, rejects wrong-owner-generation, wrong-pool, wrong-pool
generation, and wrong-buffer-slot `FreeBuffer` attempts with exact pure
validator reasons and `side-effect-blocked`, preserves that manager-owned
buffer record after each failed free, and clears the record only after a
proof-scoped active `FreeBuffer` validation, proof-page scrub/free, and
manager-owned buffer-record detach. The completion proof does not publish a CQ
entry, complete a real descriptor, grant userspace authority, or clean up or
reuse production userspace DMA pages. The live virtio-net queue-completion
path now gates completion accounting on the completed descriptor's
`DeviceDmaAllocation` rather than the queue id alone: callers must validate
the used descriptor id, recover the matching `DmaPage`, and pass its physical
address, queue, label, and generation to the kernel-owned ledger before
in-flight accounting is decremented. The paired `run-net` proof records that a
stale generation for a live kernel-owned page fails as `stale-dma-handle`,
leaves queue accounting and the live page unchanged, and blocks CQ publication
plus new-owner exposure. This closes a live accounting prerequisite only; it
does not inject a real post-reset device completion or expose userspace DMA
authority. The live virtio-net used-ring path also carries bounded software
descriptor generations: submissions reject invalid or already-active descriptor
ids before accounting, completions must consume the active software token
exactly once, and the `run-net` proof records side-effect blocking for active
reuse, double completion, and an old software token after descriptor-id reuse.
That guard does not make a stale hardware used-ring id distinguishable after
deliberate id reuse because virtio used entries carry no device generation. The
same gate now also covers invalid used-descriptor ids without corrupting the
hardware ring: an out-of-range id fails as `descriptor-id-out-of-range` before
completion observation, completion accounting, `used_seen_idx`, CQ publication,
or new-owner exposure can change. This is still a software-token and
constructed-token prerequisite, not a real malformed-device or post-reset
completion injection. The
same zero-live proof now also constructs the result-only
`DMAPool.info` cap skeleton from the manager-issued `DmaPoolHandle`, validates
the active manager record before returning conservative status labels plus
numeric device/BDF/owner/pool identity fields, proves the serialized cap call
path decodes to those labels and identity fields with host physical exposure
off and direct DMA blocked, and proves the cap's info path fails closed as
`dmapool-stale-handle` after revoke begins. It also exercises
`DMAPool.allocateBuffer` through `call_with_table()` on a real cap-table entry,
returns zero-indexed `DMABuffer` result caps for eight fixed manager-owned
bounce-buffer slots, validates those result caps' `DMABuffer.info`, and proves
a ninth allocation fails through the manager-owned budget policy as
`dmapool-budget-exceeded` / `over-buffer-budget` before publishing another
result cap or corrupting live slot state; full-pool allocation also preserves
manager generation counters. Stale-after-revoke allocations still fail closed
without publishing another result cap. The same zero-live proof
constructs the result-only
`DMABuffer.info` cap skeleton from the manager-attached `DmaBufferHandle`,
validates the active manager-owned buffer record through the pure DMA-buffer
validator before returning conservative no-authority labels plus numeric
device/BDF, owner/pool/slot identity fields, proves the serialized cap call
path decodes to those labels and identity fields with host physical exposure
off and direct DMA blocked, and proves the cap's info path fails closed as
`dmabuffer-stale-handle` after revoke begins; the same stale cap's serialized
method-0 path fails as `invoke-failed`. The first `DmaBufferCap` release hook
now reuses the bounded `FreeBuffer` validation shape to clear only the
manager-attached `proof_buffer` record during cap-table removal, production
ring `CAP_OP_RELEASE`, and real `Process::release_caps_for_exit()` paths. It
proves stale same-slot release is `side-effect-blocked`, proves the parent
`DMAPool` remains attached after buffer release, proves the bounded manifest
grant can allocate the slot again after explicit `freeBuffer` with a fresh slot
generation, and still requires staged zero-live evidence before the parent pool
can detach. The selected provider-TX path now adds a bounded exception to the
default manager-accounting descriptor contract: queue `1` submits may publish
the selected eight-entry TX queue depth, descriptors `0..7`, into the existing
kernel-owned virtio-net TX ring before the first completion, ring one selected
notify doorbell per accepted provider descriptor, and then complete each
descriptor only after `DMABuffer.completeDescriptor` observes the matching
used-ring entry for the stored software descriptor generation. Those handoffs
clear the matching manager in-flight records, record bounded provider CQ
completion and acknowledgement counts, and can deliver ordered bounded
completion events to live `tx_interrupt.wait` calls for the same selected
route. The selected provider-TX path also proves a teardown-only drain when one
descriptor has completed and seven provider-published descriptors remain
incomplete: direct `DMABuffer.freeBuffer` remains blocked while in flight,
release explicitly drains only the incomplete matching used-ring entries and
retires those allocation-backed TX DMA queue ledgers without
`DMABuffer.completeDescriptor` results, no provider CQ/IRQ event is published
for the quiesced descriptors, release retires seven delivered-but-unacked
completion events, and later slot reuse requires a fresh generation plus normal
completion. Wrong-queue, stale-buffer, stale-notify, inflight-publication,
wrong-descriptor, duplicate-completion, and stale-`tx_interrupt` issue paths
remain side-effect-blocked before their guarded effects. This does not grant
direct DMA, arbitrary doorbells, arbitrary CQ ownership outside the selected TX
route, full virtio-net ownership, production NIC/storage migration, IOMMU
programming, hardware IRQ ownership, hardware acknowledgement, or broad
interrupt ownership beyond the bounded selected TX MSI-X mask/unmask proof.
The bounded `DeviceMmio` proof also records the manager-attached policy
metadata listed above, fails closed on a tampered cache/write-policy record
before creating any mapping, and validates active hostile handle identities for
wrong owner generation, wrong mapping generation, wrong mapping id, wrong BAR,
and wrong BDF/device with exact pure-validator reasons while preserving the
attached record and blocking mapping/doorbell side effects. Its serialized cap
call path also decodes to the direct `DeviceMmio.info` no-authority labels plus
numeric device/BDF, owner, BAR, mapping id, and mapping generation identity
fields with host physical exposure off and direct MMIO blocked, and its stale
serialized method-0 path fails as `invoke-failed`. The `DMAPool.info` skeleton
has the same kernel-side serialized stale failure evidence. The
interrupt handoff proof now also constructs a result-only `Interrupt.info`
cap skeleton from the manager-issued device handle and attached route record,
records active info success, proves the serialized cap call path decodes to the
direct no-authority labels plus numeric device/BDF, owner, source, source
generation, and route generation identity fields, proves those source and
route generations are distinct in the bounded route record, and proves
stale-after-revoke info fails closed as
`interrupt-stale-handle` plus stale serialized method-0 failure as
`invoke-failed` before any acknowledgement, mask, unmask, blocking wait, or
delivery authority exists. The manifest-granted skeleton now also exposes an
admission-only `Interrupt.wait` method that returns the pending-token
validator's fail-closed labels without waking a waiter or changing delivery
counts, and an admission-only `Interrupt.acknowledge` method that validates the
active route while blocking hardware acknowledgement and preserving delivery
counts. It also exposes route-state-control `Interrupt.mask` and
`Interrupt.unmask` methods that validate the active route before changing the
manager-attached dispatch slot between `claimed-masked` and
`driver-unmasked`, while preserving delivery counts. A bounded
`Interrupt.wait` call observed after unmask installs a fixed-table userspace
waiter object for the current manager-granted route; the existing
route-dispatch delivery counter can now complete that waiter as
`interrupt-delivered` / `waiter-completed-irq` with
`real_interrupt_delivery=delivered` and an advanced delivery count. The same
focused smoke then submits a second unmasked wait, observes it remains pending,
calls `Interrupt.mask`, and finishes that wait as
`interrupt-waiter-cancelled` / `route-masked` /
`waiter-completed-no-irq` with `wake_blocked=false`, preserved source/route
generations, and unchanged delivery counts. The selected provider TX
`tx_interrupt` cap can now observe the bounded used-ring completion event
described above and account the already observed selected TX dispatch token
paired with that delivered provider CQ event, but hardware MSI/MSI-X programming
beyond the selected vector-control proof, full hardware IRQ ownership, deferred
EOI, LAPIC/MSI-X acknowledgement, and broader production interrupt dispatch
remain blocked. Provider TX MSI-X mask/unmask is limited to the selected-route
vector-control proof described earlier. Provider RX MSI-X mask/unmask remains
bounded to the selected RX route as well; release while masked restores that
selected vector-control bit and route state before clearing the live issue gate.
RX unmask admits the route transition before exposing the MSI-X vector-control
bit, and the focused QEMU proof shows a failed route unmask leaves the selected
vector masked with the route ledger preserved. Cleanup failure still leaves the
issue uncleared so future RX cap issuance stays blocked on uncertain route
state. RX wait/ack is now bounded to one selected-route zero-CQ dispatch token;
RX descriptors and CQ ownership remain blocked.
This is manager-record skeleton/no-production-DMA, no-real-MMIO-mapping, and
bounded route-dispatch interrupt-waiter prerequisite evidence only. Production
`DMAPool`, `DeviceMmio`, and `Interrupt` capability handles, production
userspace `DMAPool` buffer handles, real `DeviceMmio` BAR mapping objects,
real cache attributes/write policy enforcement, production kernel device-path
wiring beyond the current proof adapters, real device-manager page
quiesce/scrub/release hooks and real page cleanup/reuse beyond the bounded
kernel-owned proof pages, production
handle-attached budget/OOM enforcement beyond the current manager-owned
`DMAPool.allocateBuffer` budget slice, IOMMU remapping domains, production
handle-attached host tests, QEMU stale-handle smokes, broader userspace
exposure, production NIC/storage migration, cloud readiness, and S.11.2
hostile smokes remain open.

Do not weaken the short-term virtio-net bounce-buffer path until `DMAPool`,
`DeviceMmio`, `Interrupt`, device-manager ownership transactions, lifecycle
teardown, accounting, and hostile smokes all exist.
