DMA Isolation Design

Security Verification Track S.11 gates PCI, virtio, and later userspace device-driver work on an explicit DMA authority model. The immediate goal is narrow: let the kernel bring up a QEMU virtio-net smoke without creating a user-visible raw physical-memory escape hatch.

Short-Term Decision

Use kernel-owned bounce buffers for the first in-kernel QEMU virtio-net smoke.

The first virtio-net smoke stays on this conservative path:

kernel-owned DMA pages
kernel-owned virtqueue descriptor tables
kernel-owned packet buffers
kernel-programmed physical addresses
copied packet bytes delivered to the network stack
no DMA buffer capability exposed to userspace
no physical address exposed to userspace
no virtqueue pointer exposed to userspace
no BAR mapping exposed to userspace

The kernel allocates DMA-capable pages from its own frame allocator, owns the virtqueue descriptor tables and packet buffers, programs the device with the corresponding physical addresses, and copies packet payloads between those buffers and the networking stack.

This is deliberately conservative:

It works before ACPI/DMAR or AMD-Vi parsing, IOMMU page-table management, MSI/MSI-X routing, and userspace driver lifecycle supervision exist.
It keeps all physical-address programming inside the kernel, where the same code that allocates the frames also bounds the descriptors that reference them.
It does not make the current FrameAllocator or MemoryObject capability part of the DMA path. FrameAllocator no longer exposes raw physical addresses, but DMA still needs device-owned buffer objects with IOVA and reset/revoke semantics rather than repurposed general memory caps.
It gives the smoke a disposable implementation path. When NIC or block drivers move to userspace, bounce-buffer authority becomes a typed DMAPool object instead of an ad hoc physical-address grant.

An IOMMU-backed DMA-domain model remains the target for direct device access from mutually untrusted userspace drivers, but it is not a prerequisite for the first QEMU smoke. Without an IOMMU, a malicious bus-mastering device can still DMA to arbitrary RAM at the hardware level; the short-term smoke assumes QEMU-provided virtio hardware and protects against confused or untrusted userspace, not hostile hardware.

IOMMU Staging

IOMMU support is a deferred-with-known-dependency prerequisite for production hardware claims and for moving direct DMA-capable NIC or block drivers into userspace. capOS now discovers bounded ACPI IOMMU table summaries for Intel DMAR and AMD-Vi/IVRS and records static DMAR DRHD include-all or single-hop PCI endpoint device-scope coverage for retained DMA-capable PCI diagnostics functions. Bridge and multi-hop scopes are retained for diagnostics but do not prove endpoint attachment until PCI topology traversal exists, and include-all fallback fails closed when retained DRHD units or scopes are capped.

The selected QEMU Intel remapping path now programs VT-d root/context and second-level tables for manager-owned DMAPool pages, reports bounded fault state, exports only domain-scoped IOVAs, and proves two claimed DMA-capable functions receive distinct per-device domains and second-level roots. It also asserts the production-path S.11.2 hostile-smoke matrix over the active DMAPool / DMABuffer ledger. The decomposed integration umbrella for this path closed 2026-05-23 23:35 UTC (ddf-iommu-remapping-production-closeout). This is still QEMU-only evidence for the selected path, not a general production hardware-isolation claim: trusted sharing groups, AMD-Vi programming, and production NIC/block userspace driver authority remain future work, and VM shapes without usable remapping hardware remain on the explicit bounce-buffer fallback.

The discovery parser is intentionally shallow and follows the static-table formats documented by the Intel VT-d architecture specification, the AMD IOMMU specification, and QEMU’s q35-only -device intel-iommu emulation:

Future real remapping work is grounded by the primary-source IOMMU remapping research note, which records Intel VT-d, AMD-Vi, and QEMU sections relevant to table programming, invalidation, fault/status diagnostics, and QEMU-only smoke tests. That note is source grounding only; it does not make the current diagnostics path a real remapping implementation.

Intel VT-d architecture specification: https://www.intel.com/content/www/us/en/content-details/671081/intel-virtualization-technology-for-directed-i-o-architecture-specification.html
AMD IOMMU specification: https://docs.amd.com/v/u/en-US/48882_IOMMU
QEMU manpage: https://www.qemu.org/docs/master/system/qemu-manpage.html?highlight=numa

The staged implementation order is:

Discover firmware IOMMU topology from ACPI static tables and fail closed if the tables are malformed, unsupported, or inconsistent with the PCI root complex being used. This first bounded table-discovery step is implemented for DMAR/IVRS summaries only; domain attachment is still planned.
Record each DMA-capable PCI function’s attachment to an IOMMU unit, or explicitly keep the function on the prototype bounce-buffer-required policy when no trusted IOMMU domain can be created. This reporting step is implemented for retained PCI diagnostics functions when DMAR DRHD include-all or single-hop PCI endpoint device-scope metadata proves PCI segment/BDF coverage. Bridge and multi-hop scopes are not treated as attachment proof until PCI topology traversal exists, and include-all fallback fails closed when retained DMAR coverage metadata is capped; trusted domain creation is still planned.
Define and prove the claimed-device domain policy: one device-manager-owned DMA domain per claimed device or trusted sharing group, with all exported device addresses represented as IOVAs scoped to that domain rather than host physical addresses. The selected QEMU Intel path now implements the per-device form for two claimed DMA-capable functions; trusted sharing groups remain disabled and out of scope.
Attach DMAPool allocation, descriptor validation, MMIO ownership, interrupt ownership, and revocation state to the same device-manager ledger before any doorbell write can make a descriptor visible to hardware.
On revoke, reset, or driver death, stop new submissions, remove or invalidate IOMMU mappings before page reuse, and flush the relevant IOTLB state where the hardware model requires it.

Until those gates exist, direct DMA and userspace driver handoff remain blocked. Devices that cannot be placed in a trusted IOMMU domain must stay on kernel-owned bounce buffers or remain unsupported for production claims. This also affects the hostile-smoke gate: S.11.2 smokes must prove that stale DMA handles, stale completions, reset races, and teardown ordering fail closed for IOMMU-backed IOVA mappings, while the process-exit / exit-under-DMA rows remain covered by the selected backend evidence before a cloud or hardware driver can be treated as isolated from the rest of memory.

Fallback Policy For No Usable IOMMU Exposure

Some providers or VM shapes may not expose remapping hardware that capOS can trust. That includes absent, malformed, unsupported, capped, or incomplete DMAR/IVRS metadata; scopes that require PCI topology traversal capOS has not implemented yet; and platforms where remapping hardware is unavailable or cannot be programmed safely. Those shapes use a fail-closed fallback policy:

Direct device DMA remains blocked. direct_dma_trusted_domains stays zero and remapping_tables stays not-programmed.
Prototype devices that remain enabled use kernel-owned bounce buffers only. The kernel or device manager owns the pages, descriptor validation, physical-address programming, and packet or block-data copies between device-visible memory and non-device memory. General FrameAllocator and MemoryObject capabilities are not DMA authorities.
capOS does not expose direct hardware authority for userspace DMAPool, DMABuffer, DeviceMmio, or Interrupt in the fallback shape. Result-only .info skeletons and bounded manifest grants may report conservative status. The current DMAPool manifest grant may allocate and free eight fixed manager-attached, kernel-owned, single-page bounce-buffer DMABuffer result caps, with backing pages scrubbed before frame release and no host physical address or IOVA exposed. That narrow fixed-slot allocation/free authority does not map DMA, program device-visible addresses, publish arbitrary CQ entries, program IOMMU/remapping tables, access arbitrary BAR registers or doorbells, or own hardware interrupt acknowledgement, mask, or unmask. The selected provider-TX proof is the current bounded exception: after the same manager-owned DMABuffer authority and bounce-scrub gates, queue 1 may publish the full selected TX queue-depth descriptor/avail window into the existing kernel-owned virtio-net TX ring before the first completion, ring one selected notify doorbell per accepted provider descriptor through the live no-write notify_mmio policy, and hand those bounded completions back through descriptor/generation-matched DMABuffer.completeDescriptor plus live tx_interrupt.wait completion events. The same selected path can also use tx_interrupt.mask/unmask to toggle only the selected TX MSI-X table vector-control bit and matching route state after live issue-id and route validation, and can retire one deferred LAPIC EOI for each delivered selected TX used-ring completion event, with Interrupt.acknowledge returning ABI-visible provider CQ/ack ledger fields plus hardware dispatch ack count, delta, token, and mutation flag for that bounded pairing. Full-queue QEMU bursts that coalesce selected TX MSI-X delivery use a bounded INT $vector proof hook only while the virtio TX completion path has an active full-window coalescing budget, so the selected IDT handler and deferred-EOI path remain observable without claiming full production IRQ ownership. Successful selected queue 1 DMABuffer.completeDescriptor, tx_interrupt.wait, and tx_interrupt.acknowledge results also carry bounded CQ event identity: sequence, queue, descriptor id, slot, slot generation, software descriptor generation, completion length, provider issue id, source id/generation, and route generation. Pre-event, duplicate ack, masked-route ack, wrong-order completion, teardown-drain, stale issue after release/regrant, reset, and stale-after-release paths keep that identity empty and do not mutate the bounded identity queue. Provider TX release also retires delivered but unacknowledged bounded CQ events for the live issue before clearing that issue: the stale post-release ack path is revoked, and the release proof records seven pending provider completion acks and their deferred EOIs as release-retired. The same selected path also has a bounded teardown-only drain for seven incomplete provider-published TX descriptors while one completed descriptor remains live: release may explicitly drain only the incomplete matching used-ring entries, retire those allocation-backed device-DMA TX queue ledgers, and free only after manager in-flight state is drained, without publishing provider CQ/IRQ events or issuing DMABuffer.completeDescriptor results. The paired provider RX bootstrap grant can now validate the live RX issue and selected virtio-net RX route before toggling only the selected RX MSI-X table vector-control bit and route state, and it can complete one selected-route RX Interrupt.wait after a delivered RX MSI-X/LAPIC dispatch. The paired Interrupt.acknowledge accounts exactly one RX dispatch token and retires one deferred LAPIC EOI for that delivered zero-CQ RX event; pre-event, masked-route, duplicate, and stale-after-release paths fail closed without mutating delivery or acknowledgement state. RX descriptor accounting and RX CQ ownership remain bounded to the synthetic proof path, and full hardware IRQ ownership remains blocked. These exceptions do not transfer full virtio-net ownership, direct DMA, IOMMU authority, arbitrary doorbells, production NIC/storage authority, or cloud readiness.
capOS does not claim hostile-hardware isolation for those shapes. A malicious or compromised bus-mastering device without a trusted remapping domain can still write arbitrary RAM at the hardware level. The fallback is acceptable only for prototype devices and trusted emulator or provider shapes where that hardware threat is outside the claim; otherwise the device remains unsupported.
Before any userspace driver path can rely on DMA or IRQ authority, S.11.2 hostile smokes must pass for the selected backend. That includes stale DMA handles, stale completions, descriptor abuse, revoke/reset races, stale IRQs, teardown-under-DMA for IOMMU-backed IOVA mappings, and exit-under-DMA for the fallback bounce-buffer path when the fallback is used.

This fallback policy is separate from current diagnostics-only IOMMU metadata coverage and from future real remapping-domain integration. Diagnostics can report static firmware-table coverage for a PCI function, but unless capOS creates a device-manager-owned remapping domain and programs mappings, the active direct-DMA policy remains blocked. Future real integration must attach DMAPool, DeviceMmio, Interrupt, ledger teardown, mapping removal or invalidation, and required IOTLB flushes to the same ownership transaction before a direct-DMA trusted-domain count can become nonzero.

DMA Assurance Model Checked Evidence And Cloud Backend Inputs

The DMA assurance model records the claim boundary and checked bounded evidence for DMA authority; the cloud backend contract it feeds is authoritative and lives in the “Cloud DMA Backend” section below: DMA Assurance Model and models/dma/. It is a design/evidence scaffold, not a new production hardware gate by itself. The checked gates are make model-dma-tla, make model-dma-alloy, make kani-dma-authority, and make model-dma-deferred-completion-loom; make dma-assurance-model-check aggregates them locally, while GitHub CI runs the Alloy/TLA+/Loom gates in dma-assurance-models and the Kani gate in kani-proofs. The operationalization track that reconciled the skeletons against landed DMA code is tracked in Security and Verification (“DMA Assurance Model Operationalization”).

Cloud NIC/storage work must use the model as the checklist for backend selection. Backend selection is a runtime, fail-closed decision the kernel makes on each boot, with an optional operator override declared in the system manifest; it is not a per-VM-shape safety assertion that a person signs off. The authoritative selection rule and the manifest override contract are defined in the “Cloud DMA Backend” section below.

Cloud backend evidence must separate provider-side DMA isolation from guest-controlled remapping authority. SR-IOV, virtual NIC, GPU, accelerator, or local NVMe support can identify a DMA-capable surface, but it is not enough to claim direct-DMA isolation. A direct-remapping backend needs guest-visible IOMMU or equivalent translation authority that capOS can discover and program. The cloud evidence matrix must record provider API or documentation sources, retrieval date, region or zone, instance type, image and kernel, live guest PCI/device probes, IOMMU table/group observations, and maintenance or device revocation behavior as the support-policy record for advertised targets. The runtime probe, not this matrix, makes the binding per-boot selection.

The matrix does not replace runtime selection. capOS must choose the safest backend on each boot from what it can actually observe and validate. Direct remapping is enabled only when guest-programmable remapping authority is present and passes the selected self-tests. A provider-remapped or bounce path is selected only when direct DMA remains blocked and device-visible memory can stay manager-owned. Ambiguous, contradictory, or unvalidated observations select Unsupported.

The backend candidates are:

Direct remapping domain. The provider shape must expose guest-programmable remapping hardware; capOS must discover and program a device-manager-owned domain for the target device; descriptor publication must be ordered after mapping; and teardown must remove mappings, observe required invalidation completion, and scrub before page reuse. The selected path must carry stale-handle, stale-completion, descriptor-abuse, revoke/reset-race, teardown-under-DMA, no-host-physical-exposure, and cross-domain alias evidence.
Labeled bounce-buffer fallback. Direct DMA stays blocked, device-visible memory remains manager-owned bounce pages, host physical addresses and generic MemoryObject authority stay hidden from the driver, and stale handle/completion/teardown evidence covers the selected fallback. This path must keep hostile_hardware_isolation=not-claimed unless separate per-domain remapping evidence justifies a stronger provider-specific claim.
Unsupported. Devices whose DMA behavior cannot satisfy either candidate stay unbound or disabled. A serial boot result or PCI enumeration line is not enough to claim cloud NIC/storage readiness.

Downstream cloud driver preflights must declare the candidate backend and map their evidence to the assurance model’s invariants: no host-physical exposure, mapping before publication, no page reuse before teardown, stale-handle and stale-completion fail-closed behavior, domain-scoped aliasing only, bounded fail-closed holds, and explicit backend evidence. The evidence matrix is a support-policy record of advertised targets; the runtime probe, not the matrix, selects the backend on each boot.

Cloud DMA Backend

This section is the authoritative contract for how capOS selects a DMA backend for cloud NIC/storage devices. Selection is a runtime, fail-closed decision the kernel makes on each boot from what it can actually probe and validate, with an optional declarative override in the system manifest. There is no human sign-off in the selection path: the runtime probe decides by default, and the manifest override is config that an operator sets for a deployment, not a doc-signing ritual gated on any specific person. Downstream cloud NIC/storage driver slices consume this contract directly as their DMA-backend authority.

The preceding “DMA Assurance Model Checked Evidence And Cloud Backend Inputs” section defines the three backend candidates; this section adds the per-candidate trade-off analysis, the runtime selection rule, the manifest override field, and the downstream-contract scaffolding that a cloud NIC/storage driver declares. The research substrate is the provider evidence inventory Cloud DMA Provider Evidence Inventory, and the invariants and tool mapping are in DMA Assurance Model.

Provider-Written Addresses And No-IOMMU Brokered Bounce

Two DMA-address ownership models can be valid, but they do not apply to the same backend.

Provider-written, kernel-validated addresses (the NVMe Model B validator) are valid only when the provider’s device-visible address is not a host physical address: a verified direct-remapping/vIOMMU domain-scoped IOVA, or a future synthetic software address namespace that the manager translates before hardware sees it.
Brokered address publication is the no-IOMMU bounce-buffer model. The provider may own protocol state and buffer capabilities, but the kernel or device manager writes device-visible queue-base, PRP/SGL, or virtqueue address fields because those values are host physical or bus addresses on current no-IOMMU hardware.

Correction recorded 2026-05-27: the earlier reconciliation that treated a no-IOMMU bounce window as a provider-visible, non-host-physical device address space is not valid for the current implementation. On the run-pci-nvme no-IOMMU shape, DeviceDmaAllocation carries host physical pages and the reviewed IOVA export discipline keeps userspace IOVA/host-physical export disabled. Therefore a provider-written NVMe queue base or PRP on that gate would export a host physical address, violating the no-host-physical-exposure invariant. A bounce buffer protects data ownership and copy discipline; it does not create an untrusted-driver-safe IOVA namespace by itself.

The kernel on-notify DMA validator (kernel/src/cap/nvme_doorbell_validator.rs, validate_doorbell_scan) remains useful evidence for the provider-written model. On a queue-arm/CC.EN write and on an SQ tail doorbell it scans the device-visible addresses the provider published (queue bases; PRP1/PRP2 and one level of PRP-list indirection) and fails closed before the doorbell takes effect on any address that is not wholly within a window granted to the doorbell claim’s owner at the live generation: out-of-window, host-physical, cross-owner-alias, region-overrun, unaligned, deeper-than-one-level PRP chain, or stale generation (ScanReject). Owner identity and live generation come from the grant ledger, never from provider-supplied metadata. A completion whose submission scan was never validated, or was validated under a now-retired generation, does not wake a waiter (completion_wakes_waiter), matching the stale-completion gate on the virtio-net path. That mechanism is the right fit for the QEMU direct-remapping lane and any future cloud shape that exposes guest-programmable remapping.

For the current GCP/no-IOMMU target, the storage path must use brokered bounce: userspace supplies typed commands, queue ownership intent, and live DMABuffer/buffer-cap handles; the manager materializes the actual device-visible queue-base and PRP/SGL fields and orders publication, teardown, copy, and scrub. That still leaves protocol-specific NVMe logic in userspace, but it does not let userspace author raw device addresses.

The brokered admin-queue enable landed 2026-05-27 (nvme-no-iommu-brokered-controller-enable, device_manager::nvme_brokered_admin_queue_enable). The provider allocates the admin submission/completion queue pages through its DMAPool cap and requests enable through the CC selected-write claim (CC.EN set); the manager resolves those pages from the live ledger (proof_buffers slots 0/1), validates the authored bases through validate_doorbell_scan (ScanKind::QueueArm), and authors AQA/ASQ/ACQ plus the CC.EN write itself. The provider never receives the host physical / device-visible queue-base address; CSTS.RDY=1 is observed only through brokered reads. This is the brokered model applied to the admin queue-arm; the steady-state SQ-tail doorbell over provider-written PRPs still needs the direct-remapping/synthetic-address lane above. Proof make run-pci-nvme; provenance docs/devices/nvme.md §6.

Provider-Side Isolation Versus Guest-Programmable Remapping

The decisive distinction for backend selection is between a DMA-capable surface and guest-programmable remapping authority:

SR-IOV (AWS ENA, Azure Accelerated Networking VF), a virtual NIC (gVNIC, virtio-net), a GPU/accelerator, or local NVMe identifies a device that does or could bus-master. This is a DMA-capable surface, not a safety property.
A direct-remapping classification requires a usable Intel VT-d, AMD-Vi, or Arm SMMU unit that the guest can discover, program, and validate, with translation/fault/invalidation behavior matching IOMMU Remapping Grounding. A DMA-capable surface alone never implies this.
Provider-side isolation facts (host-enforced VPC isolation, Nitro/host data-path bypass, hypervisor-side IOMMU) are support-policy assumptions a guest cannot prove from inside, not evidence that capOS can safely program direct DMA.

Runtime probing is authoritative for selecting the safe backend on a particular boot: capOS chooses from the device inventory, the remapping authority it can actually program, driver self-tests, and fail-closed probe results, and unknown or contradictory observations select the labeled bounce-buffer path or Unsupported. The cloud VM evidence matrix is the separate support-policy record for advertised targets and provider assumptions a guest cannot fully prove by itself; it does not override the boot-time runtime selection.

Candidate Trade-Off Analysis

Dimension	Direct remapping domain	Labeled bounce-buffer fallback	Unsupported
IOMMU coverage requirement	Requires a guest-programmable VT-d/AMD-Vi/SMMU unit capOS can program and validate per device.	None: used precisely when no usable guest IOMMU is exposed.	N/A: device stays unbound.
Cloud VM shape coverage (per inventory)	No probed GCE shape exposes a guest-programmable IOMMU; AWS/Azure shapes not yet probed. So no probed shape currently qualifies.	Indicated for shapes with a DMA-capable surface but no guest IOMMU (the probed GCE rows); fail-closed default for unproven shapes with a manager-ownable surface.	Ambiguous, contradictory, or unvalidated observations.
Per-operation cost	Translation only; no data copy. IOTLB/context-cache invalidation on teardown.	Copy between device-visible bounce pages and non-device memory on every transfer; Confidential VMs force this in hardware regardless.	None.
Hostile-smoke coverage today	Bounded QEMU Intel path only (`make run-iommu-remapping`, `ddf-iommu-remapping-production-closeout`); no cloud guest-IOMMU evidence.	S.11.2.7/8/9 rows enforced by the `make run-net` gate (`tools/qemu-net-smoke.sh`), with bounce-buffer virtio-net provider evidence in `ddf-provider-virtio-net-driver-closeout`; bounce-buffer `DMAPool` lifecycle by `make run-dmapool-grant`. The GCP-shape local binding precursor (`cloud-gcp-virtio-net-local-qemu-binding`) asserts, in both `make run-net` and `make run-ddf-provider-consumer`, that the enumerated/bound function matches the documented GCP 1st/2nd-gen virtio-net device surface (standard virtio-net, vendor `0x1af4`) and that the resolved backend is this labeled bounce-buffer path; it does not claim live GCP enumeration.	N/A.
Hostile-hardware isolation claim	Claimable only with per-domain remapping evidence and the IOMMU hostile smokes; not yet established for any cloud shape.	`not-claimed`: a malicious bus-mastering device without a trusted remapping domain can still write arbitrary RAM.	N/A.

The GCE live-probe rows in the evidence inventory record that every probed GCE shape (1st-gen n1, 2nd-gen e2, 3rd-gen Intel c3, and AMD-SEV Confidential n2d) boots with intel_iommu=off, DMAR: IOMMU disabled, SWIOTLB software bounce buffering, empty /sys/kernel/iommu_groups, and no DMAR/IVRS/IORT table; the Confidential shape forces bounce buffering as a memory-encryption invariant. These rows are support-policy expectations, not a hardcoded selection table: they describe what capOS’s runtime probe should expect to find on those shapes today, and they confirm that the fail-closed default lands on the labeled bounce-buffer path there. The boot-time probe, not this matrix, makes the binding selection on each boot, so a shape whose IOMMU exposure changes is handled by the probe re-evaluating rather than by editing this text. AWS and Azure shapes carry no live-probe evidence yet; the probe treats them the same as any other unproven platform and defaults to the bounce-buffer path.

Runtime Selection Rule (Fail-Closed Default)

On each boot, capOS probes the platform for guest-programmable remapping authority — IOMMU presence, programmability, and coverage for the DMA-capable functions it intends to bind — and selects the backend fail-closed:

Probe the platform. Discover DMA-capable functions, then test whether a usable Intel VT-d, AMD-Vi, or Arm SMMU unit is present, discoverable, and programmable, and whether its translation/fault/invalidation behavior passes the self-tests in IOMMU Remapping Grounding.
Select fail-closed. Select the direct remapping domain backend for a device only when the probe positively verifies a usable+safe IOMMU for that device. If the probe cannot verify it — IOMMU absent, not programmable, coverage unproven, self-test failed, or observations ambiguous — select the labeled bounce-buffer fallback for any DMA-capable surface the manager can keep manager-owned, or Unsupported when even that cannot hold. This probe-gated rule is the default: on an unproven platform the probe cannot verify, so direct DMA is not used and the bounce-buffer path is chosen.
There is no human in this loop. The machine decides per boot; the only external authority is the optional manifest override below.

This is the boot-time authority. The cloud VM evidence matrix above is the support-policy expectation of what the probe should find, not the decision itself.

Manifest Override Field (Operator Authority Lever)

An operator can override the runtime default for a deployment through one declarative, auditable enum field in the system manifest’s kernel parameters: the dmaBackendPolicy field of the SystemConfig struct in schema/capos.capnp. It is config, not a doc-signing ritual, and is not gated on any specific person. The field is absent by default, and the absent default applies the probe-gated runtime selection rule above. The enum values and their interaction with the probe result are:

Value	Probe verifies usable+safe IOMMU	Probe cannot verify	Notes
(field absent)	direct remapping domain	bounce-buffer fallback	Default: the probe-gated runtime selection rule. Direct-DMA when the probe verifies a usable+safe IOMMU, bounce-buffer fallback otherwise (fail-closed). Identical to `enable-if-verified`.
`enable-if-verified`	direct remapping domain	bounce-buffer fallback	The explicit, auditable form of the default. Probe-gated direct-DMA with fail-closed bounce-buffer fallback. Redundant with the absent default but kept for explicit configuration.
`enable-unsafe`	direct remapping domain	direct remapping domain	Force direct-DMA even when the probe cannot verify it. The operator takes responsibility for the platform’s DMA isolation; the value name carries the warning. Use only on a platform whose isolation is known-good out of band.
`bounce-buffer`	bounce-buffer fallback	bounce-buffer fallback	Pin the labeled bounce-buffer path and disable direct-DMA entirely, even where the probe would verify a usable IOMMU. The most conservative value.

Selection rules that hold for every value:

The absent default and enable-if-verified select direct DMA only when the probe verifies a usable+safe IOMMU, and otherwise fall back to bounce-buffer.
enable-unsafe is the sole value that can pick direct DMA without probe verification. The value name is the acknowledgement; there is no separate per-shape “I-accept-unverified” ceremony.
bounce-buffer never selects direct DMA, even where the probe would verify a usable IOMMU.
When the selected backend is Unsupported for a device (no manager-ownable DMA-capable surface at all), the device stays unbound regardless of the override value. The override governs direct-vs-bounce, not whether an unbindable device is forced online.

This selection mechanism is implemented. The dmaBackendPolicy capnp enum encodes an absent field as ordinal 0 (unspecified), which decodes identically to enable-if-verified; an unrecognized ordinal decodes fail-closed to the bounce-buffer path (never direct DMA) rather than failing the manifest parse or honoring the probe-gated default. The kernel resolves the backend on each boot from the IOMMU probe verdict and this override and emits a boot proof line of the form dma: backend selection dma_backend=<direct-remapping|bounce-buffer> dma_backend_override=<absent|enable-if-verified|enable-unsafe|bounce-buffer> probe_verified_usable_iommu=<bool>. The bounded QEMU shapes prove the probe-gated default end-to-end: make run-iommu-remapping (verifiable Intel VT-d shape) records dma_backend=direct-remapping dma_backend_override=absent, and make run-dmapool-grant (no usable IOMMU) records dma_backend=bounce-buffer dma_backend_override=absent. The override values and the unknown-ordinal fail-closed decode are covered by cargo test-config over the shared selection rule, capnp round-trip, and CUE decode.

The production cloud (non-qemu) build emits the same backend selection on the cloudboot harness’s serial-port-1 path as cloudboot-evidence: dma-backend bounce_buffer (in the harness token namespace), and on the bounce-buffer verdict drives the production-path bounce-buffer DMAPool + DMABuffer grant proof (kernel/src/cap/dmapool_bounce_buffer_grant_proof.rs, called from kernel::run_init): stage a parked manager-attached DMAPool over one DMA-capable PCI function from the inventory through device_manager::stage_bounce_buffer_dmapool_record (kernel/src/device_manager/stub.rs), allocate one bounded bounce-buffer DMABuffer through device_manager::issue_manager_attached_dmabuffer_handle_with_request (which calls device_dma::allocate_manager_attached_dmapool_bounce_buffer_page), assert the cap-info labels (userspace_dmapool=manager-issued-bounce-buffer, allocation=single-bounce-buffer-page, real_dma=not-attempted, direct_dma=blocked, host_physical_user_visible=0, iova_export=disabled-future-only), quiesce-before-release (release_dmapool_record_for_cap_release returns pending-buffer-release while the buffer is live), scrub-before-reuse (the released bounce-buffer frame is zeroed in place between scrub and frame-free), and stale-handle-after-detach, then emit cloudboot-evidence: dma-pool-grant <token> with shape <seg>.<bus>.<dev>.<fn>-pool.<slot>.gen.<gen>-phys.<hex> (every character inside the harness grammar [A-Za-z0-9._-]+). The proof stages a pool, allocates one bounce-buffer page, asserts the invariants, and emits a marker: it does not program real DMA, attach a queue, program interrupts, claim a device for sustained ownership beyond the grant, or emit provider-nic-bound / storage-bound. A future direct-remapping verdict skips the proof rather than aliasing direct-DMA onto the bounce-buffer assertion shape.

Implementation note, 2026-05-29 11:50 UTC: the production-cloud bounce-buffer stub now implements the cap-side DMABuffer.map(R+W) / DMABuffer.unmap admission and state-machine entry points (validate_dmabuffer_map_admission, record_dmabuffer_user_mapping, begin_dmabuffer_user_mapping_unmap, restore_dmabuffer_user_mapping, clear_dmabuffer_user_mapping in kernel/src/device_manager/stub.rs) so a userspace holder of a manager-issued DMABuffer cap can map the single bounce-buffer page read/write, write payload bytes through that mapping, unmap it, and observe DMAPool.info.mapped_vmas reflect the live mapping (make run-cloud-dmapool-grant). The same Absent -> Mapped -> Unmapping -> Absent state machine the QEMU side enforces governs the parked slot: a duplicate map fails closed with the DmaPoolLive shape, a clear before the cap-side aspace unmap returns DmaPoolTeardownEvidenceInvalid, an identity-mismatched begin/clear fails closed, and a post-freeBuffer map fails closed with the DmaBufferStaleHandle shape. The manager remains the single owner of the bounce-buffer page’s device-visible host-physical address and IOVA: the mapping does not expose either to userspace, real DMA stays not-attempted, direct DMA stays blocked, and IOVA export stays disabled-future-only. This is a local-QEMU proof of the userspace mapping path only; it does not unlock a live cloud NIC bind, IOMMU programming, or production direct-DMA authority.

Implementation note, 2026-06-03: Phase C slice 2 (cloud_virtio_net_userspace_ownable_vring_proof, make run-cloud-prod-nic-driver-userspace-ownable-vring) wires this landed bounce-buffer authority to a userspace virtio-net driver’s own vring without adding a new isolation backend. The driver allocates its descriptor / available / used ring pages through a granted DMAPool, and the kernel programs the virtio queue-address registers (queue_desc / queue_driver / queue_device) with the manager-owned bounce host-physical address. The brokered-address- publication model (above) holds: the driver never authors a device address. It writes an opaque per-buffer device-usable handle (exported through DMABuffer.info.deviceIova with scope bounce-handle, a deterministic non-address encoding of the buffer’s manager identity), and the kernel resolves that handle against the live grant ledger (device_manager::stub::resolve_virtio_vring_device_address) to the real host-physical address before any MMIO write. The no-host-physical-exposure invariant is preserved end-to-end: host_physical_user_visible=0, iova_export=disabled-future-only, and reads of the queue-address base registers are refused so the resolved address is never read back into userspace. Out-of-grant, host-physical-looking, and stale-generation handle writes fail closed, mirroring the NVMe doorbell-scan reject classes. queue_enable / DRIVER_OK stay fail-closed (slice 3); this is a local-QEMU proof of the userspace-ownable-vring path only, not a live cloud NIC bind.

Downstream-Contract Scaffolding

A cloud NIC/storage driver declares its chosen backend through the device-manager policy fields that already label the local paths in this design. The contract below scaffolds the values each candidate requires; it is the shape a driver preflight declares, not an authorization to enable any value.

Policy field	Direct remapping domain	Labeled bounce-buffer fallback	Unsupported
`direct_dma`	`enabled` for the programmed per-device domain	`blocked`	`blocked`
`trusted_domain`	manager-owned domain id for the device	`none`	`none`
`bounce_buffer`	not required (mapped IOVA path)	`required`	N/A (device unbound)
`remapping_tables`	`programmed`	`not-programmed`	`not-programmed`
`hostile_hardware_isolation`	claimed only with per-domain remapping evidence + IOMMU hostile smokes	`not-claimed`	N/A
`exported_device_addresses`	`iova-only`, domain-labeled	none (no host physical or IOVA exposed)	none

Each candidate must satisfy the following gates, mapped to the assurance-model invariants in DMA Assurance Model:

Direct remapping domain — mapping before publication; no page reuse before teardown, with mapping removal and observed invalidation completion ordered strictly before scrub/reuse; stale-handle and stale-completion fail-closed; domain-scoped aliasing only (per-device context entries where a peer domain cannot resolve another domain’s IOVA); no host-physical exposure (export only the domain-scoped IOVA, labeled as meaningless outside the domain); backend evidence explicit. The only landed remapping-domain evidence is the bounded QEMU Intel path in ddf-iommu-remapping-production-closeout, exercised by make run-iommu-remapping. A cloud shape needs its own guest-programmable remapping evidence before this candidate applies to it.
Labeled bounce-buffer fallback — direct_dma=blocked; all device-visible memory is manager-owned bounce pages; no host physical address and no generic MemoryObject/FrameAllocator authority is exposed to the driver; stale-handle, stale-completion, and exit-under-DMA teardown fail closed; hostile_hardware_isolation=not-claimed stays explicit. The landed evidence is the S.11.2.7/8/9 hostile-smoke rows enforced by the make run-net gate (tools/qemu-net-smoke.sh), with the bounce-buffer virtio-net provider evidence in ddf-provider-virtio-net-driver-closeout, plus the bounce-buffer DMAPool grant lifecycle in make run-dmapool-grant. As of ddf-real-dma-s112-hostile-smokes, the S.11.2.7 stale-IRQ-after-reset and S.11.2.8 stale-DMA-completion-after-reset closure summaries are asserted on the dedicated make run-dmapool-grant gate as well, so the DMA-grant gate fails closed on a hostile-row regression without depending on make run-net; exit-under-DMA (in-flight drain, scrub, page free) is enforced by make run-dmapool-grant-exit. See the “Fallback Policy For No Usable IOMMU Exposure” and “Hostile-Smoke Acceptance Matrix” sections in this document for the full gate text.
Unsupported — the device stays unbound or disabled; no driver-visible DMA, MMIO doorbell, interrupt ownership, or storage/network readiness claim is made. A serial boot line or PCI enumeration line is not readiness evidence.

The models/dma/ TLA+ and Alloy files, the extracted Kani core, and the focused Loom model are checked bounded evidence for these invariants. They supplement the QEMU/host evidence above; they do not satisfy a candidate hardware or cloud backend gate by the mere presence of model files. Each checked result records its tool version, command, bounds, and output in ../models/dma/README.md, and the CI placement is tracked in Security and Verification.

First QEMU Intel Remapping Smoke Acceptance Gate

Decision recorded 2026-05-14 09:07 UTC. The first slice that programs real Intel VT-d remapping state under QEMU has an explicit, bounded acceptance gate. This unblocks the ddf-iommu-qemu-intel-remapping-smoke task, whose ## Acceptance section carries the full gate; the summary here is the design-level decision.

The first slice programs the minimum Intel VT-d path for exactly one selected DMA-capable function under a pinned QEMU q35 -device intel-iommu,aw-bits=39 shape: one root entry, one context entry bound to a device-manager-owned domain ID and a 39-bit second-level page-table root, and a single second-level mapping from one device-visible IOVA page to one kernel-owned DMAPool page, plus the root-table-address register write and the global-command/global-status translation-enable handshake. Acceptance requires observable proof that the mapped IOVA was translated and that an out-of-domain IOVA faults closed in the fault-status/fault-recording registers.

Invalidation is part of the gate, not a follow-up. On revoke, device reset, driver death, and DMAPool page release, the slice must remove the second-level mapping and invalidate the relevant context-cache and IOTLB state through the selected invalidation interface, observe invalidation completion, and order page scrub/reuse strictly after that completion. This sits at the existing “remove IOMMU mappings” and “scrub and free DMA pages” steps of the DeviceOwnerState revocation order and must not be reachable before QueuesQuiesced or a completed Resetting transition.

Teardown ordering for the VT-d structural frames themselves (root table, context table, second-level tables) adds one requirement on top of the mapping/invalidation ordering above: no structural table frame may be returned to the frame allocator while translation is enabled and the root-table address register (RTAR) still latches it. When the revocation frees those frames — as opposed to holding them on a fail-closed path — it first disables translation (clear GCMD.TE, observe GSTS.TES clear) and re-latches the root-table pointer at a null address (RTADDR=0 then GCMD.SRTP, observe GSTS.RTPS), then scrubs and frees. This closes the window where the IOMMU could walk a freed (and potentially reallocated) frame as a live VT-d structure. In the QEMU Intel VT-d smoke this is proven by the invalidation proof line’s translation_disabled_before_table_free=true and root_table_pointer_cleared_before_table_free=true markers, which gate the success-path table-frame free; a failed teardown poll fails closed and holds the frames in the ledger. AMD-Vi / scalable-mode / interrupt-remapping teardown stays out of scope.

IOVA export discipline: host physical addresses stay hidden from userspace in every result cap, diagnostic line, and audit record. The selected QEMU Intel IOMMU-backed DMABuffer.info path may export only the domain-scoped IOVA for the live mapped buffer generation, explicitly labeled as meaningless outside that domain; fallback bounce-buffer paths keep IOVA export disabled.

Per-device domain granularity: the selected QEMU Intel path programs two distinct per-device context entries and second-level roots for two claimed DMA-capable functions under the same DRHD. Both domains may use the same IOVA, but the peer domain’s second-level walk is proven not to resolve that IOVA to the primary page; stale and wrong-owner domain assignment fail closed, and trusted multi-device sharing groups remain disabled.

The kernel-owned bounce-buffer fallback stays the path for VM shapes without usable remapping hardware and must remain explicitly labeled (fallback_policy=kernel-owned-bounce-buffer-only, remapping_tables=not-programmed, hostile_hardware_isolation=not-claimed); it must never be silently reinterpreted as direct-DMA or hostile-hardware isolation. The IOMMU-backed path adds stale-DMA-handle, stale-completion, descriptor-abuse, revoke/reset-race, teardown-under-DMA, cross-domain stale-handle, and fail-closed teardown branch hostile smokes while the existing bounce-buffer exit-under-DMA and stale-DMA evidence (the device-dma stale-handle and stale-completion proofs and the S.11.2.7/S.11.2.8 closure summaries) is preserved unchanged.

Explicitly out of scope for this first slice: AMD-Vi table programming; trusted multi-device sharing groups; scalable-mode context entries; interrupt remapping and device-IOTLB options; 48-bit IOVA space / 4-level tables; production NIC or storage driver ownership; userspace DMAPool direct-DMA authority; and moving the live virtio-net path off bounce buffers. The acceptance evidence is QEMU-only emulator evidence, not a hardware-isolation claim. The smoke adds or selects a focused make run-iommu-remapping gate asserted by tools/qemu-iommu-remapping-smoke.sh.

Implementation note, 2026-05-02 04:58 UTC: the ACPI discovery path recognizes DMAR and IVRS in the root table walk, reports absent/valid/malformed/unsupported state, records bounded table length/header facts, DMAR host address width and flags, IVRS IVinfo/flags, and bounded remapping-structure type counts. Malformed DMAR or IVRS structure lengths stop parsing, and unsupported shapes such as parser scan-cap overflow leave direct_dma=blocked with bounce_buffer=required.

Implementation note, 2026-05-02 05:31 UTC: the attachment-policy slice also retains DMAR DRHD include-all and bounded PCI endpoint device-scope metadata, including segment, single-hop BDF, and remapping-hardware register base, and reports each retained DMA-capable PCI function as IOMMU-attached/covered when that static table metadata covers its segment/BDF. Bridge and multi-hop scopes are diagnostic-only until PCI topology traversal can resolve them, and include-all fallback fails closed when retained DRHD units or scopes are capped. Functions without trusted static coverage are reported as uncovered; covered functions are reported as attached/covered, but both paths keep dma_policy=prototype-bounce-buffer-only, bounce_buffer_required, and blocked_direct_dma_devices because remapping domains are unsupported. The direct-DMA trusted-domain count remains zero, and userspace DMAPool, DeviceMmio, and Interrupt authority remain unavailable.

Implementation note, 2026-05-02 07:27 UTC: the domain-policy staging slice adds a pci: dma-domain policy proof line and a diagnostics mirror. The proof reports the future domain owner as the device manager, the domain granularity as per-device or trusted-sharing-group, exported device addresses as IOVA-only, host_physical_user_visible=0, direct_dma_trusted_domains=0, claimed_device_domains_ready=0, remapping_tables=not-programmed, remapping_domains=not-started, userspace DMAPool/DeviceMmio/Interrupt as not-started, and prototype devices as kernel-owned bounce-buffer-only. Malformed, unsupported, absent, or retained-capped metadata leaves direct DMA blocked; proof_result=ok is only evidence for that conservative blocked-direct-DMA policy.

Implementation note, 2026-05-09 18:47 UTC: the blocked-direct-DMA admission decision now lives in the pure capos-lib::device_authority validator next to the DMA/MMIO/IRQ handle validators. Host tests cover the current all-prototype bounce-buffer shape, fail-closed results if any direct trusted domain is claimed before the policy is ready, fail-closed results if the prototype bounce-buffer count does not cover every DMA-capable function, and the absent, malformed, unsupported, and retained-capped metadata labels. The kernel PCI proof line and diagnostics mirrors consume that pure decision while preserving the existing direct_dma=blocked, remapping_tables=not-programmed, domain_activation=not-started, and policy=blocked-direct-dma labels. This is IOMMU/remapping groundwork only; it does not program remapping tables, create trusted domains, expose host physical addresses, or enable production userspace DMA authority.

Implementation note, 2026-05-02 15:29 UTC: the COM1 devices diagnostics command now prints the same bounded DMA-domain policy facts without naming an owner identity. The line explains that all current DMA-capable prototype functions remain on direct_dma=blocked, bounce_buffer=required, direct_dma_trusted_domains=0, claimed_device_domains_ready=0, remapping_tables=not-programmed, exported_device_addresses=iova-only, host_physical_user_visible=0, and prototype_devices=kernel-owned-bounce-buffer-only. This is a diagnostics mirror for the current conservative policy, not evidence that IOMMU remapping domains or userspace DMA authority exist.

Implementation note, 2026-05-02 15:45 UTC: attached device-manager DMAPool records now store the current explicit bounce-buffer policy and the QEMU device-manager proofs read it back through the active device record plus the matching DmaPoolHandle. The logged policy scope is device-manager-attached-dmapool-bounce-buffer-policy, with direct_dma=blocked, bounce_buffer=required, trusted_domain=none, remapping_tables=not-programmed, remapping_domain=not-started, userspace_dmapool=not-started, host_physical_user_visible=0, and policy_bound_to_manager=true. This binds the conservative policy to current manager state; it still does not program remapping domains, expose userspace DMAPool, or perform real DMA mapping teardown.

Implementation note, 2026-05-11 00:00 UTC: attached device-manager DMAPool policy records now also carry an explicit manager-owned remapping-domain ledger staging record. The lifecycle and imported-live proofs report remapping_domain_ledger_scope=device-manager-attached-dmapool-remapping-domain-ledger, static_iommu_coverage=acpi-pci-diagnostic-only, remapping_domain_owner=device-manager, remapping_domain_granularity=per-device-or-trusted-sharing-group, remapping_domain_ledger=manager-owned-staging-record, remapping_domain_ready=false, and iova_export=disabled-future-only, while preserving direct_dma=blocked, remapping_tables=not-programmed, and host_physical_user_visible=0. This is a software ledger/readiness record only: capOS still does not program Intel VT-d/AMD-Vi/QEMU remapping tables, create a trusted direct-DMA domain, expose host physical addresses or IOVAs, or claim production hostile-hardware DMA isolation.

Implementation note, 2026-05-11 17:07 UTC: the same manager-owned remapping-domain staging record is now an explicit activation gate tied to the active DMAPool record and matching handle. The device manager validates that gate before current DMAPool policy/accounting and buffer issue paths proceed. The gate reports domain_ownership=manager-owned-active-dmapool, but keeps direct_dma=blocked because remapping_table_programming=not-programmed, iova_export=disabled-future-only, remapping_invalidation_policy=not-installed, remapping_iotlb_flush_policy=not-installed, and remapping_stale_mapping_cleanup=not-installed; the selected fallback remains remapping_fallback_policy=kernel-owned-bounce-buffer-only. The activation result is blocked-remapping-prerequisites-missing with remapping_activation_gate=fail-closed, remapping_activation_blocker=remapping-tables-not-programmed, and remapping_activation_side_effect=side-effect-blocked. This is a software policy gate and proof surface only: capOS still does not program Intel VT-d/AMD-Vi/QEMU remapping tables, create a trusted direct-DMA domain, export IOVAs or host physical addresses, remove real IOMMU mappings, flush IOTLB state, or prove IOMMU-backed hostile stale-DMA behavior.

Implementation note, 2026-05-12 18:49 UTC: the manager-owned staging record now includes a concrete per-device remapping-domain identity for the active DMAPool handle: claimed-device domain identity, staged single-device sharing group, BDF-derived device id, pool slot, pool generation, and owner generation. The activation preflight treats that identity binding as a prerequisite before direct DMA could be considered, and the QEMU lifecycle/imported-live proofs emit a dmapool remapping domain identity proof line. The direct-DMA blocker remains unchanged: remapping tables are still not-programmed, IOVA export is still disabled-future-only, invalidation, IOTLB flush, and stale mapping cleanup are still not-installed, direct DMA remains blocked, and the fallback remains kernel-owned-bounce-buffer-only.

Implementation note, 2026-05-12 21:19 UTC: the same manager-owned remapping-domain ledger now carries a separate mapping-lifecycle preflight record. The record is bound to the active DMAPool handle and claimed-device domain identity, and the existing device-manager policy gate validates it before accepting the current bounce-buffer attach/accounting/buffer-issue paths. Its direct-DMA result remains fail-closed with explicit blockers: IOVA space, mapping install, removal before page reuse, invalidation policy, IOTLB flush policy, and stale mapping cleanup are all not-installed. This is still an in-repo software preflight only; it does not program remapping tables, expose IOVAs or host physical addresses, enable direct DMA, remove real IOMMU mappings, flush IOTLB state, or prove IOMMU-backed hostile stale-DMA behavior.

Implementation note, 2026-05-12 22:26 UTC: capOS now has an Intel/QEMU remapping table scaffold that can represent a DRHD identity field, PCI segment/BDF/source ID, domain ID, QEMU Intel address-width choice, disabled root/context entries, and a second-level page-table-root placeholder. PCI diagnostics can bind that scaffold to discovered DRHD/segment metadata when ACPI/PCI discovery provides it. The disabled backend registry’s only accepted active state is disabled. The proof labels distinguish representability from programming: root-table pointer, context-entry programming, invalidation registers, fault registers, protected-memory registers, and invalidation queue remain not-written; remapping tables remain not-programmed; hardware programming remains not-attempted; direct DMA remains blocked; IOVA export remains disabled-future-only; and host physical addresses remain hidden from userspace. This is still not Intel VT-d, AMD-Vi, or QEMU IOMMU programming.

Implementation note, 2026-05-23 18:06 UTC: the first production DMAPool ledger integration for the QEMU Intel remapping path now maps the selected virtio-rng request-buffer IOVA to an active manager-owned DMABuffer page. Mapping install is admitted through the matching active DmaPoolHandle and DmaBufferHandle generations, stale pool and buffer generations fail closed, and wrong-owner mapping attempts are side-effect blocked. On teardown, the target second-level leaf is removed, the context-cache and IOTLB completion polls finish, and only then is the DMABuffer released through the production device-manager ledger. The proof keeps the IOVA internal, keeps host_physical_user_visible=0, keeps userspace IOVA export disabled for this slice, and leaves the no-remapping fallback policy as kernel-owned-bounce-buffer-only.

Implementation note, 2026-05-23 19:18 UTC: QEMU Intel remapping fault reporting now decodes VT-d FSTS plus FRCD[0] into a bounded kernel record for the faulting IOVA, reason, requester source ID, and DMA read/write type. The unmapped-IOVA, stale-handle, and stale-completion proofs record the fault, clear it with write-1-to-clear semantics, verify the clear-after-record state, and report source/IOVA match status without exposing host physical addresses. The COM1 devices diagnostics path now prints an IOMMU fault summary that reserves fault_summary=clean for a successful clear fault-status read and labels unavailable fault-status reads as unavailable/fail-closed. Owner identity, DRHD register bases, and host physical addresses stay hidden. The optional audit route is explicitly not wired for this slice (volatile_audit=not-routed).

Implementation note, 2026-05-13 15:29 UTC: active manager-owned DMAPool remapping preflight records now consume the same retained DMAR DRHD/requester metadata when PCI coverage is complete. The active disabled Intel/QEMU scaffold records the retained DRHD identity and requester segment/BDF/source ID for the bound pool handle, but absent, malformed, capped, unsupported, or uncovered metadata still leaves the scaffold not-bound and disabled. This is metadata-only binding: capOS still does not program root/context tables, install or remove mappings, invalidate remapping caches, flush IOTLB state, export IOVAs, enable direct DMA, or expose host physical addresses.

Implementation note, 2026-05-12 23:07 UTC: PCI diagnostics now include a separate Intel/QEMU remapping MMIO-status proof for the selected DMA-capable function. When complete retained DMAR DRHD metadata covers that function and the register base is page-aligned, capOS maps only the selected remapping-register page for bounded volatile diagnostic reads of the version, capability, global-status, root-table-address, and fault-status registers. The mapped label describes the diagnostic access pattern, not page-table write protection. When the default diagnostics shape has no DRHD, metadata is capped, or the retained DRHD base is invalid, the same proof reports mmio_window=not-mapped, mmio_read=not-attempted, unavailable capability/status/fault reads, and a fail-closed reason. The labels preserve remapping_tables=not-programmed, direct_dma=blocked, fallback_policy=kernel-owned-bounce-buffer-only, and hostile_hardware_isolation=not-claimed. This is not remapping-domain activation: capOS still does not write VT-d, AMD-Vi, or QEMU remapping registers, install root/context tables or invalidation queues, export IOVAs, or claim hostile-hardware DMA isolation.

Implementation note, 2026-05-13 01:20 UTC: active manager-owned DMAPool records now also carry a generic disabled IOVA ledger under the remapping-domain record. The ledger binds a domain-scoped reservation identity, hidden internal range metadata, and reservation generation to the active owner and pool generation. The same device-manager accounting path validates that ledger before current bounce-buffer allocation, descriptor submission, completion accounting, buffer free/page release, and pool release checks proceed. Pure tests and QEMU proof labels reject stale reservation generations and wrong owner generations as disabled-iova-ledger-stale with side effects blocked. The active state remains disabled: proof output keeps the internal range hidden and reports iova_base_user_visible=0, host_physical_user_visible=0, iova_export=disabled-future-only, direct_dma=blocked, mapping_install=not-installed, mapping_remove_before_page_reuse=not-installed, invalidation_policy=not-installed, iotlb_flush_policy=not-installed, and stale_mapping_cleanup=not-installed. This still does not program remapping tables, export IOVAs, expose host physical addresses, install or remove real IOMMU mappings, flush IOTLB state, or claim hostile-hardware isolation.

Implementation note, 2026-05-11 18:44 UTC: the selected userspace virtio-net TX provider smoke now grants a runtime-visible DeviceMmio notify BAR cap named notify_mmio, but keeps the active DMA posture unchanged. The provider still uses manager-owned bounce buffers, direct_dma=blocked, and host_physical_user_visible=false; the notify cap is a no-write MMIO admission boundary over the selected virtio-net TX notify offset, not a direct DMA or descriptor-ring ownership transfer. The selected submit path validates descriptor authority and scrubs the bounce page before consuming the grant-derived notify policy, and it proves wrong value, wrong offset, stale handle, and stale generation block before any doorbell. This does not program IOMMU/remapping tables, export IOVA or host physical addresses, mutate the real virtio-net descriptor ring, or claim production NIC isolation.

Implementation note, 2026-05-11 19:05 UTC: the notify_mmio grant remains runtime-visible but is now explicitly no-direct-MMIO as well as no-write. DeviceMmio.map and DeviceMmio.read32 for that provider notify cap return typed blocked results before any user VMA or register read, DeviceMmio.info validates the live mapping generation before accepting the provider-notify record, and notify_mmio detach clears the submit-path notify policy so a later selected submit reports stale-handle blocking with no doorbell write. The submit path also invalidates the cached notify policy on owner-generation transitions and stale/missing cap-release detach boundaries before accepted no-write authority can be reported.

Implementation note, 2026-05-11 19:56 UTC: the same provider smoke now grants a runtime-visible Interrupt cap named tx_interrupt for the selected virtio-net TX MSI-X route snapshot. This extends the grantable authority boundary without changing the DMA posture: the provider still uses manager-owned bounce buffers, direct DMA and IOMMU remapping remain blocked, and the interrupt cap only proves generation-checked admission for info/ack/unmask/wait/mask plus waiter cancellation and stale-after-release blocking. Later bounded follow-ups add selected TX MSI-X vector-control mask/unmask; this first grant slice did not deliver provider IRQs to userspace, acknowledge or mask/unmask hardware, ring a doorbell, or mutate real virtio-net descriptor rings.

Implementation note, 2026-05-11 20:09 UTC: provider TX waiters are now separate no-delivery waiter-table entries rather than generic delivery waiters. A pending tx_interrupt.wait remains pending across TX route delivery-count advancement and only completes through the explicit mask/release cancellation path. The staged provider TX grant source also tracks a live-issued cap and refuses another live tx_interrupt alias for the same route snapshot.

Implementation note, 2026-05-12 09:13 UTC: the selected userspace virtio-net TX provider path now performs one bounded real descriptor/avail publication on queue 1 while keeping the DMA posture conservative. The descriptor points at the manager-owned bounce buffer already governed by the DMABuffer record; the submit path validates live buffer identity, scrubs before publication, requires the live no-write notify_mmio policy, asserts that the descriptor page is ledgered to the virtio-net TX queue, and blocks wrong queue, stale notify policy, and a real stale DMABuffer.submitDescriptor(queue=1) attempt at the stale capability/liveness boundary before touching the real ring. The proof logs descriptor/avail/used ring physical addresses for kernel evidence, but those addresses are not returned to userspace. Because this slice does not claim real used-ring/CQ completion, the published page remains pinned in the manager in-flight record; userspace completeDescriptor, freeBuffer, post-publication remap, and cap-release drain do not retire, remap, or release it. A follow-up rings one selected TX virtqueue notify doorbell after the same descriptor authority, submit-scrub, live notify_mmio policy, submit-effect write, and publication gates, while wrong-queue and stale-notify or stale-DMABuffer paths remain not-written. Readback-mismatch publication failures do not write the immediate doorbell and are treated as possibly-published ring state that quiesces later TX notification and keeps the manager buffer pinned rather than claiming rollback. Pre-publication bounce-page metadata remains doorbell_write=not-written. Any immediate used-ring or IRQ effect from that doorbell is recorded only as an out-of-scope hardware side effect. Later 2026-05-12 follow-ups advanced the selected path to a bounded used-ring completion handoff: DMABuffer.completeDescriptor validates the live manager-attached buffer and in-flight descriptor id, observes the real TX used ring for the stored software descriptor generation, consumes that entry through the existing descriptor tracker and DMA ledger, and only then clears the manager in-flight record. As of commit e248d42b (2026-05-23 13:36 UTC), kernel TX helpers stay quiesced after provider ownership starts while the provider path can publish the full selected TX queue-depth window of eight descriptors before the first completion; the smoke records live_inflight_after_submits=1/2/3/4/5/6/7/8 (the ninth allocation rejected dmapool-budget-exceeded), blocks map/free/reuse while any buffer is in flight, and proves wrong-order descriptor 7 used-ring handling preserves the observed descriptor 0 completion for its matching generation. The provider-facing tx_interrupt waiter is a runtime-visible completion-event consumer for the same selected route; delivery validates the expected TX source id, source generation, route generation, owner, driver-unmasked state, and live issue id before completing each bounded event. A 2026-05-13 follow-up adds the bounded incomplete-descriptor teardown drain: when one descriptor has completed and seven remain incomplete, release retires only the incomplete descriptors’ allocation-backed TX DMA ledgers and clears only the selected virtqueue descriptor/used-ring tracking needed for those releases, while CQ publication and provider IRQ delivery stay blocked and the pending waiter remains undelivered until the smoke explicitly cancels it through the existing mask/cancel path. Commit e248d42b (2026-05-23 13:36 UTC) extends that drain to the full selected TX queue-depth window and keeps the completed descriptor’s buffer retained until it is explicitly freed. A later 2026-05-13 remediation binds each provider TX in-flight descriptor to the submission-time provider issue/source/route generation. If that old descriptor completes after tx_interrupt release/regrant, DMABuffer.completeDescriptor fails closed as dmabuffer-provider-tx-stale-issue before consuming the used ring, publishing provider CQ/IRQ state, or advancing provider acknowledgements; later cap release may still drain the descriptor as teardown-only. tx_interrupt.wait posting is serialized with provider release, mask, and delivery, and stale issue ids fail closed at admission and insertion. A later 2026-05-13 follow-up lets tx_interrupt.acknowledge account exactly one already observed selected TX dispatch token paired with one delivered provider CQ event; the smoke proves pre-event, duplicate, teardown-drain, masked-route, reset/regrant, stale-after-release, and stale issue acknowledgements fail closed before delivery-count, route-state, CQ, ack-ledger, or hardware-dispatch-ack mutation. This is still bounded selected-route evidence: provider IRQ ownership, deferred EOI, LAPIC/MSI-X acknowledgement, direct DMA, IOMMU mapping, full virtio-net ownership, production NIC/storage migration, and cloud readiness remain open. Commit e248d42b (2026-05-23 13:36 UTC) adds release-time retirement for delivered but unacknowledged bounded provider TX CQ events: the release proof now records seven pending provider completion acks retired from the ledger in one live issue while preserving the separate stale-bound in-flight descriptor proof, with stale post-release ack revoked and no hardware ack claimed. A 2026-05-13 follow-up adds bounded selected TX MSI-X mask/unmask only: live provider tx_interrupt.mask and unmask toggle the selected TX vector-control bit plus route state, preserve generations and delivery counts, and block stale issues before side effects.

Implementation note, 2026-05-02 19:43 UTC: the bounded zero-live device-manager DMAPool lifecycle proof now treats its manager-attached DMA buffer record as teardown-blocking metadata. The pool detach path still checks authoritative live accounting first, then rejects zero-live detach while the proof buffer is attached as dmapool-buffer-attached. Before the active free path, the proof validates stale same-slot and wrong-identity FreeBuffer operations through capos-lib::device_authority::validate_dma_buffer_operation. The wrong identity cases cover wrong owner generation, wrong pool slot, wrong pool generation, and wrong buffer slot; each records dmabuffer-stale-handle, the exact validator reason (stale-owner-generation, wrong-pool, stale-pool-generation, or wrong-slot), side-effect-blocked, and a preserved manager-owned buffer record. The stale same-slot case continues to record stale-slot-generation and buffer_stale_free_preserved=true, then the proof observes that pool detach still fails as dmapool-buffer-attached. The proof clears the gate only after validating a proof-scoped active FreeBuffer operation, scrubbing and freeing the kernel-owned proof frame, and detaching the manager-owned buffer record. This remains lifecycle evidence only: no userspace DMAPool or DMA-buffer authority is exposed, no physical address or IOVA is exposed, and S.11.2 hostile stale-DMA smokes remain open.

Implementation note, 2026-05-03 02:31 UTC: the same zero-live device-manager DMAPool lifecycle proof now validates manager-record CompleteDescriptor authority for the attached DmaBufferHandle. Active completion validation records buffer_active_complete_result=ok; freed-buffer, reused-slot generation, and stale-after-revoke completion attempts fail closed as dmabuffer-stale-handle with exact pure-validator reasons and side-effect-blocked. This is manager-record validation evidence only: it does not complete a real descriptor, publish a completion queue entry, grant a userspace DMABuffer, run real DMA, or clean up or reuse production userspace DMA pages.

Implementation note, 2026-05-14 14:05 UTC (DDF IOMMU remapping Slice A1): the first slice that programs real Intel VT-d remapping state under QEMU has landed. The ## First QEMU Intel Remapping Smoke Acceptance Gate above defines the full bounded gate; that gate is being delivered as a sequenced A1/A2/B/C split (the slice was correctly scoped as bigger than one reviewable unit). This note records Slice A1.

Pinned QEMU shape: qemu-system-x86_64 8.2.2, -machine q35, -device intel-iommu,aw-bits=39 (3-level second-level page tables, 39-bit IOVA space). The kernel iommu module (kernel/src/iommu.rs, cfg(qemu)-only) selects one DMA-capable function that is not the live virtio-net bounce-buffer path (virtio-rng under the default smoke shape), allocates a root table, one context table, a 3-level second-level page table, and one mapped DMA page; encodes and writes one root entry, one context entry (binding the requester source id to a domain id, aw-bits=39, and the second-level root), and the second-level table-pointer / leaf entries through the HHDM; writes the Root Table Address Register; and runs the global-command/global-status SRTP-then-TE handshake, polling the status register for each step. The capability-register extended-capability IRO field (IOTLB register offset) is decoded and reported for Slice B’s benefit. MMIO ordering invariants are enforced with SeqCst fences: between the last in-memory table-entry write and the RTAR write, between the RTAR write and GCMD.SRTP, and between the latched root pointer and GCMD.TE.

Slice A1 proves translation with kernel-side structural evidence only, which the gate’s IOVA-export-discipline clause explicitly permits. Hardware confirms translation-enabled (GSTS.TES + GSTS.RTPS polled set), the written entry words are read-back-verified through the HHDM, the pure capos_lib::device_authority validator accepts the layout, and the unmapped IOVA’s 3-level walk structurally terminates at a non-present entry. The proof labels are scrupulously honest: mapped_iova_translated=structural (not hardware-dma), unmapped_iova_fault=structural-not-present (not observed), proof_evidence=kernel-side-structural. A real hardware-DMA translation and a real fault-status fault require driving a device virtqueue through the IOMMU; that is deferred to follow-on task A2 (a virtio-rng virtqueue driver as the DMA proof vehicle). Invalidation + IOTLB flush with completion polling (Slice B) and the IOMMU-backed stale-handle / stale-completion hostile smokes (Slice C) are also follow-on slices; at A1 their proof lines emit proof_result=deferred-next-slice. A2, B, and C have since landed — see the implementation notes below.

The table pages are recorded in a bounded ledger modeled on the device-manager DMAPool page-accounting discipline (allocate-record, scrub-before-free on the fail-closed path); mapping removal with IOTLB-flush-ordered scrub/free is Slice B. IOVA export stays disabled (iova_export=disabled-this-slice), no host physical address is user-visible (host_physical_user_visible=0), and no hostile-hardware isolation is claimed (hostile_hardware_isolation=not-claimed). The kernel-owned bounce-buffer fallback is unchanged for QEMU shapes without usable intel-iommu hardware and is emitted with the explicit fallback_policy=kernel-owned-bounce-buffer-only / remapping_tables=not-programmed labels. The new make run-iommu-remapping gate is asserted by tools/qemu-iommu-remapping-smoke.sh; make run-net, make run-dmapool-grant, and make run-diagnostics continue to prove the fallback path unchanged.

Implementation note, 2026-05-14 15:19 UTC (DDF IOMMU remapping Slice A2): the device-DMA proof vehicle has landed, upgrading the A1 structural proof to a real hardware-DMA proof and closing the literal hardware-DMA text of gate part

After the VT-d tables are programmed and GCMD.TE is set, a minimal virtio-rng virtqueue driver — split into a mapped-DMA phase (crate::virtio::prove_iommu_rng_mapped_dma) and an unmapped-DMA phase (crate::virtio::prove_iommu_rng_unmapped_dma) — drives the device QEMU exposes on the intel-iommu shape. The second-level table now installs four leaf entries inside one shared L1 page: the request buffer plus the three virtqueue ring pages (descriptor table, available ring, used ring). The driver programs the device’s QUEUE_DESC / QUEUE_DRIVER / QUEUE_DEVICE registers and the request descriptor’s addr field with the programmed IOVAs, never the host-physical page addresses. Because VT-d translation is global per requester once GCMD.TE is set, every DMA the device issues — every ring access and the entropy write — must walk the second-level table. A used-ring completion plus a non-zero buffer reading therefore proves a real hardware DMA reached the kernel page through the programmed IOVA translation: mapped_iova_translated=hardware-dma, proof_evidence=virtio-rng-hardware-dma.

The driver then publishes a second descriptor whose addr is the deliberately-unmapped IOVA and kicks the device. The device’s DMA to that IOVA raises a real VT-d translation fault; the kernel reads it back out of the Fault Status Register (FSTS.PPF), the first Fault Recording Register’s fault bit (FRCD[0].F at the decoded CAP.FRO offset), and the faulting page address recorded in FRCD[0] — which must equal the unmapped IOVA the device was pointed at — and reports unmapped_iova_fault=observed with the fault_recording_reason code. The fault gate is purely the VT-d register surface: whether QEMU’s virtio-rng still pushes the faulting descriptor onto the used ring afterward (it does, with the entropy write dropped) is QEMU device behavior, reported as the unmapped_descriptor_uncompleted diagnostic field but deliberately not a gate condition. The fault registers are cleared (write-1-to-clear) before the device DMA and again after the observed-fault read so no stale fault is mistaken for the proof and no fault is left for a later VT-d consumer. The MMIO discipline reuses A1’s NO_CACHE mapping and the descriptor/available-ring writes are SeqCst-fenced before the notify doorbell. The two phases are deliberately split so the kernel reads the fault registers strictly between them — the unmapped-IOVA descriptor is never in flight while the mapped-DMA result is judged. The virtio-rng device negotiates VIRTIO_F_ACCESS_PLATFORM, which is what makes QEMU route its DMA through the platform IOMMU rather than treating the ring registers as host-physical addresses; the run-iommu-remapping make target therefore creates the virtio-rng device with iommu_platform=on (a target-scoped override of the shared QEMU_SECOND_DEVICE, which no other run target needs because none of them drives virtio-rng DMA). The IOMMU-backed hostile smokes (Slice C) were a follow-on at A2 (proof_result=deferred-next-slice) and have since landed — see the Slice C implementation note below; IOVA export stays disabled and no host-physical address is user-visible.

Implementation note, 2026-05-14 17:19 UTC (DDF IOMMU remapping Slice B): the invalidation + IOTLB flush + invalidation-ordered scrub/free has landed, closing gate part 2. After the A2 hardware-DMA proof, kernel/src/iommu.rs runs a revocation cycle (run_invalidation_revocation_cycle) that models the device-manager DeviceOwnerState revocation FSM at the QueuesQuiesced -> Resetting -> DmaMappingsRemoved -> Dead steps. The cycle removes the four second-level leaf entries the A2 layout installed (request buffer + the three virtqueue ring pages), SeqCst-fences so the in-memory removal is visible to the IOMMU before the flush, then issues two register-based invalidations: a context-cache invalidation through the Context Command register (CCMD_REG at offset 0x28, CCMD.ICC set with CCMD.CIRG global granularity, polling CCMD.ICC clear for completion) and a domain-selective IOTLB invalidation through the IOTLB register at the A1-decoded CAP.IRO offset + 8 (IOTLB.IVT set with IOTLB.IIRG domain-selective granularity and the domain id in IOTLB.DID, polling IOTLB.IVT clear and reading IOTLB.IAIG back non-zero to confirm the request was serviced). Both completion polls are bounded by the same VTD_STATUS_POLL_BUDGET the A1 status handshakes use.

The hard ordering invariant — the whole point of the slice — is that the eight VT-d ledger-owned table/ring/used pages and the separate production DMAPool-owned request-buffer page are scrubbed and returned to their ledgers strictly after both completion polls return. A SeqCst fence sits between the completion reads and the scrub/free so the ordering is explicit in program order. A poll that exhausts its bounded budget fails closed: invalidation_completed is false, the pages are deliberately not freed (a page reused while hardware may still hold a cached translation through it is a stale-DMA hole), the ledgers keep them accounted rather than leaked-and-forgotten, and the proof line reports proof_result=fail-closed. Slice B uses register-based invalidation only: no GCMD.QIE queued-invalidation bit is set, so the A1 single-bit-GCMD discipline (correct only by minimalism — no other persistent GCMD bit set) still holds and the GCMD-reconstruct boundary is not crossed. The production DMAPool programming-abort path follows the same rule: if VT-d programming fails before root/translation state can expose the DMAPool page to hardware, the prepared DMABuffer/DMAPool records are detached; if the mapping may already be hardware-visible, the partial VT-d ledger is carried through the same leaf-removal and invalidation teardown before any VT-d table/ring page or production DMAPool page can be reused. The make run-iommu-remapping smoke and tools/qemu-iommu-remapping-smoke.sh now assert the invalidation proof line as proof_result=ok with mapping_removed=true context_cache_invalidated=true iotlb_flushed=true iotlb_actual_granularity_nonzero=true invalidation_completed=true page_reuse_ordered_after_invalidation=true table_pages_live_after=0 invalidation_interface=register-based-ccmd+iotlb, and forbid both a regression to the Slice-A deferred label and any invalidation_interface=queued value. The IOMMU-backed stale-handle / stale-completion hostile smokes (Slice C) were the deferred follow-on; they have since landed (see the note below), and as part of that work the single-phase run_invalidation_revocation_cycle was refactored into a two-phase run_target_revocation_phase + complete_revocation_teardown so the hostile re-drive can sit between the phases — the Slice B contract (every page freed strictly after its mappings are invalidated) is unchanged, and the combined invalidation proof line still asserts proof_result=ok for the complete teardown. IOVA export stays disabled and no host-physical address is user-visible.

Implementation note, 2026-05-14 19:13 UTC (DDF IOMMU remapping Slice C): the IOMMU-backed hostile stale-DMA smokes have landed, closing gate part 5 and the parent IOMMU remapping task. Closing the slice required refactoring the Slice B revocation into two phases so the hostile re-drive can run against a partially revoked remapping — the original single-phase cycle freed every page (including the virtio-rng descriptor table and available ring) before any hostile re-drive could observe a fault, so the re-driven device read an all-zero descriptor and issued no DMA at all. The kernel/src/iommu.rs ledger now classes each page by revocation-phase role: Target (request buffer + used ring — what the device’s DMA lands on), RingInfra (descriptor table + available ring — what the device reads to issue a DMA), and Table (root/context/second-level tables). run_target_revocation_phase removes the Target second-level leaves, invalidates the context-cache + IOTLB, and frees the Target pages — while deliberately keeping the RingInfra + Table pages mapped and live. run_hostile_stale_dma_cycle then re-drives the same still-live old-generation virtio-rng device through the new crate::virtio::prove_iommu_rng_stale_dma (each re-drive uses a fresh available-ring index past the A2 phases so the device sees a genuinely new descriptor): because the ring-infra pages are still mapped, the device reads a valid descriptor whose addr is a revoked target IOVA, so the DMA faults in the IOMMU (FSTS.PPF + FRCD[0].F, recorded faulting address is the stale IOVA) instead of reaching memory. A stale mapping-install attempt is refused (attempt_stale_mapping_install — the RevokedRemapping token is a dead-domain receipt with no live table handle, not install authority), and the freed Target page reads back as the scrubbed zeros. A second re-drive at the revoked used-ring IOVA faults too, publishes no device-written used-ring CQ entry into the freed page, exposes no memory to a would-be new owner, and makes no freed page eligible for reuse. complete_revocation_teardown then finishes the Slice B teardown by revoking + freeing the RingInfra + Table pages; the combined gate-part-2 invalidation proof line (invalidation_phases=target-then-ringinfra) still asserts proof_result=ok for the full two-phase teardown, with the same hard ordering invariant in each phase (pages freed strictly after that phase’s invalidation completion polls return; a poll that exhausts its budget fails closed and the phase’s pages are not freed). The load-bearing observation is the revoked translation state — not device cooperation, not a software ledger drop — blocking the stale DMA, confirmed by the VT-d fault registers plus a freed-page-stays-scrubbed read-back through the HHDM. Existing bounce-buffer stale-DMA evidence (the device-dma S.11.2.7/S.11.2.8 proofs) is preserved unchanged; the IOMMU-backed hostile smokes are strictly additive. The make run-iommu-remapping smoke and tools/qemu-iommu-remapping-smoke.sh now assert both hostile proof lines as proof_result=ok and forbid regression to the deferred, not-reached, or fail-closed labels. IOVA export stays disabled (iova_export=disabled-this-slice), no host-physical address is user-visible (host_physical_user_visible=0), and no hostile-hardware isolation is claimed (hostile_hardware_isolation=not-claimed).

Implementation note, 2026-05-23 (domain-scoped IOVA export): the selected QEMU Intel production DMAPool path now exposes the mapped request-buffer IOVA through DMABuffer.info only while the matching active DmaBufferHandle generation is live. The schema fields are deviceIova, deviceIovaScope=domain-scoped-iova, deviceIovaMeaning=meaningless-outside-domain, and iovaExport=domain-scoped-only; the production remapping proof asserts that deviceIova=0x200000 matches the installed second-level mapping. After the buffer is freed and the pool is released, an export attempt on the same handle fails closed with side-effect-blocked. The bounce-buffer grant path still reports deviceIova=0, deviceIovaScope=none, deviceIovaMeaning=not-exported, and iovaExport=disabled-future-only.

Implementation note, 2026-05-23 21:34 UTC (production-path hostile smokes): the selected QEMU Intel path now emits and asserts iommu-remapping: production dmapool hostile proof over the active manager-owned DMAPool / DMABuffer ledger. The proof ties the raw VT-d stale-handle and stale-completion faults to the production mapped IOVA, synthetic stale pool/buffer generation mismatch candidates, post-teardown stale-handle export failure, and per-device cross-domain boundary. It covers stale IOVA after revoke/reset, descriptor abuse, revoke/reset race ordering, stale completion after reset, teardown-under-DMA ordering, and cross-domain stale-handle attempts; no second-level entry is installed for stale authority, no CQ entry is published, no new-owner memory is exposed, and page reuse stays ordered after invalidation completion. It does not claim a process-exit trigger for the IOMMU path; the existing make run-net bounce-buffer evidence remains the exit-under-DMA source. The same smoke asserts the complete_iommu_dmapool_mapping_teardown prerequisite-false return and the hold_iommu_dmapool_mapping_ledger_after_abort path as fail-closed branch evidence. Existing bounce-buffer S.11.2 evidence from make run-net and make run-dmapool-grant is preserved unchanged.

Implementation note, 2026-05-26 05:55 UTC (direct-DMA posture transition for the selected QEMU Intel path): the closeout slices above landed the full mechanism — real hardware DMA over a manager-owned DMAPool DMABuffer page mapped through the per-device IOMMU domain, domain-scoped IOVA export, per-device domains, and the production hostile matrix — but deliberately deferred the headline direct_dma=enabled claim behind iova_export=disabled-this-slice. The selected QEMU Intel path now emits iommu-remapping: direct-dma posture real_dma=attempted direct_dma=enabled remapping_tables=programmed trusted_domain=<domain-id> descriptors_reference=domain-scoped-iova mapped_page_source=manager-owned-dmabuffer mapping_installed_before_doorbell=true invalidated_before_page_reuse=true bounce_buffer=not-required exported_device_addresses=iova-only host_physical_user_visible=0 hostile_hardware_isolation=not-claimed proof_result=ok. Every field is computed from the real proof facts, not asserted as a constant: remapping_tables=programmed requires the root/context/second-level entries written plus the SRTP/TES handshakes; real_dma=attempted requires the virtio-rng device’s mapped DMA to have completed through the programmed IOVA (hardware_dma_translation_proven); direct_dma=enabled additionally requires the manager-owned DMAPool mapping to have been installed before the device doorbell; and invalidated_before_page_reuse folds in the two-phase revocation’s page_reuse_ordered_after_invalidation and invalidation_completed results. This is bounded QEMU-emulator evidence, so hostile_hardware_isolation stays not-claimed (real hostile-hardware isolation needs real hardware, not QEMU). The no-IOMMU run-net / run-dmapool-grant bounce-buffer fallback is untouched: it keeps direct_dma=blocked with no IOVA export, and make run-iommu-remapping now forbids this path from regressing to the bounce-buffer fallback proof or to a blocked/not-attempted posture. The contract table in “Downstream-Contract Scaffolding” (direct-remapping domain: direct_dma=enabled, remapping_tables=programmed, exported_device_addresses=iova-only) is now backed by an emitted, asserted posture on the selected path.

Authority Model

Device authority is split into three independent capabilities:

DMAPool: authority to allocate, expose, and revoke device-visible memory within a kernel-owned physical range or IOMMU domain.
DeviceMmio: authority to map and access one device’s register windows.
Interrupt: authority to wait for and acknowledge one interrupt source.

Holding one of these capabilities never implies the others. A driver needs all three for a normal device, but the kernel and init can grant, revoke, and audit them separately.

Production Handle Epoch Invariants

All three object families use opaque handles whose identity is checked against kernel-owned records before every operation. A raw object id is never enough to authorize DMA, MMIO, interrupt waits, acknowledgements, descriptor submission, or teardown. A handle is accepted only when all of these facts match in the same ownership transaction:

the object id resolves to a live record of the expected type;
the handle’s device owner generation matches the current device-manager owner record;
the handle’s pool, mapping, slot, source, or route generation matches the current reusable subrecord;
the record state permits the requested operation.

The exact ABI shape may change when the capability surface is implemented, but production handles must carry the equivalent identity:

#![allow(unused)]
fn main() {
struct DmaPoolHandle {
    device_id: u32,
    owner_generation: u64,
    pool_id: u32,
    pool_generation: u64,
}

struct DmaBufferHandle {
    device_id: u32,
    owner_generation: u64,
    pool_id: u32,
    pool_generation: u64,
    slot: u32,
    slot_generation: u64,
}

struct DeviceMmioHandle {
    device_id: u32,
    owner_generation: u64,
    bar: u8,
    mapping_id: u32,
    mapping_generation: u64,
}

struct InterruptHandle {
    device_id: u32,
    owner_generation: u64,
    source_id: u32,
    source_generation: u64,
    route_generation: u64,
}
}

Object identity fields have distinct jobs:

DMAPool handles name the claimed device, the device owner generation, and the pool record generation. Buffer handles issued by the pool repeat the device-owner and pool identity and additionally name a buffer slot and slot generation. The pool identity prevents a handle from crossing devices or owners; the slot identity prevents a freed or reused buffer slot from accepting an old descriptor, free, or completion.
DeviceMmio handles name the claimed device, owner generation, BAR or subrange mapping record, and mapping generation. The physical range, cache attributes, and access policy remain in the kernel record and are not user-editable handle fields.
Interrupt handles name the claimed device, owner generation, source record, source generation, and route generation. Waiter records may carry their own waiter generation internally, but they must be invalidated whenever the source or route generation changes.

Owner generations and subrecord generations are intentionally separate. The device owner generation belongs to the device-manager ownership record and invalidates every DMAPool, DeviceMmio, and Interrupt handle for the old owner when ownership is revoked, transferred, reset, or reassigned. Pool, buffer-slot, MMIO-mapping, interrupt-source, and route generations belong to records that may be reused below the device owner. They prevent stale buffer, mapping, route, waiter, and completion handles from matching a newly allocated subrecord even when the device id or pool id is reused.

Every epoch is non-wrapping for authority purposes. Implementations must use an epoch width that cannot wrap during the object’s lifetime, or permanently retire the exhausted device, pool, slot, mapping, source, or route record. Epoch exhaustion is a closed allocation or reassignment failure; it must never wrap back to a value that could match an old handle.

Generation mismatch, wrong object type, wrong device owner, freed slot, detached source, revoked mapping, and wrong device-owner state are hard closed results. The failed operation must not program a descriptor, ring a doorbell, perform an MMIO write, unmask or acknowledge an interrupt, wake a waiter, publish a CQE, decrement completion accounting, free a page for reuse, or mutate the device ledger except for bounded failure accounting or audit metadata.

Transfer, revoke, reset, and reassignment are ordered around those epoch checks:

Transfer: The old owner leaves Active, the owner generation advances, and old handles become invalid before a new owner receives handles. A transfer may preserve hardware state only after old interrupt notifications, MMIO write authority, and DMA submissions are either quiesced or represented by old-generation ledger entries that the new owner cannot consume.
Revoke: The device manager invalidates user-visible handles first, then follows the revocation order below: MMIO write authority removed, interrupts masked or detached, queues quiesced or reset, mappings removed, and pages scrubbed before release.
Reset: Reset or disable advances the owner generation and any affected source, route, pool, mapping, and buffer-slot generations before new handles can be issued. If old DMA writes cannot be proven stopped, buffer slots stay unavailable until reset completion and mapping invalidation prove reuse is safe.
Reassignment: Interrupt sources, MMIO mappings, and DMA pool records are detached or unmapped, their subrecord generations advance, pending waiters or completions are drained or marked stale, and only then can a new owner receive authority for the reused source, mapping, or slot.

Handle reuse rules:

stale handles fail closed;
freed-handle reuse fails closed;
reallocated slots must not restore authority to old handles;
old interrupt waiters must not observe or acknowledge a new owner’s interrupt source;
old DMA handles must not reference a newly allocated buffer in the same slot.

Production proof obligations are split between host tests and QEMU smokes. Host tests must cover the pure validator and state-machine cases: stale owner generation, stale pool or mapping generation, stale buffer slot, stale interrupt source or route, wrong owner, wrong device, wrong object type, freed object, wrong state, epoch exhaustion/retirement, and no side effects on failure. QEMU smokes must prove the hardware-facing ordering: stale DMA handles after free/reuse cannot submit descriptors, stale DMA completions after revoke/reset cannot publish CQEs or mutate reused buffers, stale MMIO handles cannot ring doorbells after revoke, stale interrupt waiters or acknowledgements cannot wake or affect a new owner, and process-exit or driver-crash teardown reaches a zero-live ledger before pages are reused. These production handles and proofs remain open; the current QEMU scratch proofs are prerequisite evidence for this contract, not completion of it.

Implementation note, 2026-05-02 13:18 UTC: capos-lib::device_authority now implements the bounded pure host-testable validator prerequisite for these handle epoch invariants. The module models the documented handle and record identity fields for DMAPool, DMA buffer, DeviceMmio, and Interrupt, separates device-owner generations from pool, slot, mapping, source, and route generations, returns explicit fail-closed error labels, blocks the relevant side-effect class on validation failure, and refuses epoch wrap or retired epoch reuse. This does not expose production userspace handles, wire kernel device paths, attach budget/OOM policy to real handle creation, or complete the QEMU stale-handle or S.11.2 hostile-smoke gates.

Implementation note, 2026-05-03: the pure host-test operation matrix now enumerates every current validator operation variant: DMAPool::{AllocateBuffer,IssueBufferHandle}, DMABuffer::{SubmitDescriptor,CompleteDescriptor,FreeBuffer}, DeviceMmio::{Map,Read,Write,RingDoorbell,Unmap}, and Interrupt::{Wait,Acknowledge,Mask,Unmask}. Each row asserts active acceptance plus stale owner/subrecord, freed, revoked, and retired failures with the exact blocked side-effect class for that operation. This remains ABI-independent host-test evidence only; it does not create production userspace handles or replace the QEMU stale-handle and S.11.2 hostile-smoke gates.

Implementation note, 2026-05-02 13:43 UTC: the current kernel device-manager DMAPool lifecycle and imported-live accounting proofs now adapt their BDF, owner generation, pool slot, and pool generation into capos-lib::device_authority records. The QEMU proof records active validator success, stale-after-revoke failure as dmapool-stale-handle, the validator reason stale-owner-generation, and side-effect-blocked. This is still a bounded kernel-proof adapter, not production userspace handle exposure, DeviceMmio/Interrupt handle wiring, production page cleanup, or S.11.2 hostile smoke completion.

Implementation note, 2026-05-02 17:04 UTC: the current kernel device-manager also has a bounded manager-owned DeviceMmio record proof adapter. The record carries BAR, mapping id, mapping generation, and owner generation fields, and the QEMU virtio-rng device-manager path validates a RingDoorbell operation through capos-lib::device_authority. After revoke begins, the old handle fails through the pure validator as stale-owner-generation, records devicemmio-stale-handle plus side-effect-blocked, and no doorbell write is attempted. The lifecycle proof blocks RevokingHandles -> MmioRevoked while the record is attached, then allows the transition after bounded detach. This does not expose production userspace DeviceMmio authority, program real BAR mappings, create mapping objects, or complete hostile stale-MMIO smokes.

Implementation note, 2026-05-02 17:29 UTC: the bounded DeviceMmio adapter now derives the proof mapping from the first decoded PCI memory BAR on the tested PciDevice through the shared BAR-region validator. The attached manager-owned record carries that BDF/BAR/base/length metadata and validates that it is the same BDF, a memory BAR, nonzero length, and the same BAR named by the handle before constructing the pure capos-lib::device_authority record. The QEMU smoke asserts region_source=pci-decoded-memory-bar, region_bound_to_manager=true, bar_present=true, bar_memory=true, bar_base, and bar_length. This is still prerequisite evidence only: it does not create userspace DeviceMmio handles, program real MMIO mappings, enforce cache attributes or write policy, or write a real doorbell.

Implementation note, 2026-05-02 17:54 UTC: the same bounded DeviceMmio adapter now records fail-closed malformed-region evidence before the positive attach. Wrong-BDF metadata, wrong BAR/handle mismatch, and zero-length region metadata all report devicemmio-region-invalid, with region_invalid_mapping=not-created and region_negative_side_effect=side-effect-blocked; the proof still records real_mmio_mapping=not-programmed and real_doorbell=not-written. This is bounded manager-proof evidence only. It does not create userspace DeviceMmio handles, map real BAR pages, enforce cache attributes or write policy, complete hostile stale-MMIO smokes, or perform a real doorbell write.

Implementation note, 2026-05-02 20:14 UTC: the bounded DeviceMmio adapter now stores future mapping policy metadata on the attached manager-owned record and reads it back through the active record plus matching DeviceMmioHandle. The proof line records policy_scope=manager-attached-devicemmio-cache-write-policy, cache_policy=device-uncacheable, page_table_protection=capability-scoped-device-nx, write_policy=claimed-registers-and-doorbells-only, executable=blocked, userspace_devicemmio=not-started, host_physical_user_visible=0, policy_bound_to_manager=true, and policy_result=ok. A tampered cache/write policy record fails closed with policy_tamper_result=fail-closed, policy_tamper_mapping=not-created, and policy_tamper_side_effect=side-effect-blocked. This is still metadata proof only: no PAT/MTRR or page-table programming is performed, no userspace DeviceMmio handle is created, no real BAR mapping object exists, and no doorbell is written.

Implementation note, 2026-05-02 20:45 UTC: while the same bounded manager-owned DeviceMmio record is still active, the proof now validates hostile RingDoorbell handles through a proof-scoped adapter that uses the already-attached record rather than manager lookup short-circuits. Wrong owner generation, wrong mapping generation, wrong mapping id, wrong BAR, and wrong BDF/device fail closed with exact pure-validator reasons stale-owner-generation, stale-mapping-generation, wrong-mapping, wrong-bar, and wrong-device. Each records side-effect-blocked; the proof also records that the attached manager record is preserved, no fake mapping is created, and no doorbell is written. This remains bounded proof evidence only: production userspace handles, real MMIO mappings, real cache/write-policy enforcement, and hostile stale-MMIO smokes remain open.

Implementation note, 2026-05-03 00:36 UTC: the schema and kernel now include a result-only DeviceMmio.info skeleton that can wrap a manager-issued DeviceMmioHandle. The object validates the live device-manager record through validate_devicemmio_record() before returning status labels such as userspaceDeviceMmio=manager-issued-skeleton, managerRecord=validated-active, realMmioMapping=not-programmed, realDoorbell=not-written, hostPhysicalUserVisible=false, directMmio=blocked, registerRead=blocked, registerWrite=blocked, and bootstrapGrant=blocked. The QEMU device-manager lifecycle proof constructs that cap object while the attached record is active, records devicemmio_cap_info_result=ok, exercises the serialized CapObject::call(0, &[]) path and decodes the returned DeviceMmio.info Cap’n Proto result as devicemmio_cap_serialized_call_result=ok, then verifies the same cap fails closed after revoke begins as devicemmio_cap_stale_after_revoke_result=devicemmio-stale-handle; the same stale object also fails the serialized method-0 path as devicemmio_cap_serialized_stale_after_revoke_result=invoke-failed. A later manifest-grant smoke explicitly releases the granted DeviceMmio cap through CAP_OP_RELEASE and proves a subsequent typed DeviceMmio.info call fails closed from userspace. A focused grant-cycle smoke now repeats that grant, release, and stale-info proof twice in sequence and asserts the second manager-grant-source acquire receives a fresh mapping generation after the first release; the same smoke also decodes both acquire/release cycles through the typed volatile HardwareAuditLog.snapshot surface. The focused hardware-audit interrupt-waiter smoke also decodes recent boot-time DmaBuffer, DmaPool, and Interrupt driver-crash / reset-disable lifecycle records from the current volatile 16-record snapshot window. The same smoke now uses the startSequence cursor to decode older retained DeviceMmio lifecycle rows that the default latest 16-record tail has skipped. A 2026-05-10 15:33 UTC manifest-grant follow-up turns DeviceMmio.map from admission-only into a read-only userspace VMA over the boot-preseeded BAR page already used by brokered read32. The typed smoke validates the active DeviceMmioOperation::Map authority check, rejects writable, executable, unknown, zero-size, unaligned, out-of-BAR, and overflow requests with typed no-side-effect results, reads the same QEMU BAR value through the returned userspace address and DeviceMmio.read32, rejects a duplicate active map, explicitly calls DeviceMmio.unmap, proves a second unmap is a typed no-op, remaps, and proves stale unmap fails closed after cap release. Release/drop/driver-crash/reset-disable cleanup revokes any borrowed user VMA before detaching the manager record. This is read-only BAR VMA evidence only: it does not add writable MMIO, volatile register writes, doorbells, host physical/IOVA exposure, post-userspace kernel MMIO mappings, IOMMU programming, durable/signed audit persistence, concurrent sharing semantics, or a production provider-driver consumer. A 2026-05-10 20:06 UTC follow-up promotes DeviceMmio.write32 to one bounded brokered volatile dword write through that same boot-preseeded kernel MMIO cache after active manager-attached handle, owner/state, policy/region, pure Write authority, dword alignment, decoded-BAR range validation, and a single provider-scoped claim derived from PCI MSI-X metadata, including BDF, BAR, BAR base, offset, and value. The focused proof writes the claimed virtio-rng MSI-X entry-0 vector-control mask dword, reads it back through both brokered read32 and the read-only userspace VMA, then proves an unclaimed message-address dword write leaves the original value unchanged. Invalid range and unclaimed calls remain typed no-write results, while stale or released handles fail closed before any write and do not return a write32 result payload. This does not add writable userspace BAR mappings, arbitrary register writes, doorbells, host physical/IOVA exposure, post-userspace arbitrary remaps, IOMMU programming, or a production provider-driver consumer.

A 2026-07-16 02:58 UTC follow-up (security-devicemmio-prod-grant-bar-selection-policy) tightens which device the production (not(feature = "qemu")) DeviceMmio grant source binds and which BAR offsets its brokered read32 exposes. Previously the default cloudboot build staged ProdGrantClass::AnyFunction, so the grant bound to the first enumerated PCI function carrying any >=4 KiB memory BAR – under QEMU the emulated VGA framebuffer at 0000:00:02.0, a device the provider was never intended to touch, and with no exclusion of an MSI-X table page from the read surface. The grant source now consults a manifest-authored SystemConfig.deviceMmioTarget PCI identity (mirroring blockDeviceTarget): when present it binds the brokered DeviceMmio grant to that exact enumerated function (segment:bus:device.function), failing closed (devicemmio-target-selector-no-match, no grant staged) when no enumerated function matches, the matched function has no usable memory BAR, or the function’s MSI-X layout is not knowable (so the MSI-X window is not provably excluded). A layout is not knowable in any of three cases: an enumerated MSI-X capability whose header fails to parse, a capability chain the walker could not follow to its terminator, or more than one enumerated MSI-X capability. The second case matters because the enumeration is device-controlled: a chain that ends in a malformed next pointer, a cycle, a reserved capability id, or an over-long walk hides whatever follows the break, so “no MSI-X capability was enumerated” is proof of “no MSI-X table in the BAR” only when the enumeration itself was complete. The third case matters because the recorded exclusion carries exactly one table range and one PBA range: a conforming function presents exactly one MSI-X capability, so a second one describes a device-authored layout the single-range exclusion cannot cover, and folding both into it would leave the earlier table readable. A function whose Status register declares no capability list at all is a complete answer and binds normally. Independently, the grant source computes the bound device’s MSI-X table / PBA sub-windows from its MSI-X capability and records them on the staged mapping; read_devicemmio_u32 refuses a brokered read that lands in an MSI-X window (devicemmio-read32-msix-window-excluded, read_performed=false, no volatile access) so a provider granted a BAR that overlaps the MSI-X table can read the non-MSI-X portion of its authorized device’s BAR but never the MSI-X message-address/data or pending bits. The run-cloud-devicemmio-grant proof binds the virtio-net function 0000:00:03.0 (Red Hat 0x1af4:0x1000) on its first usable memory BAR (BAR1, which carries the MSI-X table at offset 0x0 and PBA at 0x800), asserts the bound BDF/BAR match the manifest identity, that brokered reads into the MSI-X table and PBA windows fail closed while a non-MSI-X offset reads normally, and emits selection=manifest-bound msix_read32=refused. The hostile capability chains cannot be presented by an emulated QEMU device, so the same proof additionally walks synthetic config-space images through the live walker and classification as a bounded boot-time fixture (devicemmio-grant-source-prod: msix-layout-fixture proof_scope=bounded-synthetic-config-chain-selftest case=... check=ok). Each case asserts the walk outcome the chain produces, whether the walk reaches an MSI-X header the image physically contains, and the resulting bind / fail-closed verdict – including the case this closes, an MSI-X capability at config offset 0x60 that a malformed chain hides from the walk. That holds the property under assertion rather than leaving it to the coincidence that the kernel’s own MSI-X programming path walks capabilities the same way. This is a selection-policy + read-surface tightening only: no new mapping authority, no writable MMIO, no DMA/interrupt authority, and the class-based fallback is retained for feature-carrier proof manifests that do not author a deviceMmioTarget.

Refusing a whole function on a truncated walk is a fail-closed availability trade, and the fixture above cannot price it: synthetic chains prove the refusal fires, not that any real device trips it. A device with a sloppy but harmless capability chain would lose its brokered DeviceMmio grant (devicemmio-target-selector-no-match) and disclose nothing. To make that cost observable rather than assumed, the production inventory walks every enumerated function through the same pci::walk_chain the classification consumes and emits its outcome as cloudboot-evidence: capability-walk <token>, which tools/cloudboot/run-test.sh records in provider.json as enumerated_capability_walks plus the capability_walk_truncations subset (the functions a boot would refuse).

Two GCE runs on 2026-07-16 – the default e2-small shape (run 1784207318-0a8f, 13:08 UTC) and a --nic-type=GVNIC shape (run 1784207698-629a, 13:14 UTC) – enumerated seven functions each with no truncated walk on either. Every function either reached a next == 0 terminator – virtio-SCSI 1af4:1004, virtio-rng 1af4:1005, and, depending on the shape, either virtio-net 1af4:1000 or the GCE gVNIC 1ae0:0042, each walk.complete with MSI-X id 0x11 reached – or declared no capability list at all (8086:1237, 8086:7110, 8086:7113, and the virtio balloon 1af4:1002). The reserved-id refusal is therefore retained: no GCE-presented function terminates its chain on a reserved id, so tolerating that shape would weaken the MSI-X-layout-unknown property for no device that is known to exist.

That evidence is narrower than “real hardware”, and the gap is worth naming. GCE presents a guest a virtualized i440FX/PIIX3 topology, so six of the seven functions per run are device classes QEMU also emulates and the local run-cloud-devicemmio-grant gate already walks; the gVNIC 1ae0:0042 is the only enumerated function outside that set. Every complete chain observed was one capability deep (caps.1), i.e. shallower than the six-capability modern-virtio chain the local gate walks. What remains uncovered is the device class most likely to stress the walker: a passthrough or physical function with a deep, vendor-capability-laden chain. Such a device is covered only once its own boot emits this marker.

A 2026-05-26 07:32 UTC follow-up (ddf-userspace-writable-devicemmio-interrupt) proves the cross-authority non-implication required by the gate (“holding one authority must not imply either of the others”). Each grant smoke’s granted CapSet is its sole authority source: the DeviceMmio smoke holds only console + device_mmio and the Interrupt smoke only console + interrupt. The smokes assert that the other DDF grants (interrupt/device_mmio and dmapool) are absent from the CapSet, and that the held cap cannot be reinterpreted as another interface because the kernel-delivered interface id is fixed at grant time (negative-authority ... result=ok lines, with the kernel “spawned … 2 caps” structural counterpart). This is a non-implication proof over the already-landed authorities; it adds no new kernel MMIO/IRQ/DMA surface. Real userspace wait/acknowledge over the live route with deferred LAPIC EOI is proven separately by the provider tx_interrupt/rx_interrupt consumer (make run-ddf-provider-consumer).

`DMAPool` Invariants

DMAPool is the only future userspace-facing authority that may cause a device-visible DMA address to exist.

Authority: A holder may allocate buffers only from the pool object it was granted. It may not request arbitrary physical frames, import caller virtual memory by address, or derive another pool.
Handle identity: A pool operation checks the claimed device id, owner generation, pool id, and pool generation before changing pool state. Buffer operations additionally check the buffer slot and slot generation before descriptor validation, completion accounting, free, scrub, or reuse.
Physical range: Every exported device address must resolve to pages owned by the pool. The kernel records the allowed host-physical page set and validates every descriptor mapping against that set before a device can use it. If an IOMMU domain backs the pool, the exported address is an IOVA, not raw host physical memory.
Ownership: Each DMA buffer has one pool owner, one device-domain owner, and explicit CPU mappings. Sharing a buffer with another process requires a later typed memory-object transfer; copying packet data is the default until that object exists.
No raw grants: Userspace never receives an unrestricted host-physical address. A driver may receive an opaque DMA handle or an IOVA meaningful only to its DMAPool/device domain. It cannot turn that value into access to unrelated RAM.
Residency: DMA pages are committed before exposure to the device, resident for the entire device-visible lifetime, unswappable, and scrubbed before reuse by another owner.
Bounds: Buffer length, alignment, segment count, and queue depth are bounded by the pool. Descriptor chains that point outside an allocated buffer, wrap arithmetic, exceed device limits, or reference freed buffers fail closed before doorbell writes.
Revocation: Revoking the pool first quiesces the device path using it, prevents new descriptors, waits for or cancels in-flight descriptors, then removes IOMMU mappings or invalidates bounce-buffer handles before freeing pages.
Reset: If in-flight DMA cannot be proven stopped, revocation escalates to device reset through the owning device object before pages are reused.
Residual state: Pages returned from a pool are zeroed or otherwise scrubbed before reuse by a different owner. Receive buffers are treated as device-written untrusted input until validated by the driver or stack.

Device-visible memory authority is not ordinary MemoryObject authority. FrameAllocator and MemoryObject must not become raw physical-address escape hatches. A future shared-buffer transfer may share CPU-visible packet bytes after validation, but it does not by itself grant IOVA creation, descriptor programming, or device write authority.

For the in-kernel QEMU smoke, the kernel is the only DMAPool holder. The same invariants apply internally even though no userspace capability object is exposed yet.

Implementation note, 2026-04-24: the initial virtio-net transport uses kernel-owned frame-allocator pages for RX/TX split-virtqueue descriptor, available, and used rings plus the one-shot TX descriptor proof buffer, ARP TX buffer, ICMP TX buffer, smoltcp adapter/TCP TX buffers, and posted RX packet buffers. The smoltcp adapter copies completed RX frame bytes out of those device-written pages before handing them to the stack. Those pages are programmed into the device only by kernel code after modern PCI transport discovery and feature negotiation; no userspace process receives a DMA buffer, physical address, or BAR mapping.

Implementation note, 2026-05-02: the current QEMU virtio-net DMA path routes those kernel-owned pages through a bounded device_dma pool ledger. The net smoke proves live pool bytes, page counts, page-rounded MMIO mapping bytes, config/RX/TX interrupt holds, RX/TX ring depths, and RX/TX submission/completion and in-flight descriptor accounting. This is the first kernel-owned DMAPool accounting proof; it does not expose userspace DMA, MMIO, or interrupt handles and does not complete the production S.11.2 hostile-smoke gate.

Implementation note, 2026-05-02 06:59 UTC: the kernel-owned device_dma ledger now has an explicit bounded budget/OOM policy for the current virtio-net proof path: 32 DMA pages, 131072 DMA bytes, queue depth 8, submission depth 8, four page-rounded MMIO mapping holds, 16384 MMIO mapped bytes, and three interrupt holds. make run-net emits a scratch-ledger device-dma: budget oom proof ... proof_result=ok line proving page and byte allocation over budget, overlarge queue depth, duplicate and over-budget MMIO holds, MMIO byte over budget, duplicate and over-budget interrupt holds, and descriptor submission beyond queue depth all fail closed without mutating the live virtio-net ledger; the proof also revalidates the live ledger. This is a bounded prerequisite for the production DMAPool contract. It still does not expose userspace DMAPool, DeviceMmio, or Interrupt handles, wire real lifecycle hooks, program IOMMU remapping domains, or close the S.11.2 hostile smoke matrix.

Implementation note, 2026-05-02 16:31 UTC: attached device-manager DMAPool records now also carry that budget profile. The lifecycle and imported live-accounting proofs read page, byte, queue-depth, submission-depth, MMIO mapping, MMIO byte, and interrupt-hold budgets through the active device-manager record plus the matching DmaPoolHandle. Queue and submission depth remain per-queue limits; the manager proof records queue_count=2 plus derived aggregate in-flight/submission budgets and checks imported virtio-net accounting against those aggregate totals. This keeps the budget policy tied to manager state but still does not create userspace DMAPool handles or enforce budgets at userspace handle creation, transfer, or revoke.

Implementation note, 2026-05-02 21:13 UTC: the zero-live device-manager DMAPool lifecycle proof now validates a proof-scoped tampered AttachedDmaPoolBudgetPolicyRecord through the manager budget-policy helper while the attached pool record is active. The tampered record uses the wrong policy scope/source/label plus stricter page, byte, queue, in-flight, MMIO, interrupt, and submission budgets, and it fails closed before it can be treated as a usable policy. The QEMU proof records budget_policy_tamper_result=fail-closed, budget_policy_tamper_allocation=not-created, budget_policy_tamper_ledger=not-mutated, budget_policy_tamper_teardown=not-advanced, and budget_policy_tamper_side_effect=side-effect-blocked. This is still bounded metadata proof only: no userspace DMAPool handle is exposed, no production userspace DMA page is allocated, freed, or reused, and no real DMA teardown is claimed.

Implementation note, 2026-05-02 22:16 UTC: the manager-owned DMAPool budget-accounting proof now fails closed on accounting over the attached budget instead of only logging passive booleans. The positive zero-live and imported-live budget_*_within_policy=true labels call a helper that revalidates the active attached record, matching DmaPoolHandle, owner, active state, and attached budget policy before accepting the record accounting. While the zero-live record remains active, synthetic attached-accounting candidates exceed buffer count, page count, byte count, and the current aliased in-flight/submission total; each candidate fails closed before it can be treated as usable manager state. The QEMU proof records exact overrun reasons, no fake allocation, no ledger mutation, no teardown advancement, and side-effect blocking. A proof-scoped over-budget attach candidate now fails before pool generation allocation and records preserved generation state. This remains bounded manager-record evidence only; later grant slices add the first single-page bounce-buffer allocation/free authority, but multi-buffer allocation, DMA mapping, descriptor execution, IOMMU programming, production driver consumption, and S.11.2 hostile smokes remain open.

Implementation note, 2026-05-02 23:59 UTC, updated 2026-05-03: the schema and kernel now include a result-only DMAPool.info skeleton that can wrap a manager-issued DmaPoolHandle. The object validates the live device-manager record through validate_dmapool_record() before returning status labels such as userspaceDmaPool=manager-issued-bounce-buffer, realDma=not-attempted, hostPhysicalUserVisible=false, and directDma=blocked. The QEMU zero-live device-manager lifecycle proof constructs that cap object while the attached record is active, records dmapool_cap_info_result=ok, exercises the serialized CapObject::call(0, &[]) path and decodes the returned DMAPool.info Cap’n Proto result as dmapool_cap_serialized_call_result=ok, then verifies the same cap fails closed after revoke begins as dmapool_cap_stale_after_revoke_result=dmapool-stale-handle; the same stale object also fails the serialized method-0 path as dmapool_cap_serialized_stale_after_revoke_result=invoke-failed. The same proof now exercises DMAPool.allocateBuffer through call_with_table() on a real DmaPoolCap in a CapTable: it decodes bufferIndex=0, verifies CAP_CQE_TRANSFER_RESULT_CAPS, cap_count=1, the transfer-result record’s DMABUFFER_INTERFACE_ID, the non-transferable same-session result-cap hold, and DMABuffer.info through the returned result cap. Duplicate allocation and stale-after-revoke allocation both fail closed without adding result caps. The duplicate-active valid-size path now reports a structured schema result with result=dmapool-already-attached, reason=active-buffer-attached, sideEffect=side-effect-blocked, and bufferPresent=false; the duplicate path also preserves the manager generation counter. Invalid-size requests use the same no-result-cap response shape with result=dmapool-allocation-request-invalid, the exact request reason, sideEffect=side-effect-blocked, and bufferPresent=false. As of the 2026-05-08 DMAPool grant-source follow-up, the same bounded path is also available through KernelCapSource::DmaPool: the grant source attaches a fresh zero-live manager-owned pool record, stages matching zero-live release evidence, and lets the child mint one DMABuffer result cap. A 2026-05-09 release-order follow-up has the smoke explicitly release the parent DMAPool before the result DMABuffer; the parent detach remains pending until the DMABuffer frees the page and completes the staged zero-live pool detach. A later 2026-05-09 follow-up adds a typed DMABuffer.freeBuffer method for that bounded result cap: the method reuses the same FreeBuffer authority validation and page scrub/ledger/frame-free cleanup path as cap release, emits a free-buffer audit event, invalidates later DMABuffer.info, and makes the later cap release a no-op detach. A second bounded follow-up keeps the parent DMAPool live after that first explicit free, reallocates the same slot with a fresh slotGeneration, and then repeats the parent-first release proof on the second buffer. The focused read-side HardwareAuditLog.snapshot smoke also decodes both slot generations, both typed free-buffer records, the parent DmaPool release, and both no-op release-after-free records through the volatile audit cap. The run-net DMABuffer driver-crash and reset-disable proofs also cover the same pending-parent completion path so successful buffer cleanup cannot orphan the staged parent release. A later admission follow-up adds typed DMABuffer.submitDescriptor to the same manifest-granted bounce-buffer path: the method validates the active manager-attached buffer epoch through the existing DmaBufferOperation::SubmitDescriptor authority validator, echoes the queue/descriptor/length and generation identity, and proves the same call fails closed after freeBuffer revokes the old cap. A later symmetric follow-up adds typed DMABuffer.completeDescriptor to the same bounded path. The 2026-05-10 request-shaping follow-up routes both typed descriptor calls through a shared pure bounded descriptor validator: valid bounce-buffer requests return ok request labels plus queue_count=4, descriptor_count=8, and buffer_bytes=4096, while out-of-range queues/descriptors, zero submit lengths, submit lengths beyond the bounce buffer, and completion lengths beyond the bounce buffer fail closed as dmabuffer-descriptor-request-invalid with side-effect-blocked before any descriptor side effect. A later manager-accounting follow-up records only the bounded manager counter: submit returns manager-inflight-recorded and raises DMAPool.info live_inflight to 1, completion returns manager-inflight-completed and restores it to 0, and a valid completion with no outstanding submission returns dmabuffer-no-inflight-submission. Too-small descriptor result buffers are rejected before accounting mutation, and cap-table release drains bounded in-flight accounting before detaching the bounce buffer. The 2026-05-10 06:37 UTC follow-up makes this allocateBuffer/freeBuffer page lifecycle the first production-labeled single-page bounce-buffer allocation/free authority. The typed surfaces report userspaceDmaPool=manager-issued-bounce-buffer, allocation=single-bounce-buffer-page, recordPool=userspace-bounce-buffer-live, zero-live-dmapool-bounce-buffer, and freeBuffer=bounce-buffer-page; the underlying device_dma ledger uses a manager-attached bounce-buffer helper and scrubs before frame free. A 2026-05-10 11:44 UTC follow-up extends that bounded manifest-granted path to two fixed manager-owned slots; a 2026-05-10 12:49 UTC follow-up extends the same path to three fixed manager-owned slots: slot 0, slot 1, and slot 2 can be live together, DMAPool.info reports three live buffers/pages while all are attached, a fourth allocation fails closed as dmapool-already-attached, and slot 0 can be freed and reused with a fresh generation while slots 1 and 2 remain live. There is still no allocation beyond those three fixed slots, real device-visible DMA mapping, host physical address or IOVA exposure, BAR mapping, production descriptor-ring mutation, CQ publication, IOMMU programming, or production driver consumer. Stale allocation attempts preserve the live backing, and page allocation failure occurs before buffer-generation allocation so it does not burn a generation. The 2026-05-10 13:45 UTC follow-up (3bbeb3d4) adds explicit typed DMABuffer.unmap for the mapped bounce-buffer userspace VMA. The method validates the live DMABuffer handle before reporting success or no-op, removes only the borrowed VMA owned by that mapping for the calling process, and publishes the mapping as absent only after the borrowed-range ownership check, page-table unmap, TLB wait, and waiter cleanup succeed. While teardown is in progress, concurrent map/free/release paths fail closed against an in-progress mapping state. A second unmap returns dmabuffer-mapping-absent / no-user-mapping with no side effect. This is userspace VMA cleanup only: it does not free or scrub the bounce page, detach the buffer record, change DMAPool.info live buffer/page/in-flight counts, program or remove real DMA or IOMMU mappings, expose host physical/IOVA addresses, mutate descriptor rings, publish CQ entries, or add a production driver consumer. The 2026-05-10 14:12 UTC follow-up moves bounded descriptor accounting from a single pool-global descriptor identity to per-slot state on each live manager-owned DMABuffer record. DMAPool.info live_inflight remains the aggregate sum across live slots. A valid submit on slot 0 and a valid submit on slot 1 can coexist; duplicate submit on either same slot still fails closed; mismatched completion preserves that slot’s descriptor without touching other slots; matching completion of slot 0 decrements the aggregate while slot 1 remains in flight; explicit freeBuffer of an in-flight slot fails closed; and cap-release/process-exit cleanup drains only the releasing slot’s descriptor before detach. This is still bounded manager accounting and does not mutate descriptor rings, publish CQ entries, expose host physical or IOVA addresses, attempt direct DMA, program IOMMU state, or add a production driver consumer. The 2026-05-10 18:11 UTC follow-up makes the single manager-owned bounce-buffer page exclusive between userspace borrowed-VMA ownership and manager in-flight descriptor ownership for each DMABuffer cap. A valid submit while the same cap still has a live mapping fails closed as dmabuffer-mapping-live / user-mapping-live before manager in-flight accounting changes; explicit unmap restores submit. A valid map while the slot has an in-flight descriptor fails closed as dmabuffer-inflight-submission / in-flight-submission, returns addr=0, and does not publish a borrowed VMA; matching completion restores map for that slot. The lock order remains cap mapping state before manager validation, with no address-space lock held across manager state mutation. The 2026-05-10 19:29 UTC follow-up adds bounded completion data on that same manager-owned bounce-buffer page. The successful matching DMABuffer.completeDescriptor path keeps the existing manager-inflight-completed result labels, validates the active owner, pool/slot generation, queue/descriptor identity, and submitted length, then writes a deterministic byte pattern into only the accepted completionLength bytes before clearing the in-flight record. Because submit is blocked while a cap-owned borrowed VMA is live and map is blocked while the slot is in flight, the write happens while no live user VMA exists for that slot; a later successful map lets userspace read the pattern. Invalid requests, stale caps, no-inflight completions, descriptor mismatches, length-exceeded completions, mapped-live cases, and after-free calls do not write. This is userspace-visible bounce-buffer completion data, not device DMA completion: there is still no descriptor-ring mutation, CQ publication, direct DMA, host physical/IOVA exposure, IOMMU programming, or production driver consumer. Implementation note, 2026-05-10 04:40 UTC: duplicate-active valid-size and invalid-size DMAPool.allocateBuffer requests now use schema result data for domain rejection instead of an application-exception label. The no-result-cap response reports either result=dmapool-already-attached / reason=active-buffer-attached or result=dmapool-allocation-request-invalid with the exact request reason, plus sideEffect=side-effect-blocked and bufferPresent=false, before any allocation side effect. The 2026-05-10 live-accounting follow-up also carries that bounded frame into the attached DMAPool record exposed by typed DMAPool.info: the manager record starts as zero-live-dmapool-bounce-buffer, changes to userspace-bounce-buffer-live with one, two, or three live 4096-byte pages while manager-attached DMABuffer result caps exist, and returns to zero-live after typed DMABuffer.freeBuffer or cap release scrubs/releases every live bounce page. The same lifecycle proof validates that active manager-record accounting against the attached budget policy before treating allocation as usable state. This still does not expose a device-visible DMA address, IOVA, host physical address, production descriptor side effect, DMA mapping, or production driver consumer.

Implementation note, 2026-05-11 03:00 UTC: the manifest-granted manager-owned fixed bounce-buffer DMAPool path now has its own device-manager budget policy instead of importing the live virtio-net device_dma policy. The policy covers three live buffers/pages, 12288 bytes, four queues, eight descriptors per queue, one in-flight descriptor per live slot, zero MMIO mappings/bytes, and zero interrupt holds. DMAPool.allocateBuffer validates the next-live manager accounting against that policy before slot selection, frame allocation, generation allocation, result-cap minting, or manager ledger mutation. With all three fixed slots live, a fourth valid-size allocation returns no result cap and reports result=dmapool-budget-exceeded, reason=over-buffer-budget, sideEffect=side-effect-blocked, and bufferPresent=false. Imported live virtio-net proof records continue to use the device_dma:virtio-net budget policy. This remains the bounded manager-owned bounce-buffer path: direct DMA is blocked, host physical addresses and IOVAs stay hidden, descriptor rings and completion queues are not mutated, and IOMMU/remapping plus production driver consumption remain out of scope.

Implementation note, 2026-05-11 06:10 UTC: the same fixed-slot manager-owned DMAPool family now revalidates current budget accounting before publishing an allocated DMABuffer result cap, before acquire-audit publication, before parent DMAPool release intent or detach, before grant-rollback/drop/teardown detach, before pending-parent release completion, before DMABuffer page release, and before descriptor-completion state advancement. The focused grant smoke labels the full-pool budget rejection as no leaked result cap, no generation burn, no ledger mutation, and no stale authority publication. DeviceMmio and Interrupt budget propagation is not changed by this slice.

Implementation note, 2026-05-03 02:05 UTC: the schema and kernel now include a result-only DMABuffer.info skeleton that can wrap the manager-attached DmaBufferHandle record already issued inside the zero-live DMAPool lifecycle proof. The object validates the live manager-owned buffer record through validate_dmabuffer_record() and the existing pure DMA-buffer validator before returning status labels such as userspaceDmaBuffer=manager-issued-bounce-buffer, managerRecord=validated-active, bufferRecord=manager-attached-buffer, realDma=not-attempted, hostPhysicalUserVisible=false, directDma=blocked, descriptorSubmit=manager-inflight-accounting, descriptorComplete=manager-inflight-accounting, freeBuffer=bounce-buffer-page, and bootstrapGrant=blocked. The QEMU zero-live lifecycle proof constructs that cap object while the buffer record is active, records dmabuffer_cap_info_result=ok, exercises the serialized CapObject::call(0, &[]) path and decodes the returned DmaBuffer.info Cap’n Proto result as dmabuffer_cap_serialized_call_result=ok, then verifies the same cap fails closed after revoke begins as dmabuffer_cap_stale_after_revoke_result=dmabuffer-stale-handle; the same stale object also fails the serialized method-0 path as dmabuffer_cap_serialized_stale_after_revoke_result=invoke-failed. Later bounded grant slices expose this result cap through the DMAPool manifest-grant path, add typed DMABuffer.freeBuffer, and add bounded userspace bounce-buffer DMABuffer.map / DMABuffer.unmap plus manager-accounted request-shaped DMABuffer.submitDescriptor / DMABuffer.completeDescriptor; the path still exposes no real DMA, descriptor-ring mutation, CQ publication, host physical address or IOVA export, production page cleanup/reuse, or production userspace DMABuffer completion.

Implementation note, 2026-05-11 11:22 UTC: the provider-consumer smoke now uses that same bounded bounce-buffer path to prove a descriptor-ring-equivalent provider side effect. After DMABuffer.submitDescriptor validates live owner/pool/slot authority, descriptor queue/id/length bounds, no live user mapping, and no duplicate in-flight descriptor, the manager scrubs the page and writes a provider-visible shadow descriptor entry with magic, queue, descriptor id, submitted length, and flags before writing the existing submit marker and committing the in-flight record. DMABuffer.completeDescriptor still writes completion bytes only inside the validated completion length, and the smoke proves the shadow entry and marker remain visible outside that completion window. Provider-effect submits shorter than the 24-byte shadow-descriptor-plus-marker footprint are rejected as a typed no-side-effect boundary, even though the shared descriptor request shape is otherwise valid. This is a bounded bounce-buffer side effect only; no hardware descriptor ring, CQ publication, MMIO doorbell, direct DMA, IOVA, host physical address, or remapping-domain claim is added.

Implementation note, 2026-05-11 12:01 UTC: the same submit path now replaces the shadow-descriptor payload with a selected provider-owned queue entry plus marker. The entry records queue magic, queue id, tail, descriptor id, submitted length, and flags after descriptor authority validation and the submit scrub; make run-ddf-provider-consumer maps the buffer after completion and proves the queue entry and marker remain visible outside the completion window. Provider-effect submits shorter than the current 72-byte provider queue-entry-plus-marker footprint are rejected before in-flight accounting or provider-visible mutation. This remains bounded bounce-buffer evidence only: no hardware descriptor ring, CQ publication, MMIO doorbell, direct DMA, IOVA, host physical address, or remapping-domain claim is added.

Implementation note, 2026-05-12 20:30 UTC: the accepted DMABuffer.submitDescriptor path now constructs the candidate in-flight descriptor in manager-local state and validates the resulting DMAPool budget/accounting before scrubbing or writing provider-visible bounce-buffer bytes. The provider-consumer smoke snapshots the selected provider queue-entry and marker bytes, drives a short provider-effect submit rejection, and proves both those bytes and live_inflight=0 are preserved. The selected virtio-net TX publication gate remains separately bounded after provider-entry write: quiesced publication still fails with no extra pin and no doorbell, not with rollback of the already-written shadow entry.

Implementation note, 2026-05-11 14:39 UTC, branch commit f04a14f4: the selected provider-owned queue entry now carries a staged claimed virtio-net notify-offset admission record instead of only a “requires claim” gate. The selected queue 1 path records accepted notify-offset admission plus blocked wrong-queue and wrong-offset admissions after descriptor authority validation and submit scrub; a separate queue 0 submit proves non-selected queues remain neutral and blocked from selected-backend doorbell metadata. This is not a real doorbell path: no virtio-net notify BAR handle is granted, no notify register is written, no real virtio-net descriptor ring is mutated, and production userspace NIC readiness is not claimed.

Current QEMU evidence: the same make run-net path now exports a bounded live-pool snapshot from the kernel-owned virtio-net device_dma ledger and feeds it into a device-manager DMAPool record proof. The live record carries buffer/page count, live bytes, current in-flight submissions, committed/resident/unswappable flags, and scrub-before-release policy. The device-manager proof rejects both DmaMappingsRemoved and teardown detach while that authoritative ledger snapshot remains live. The proof now calls the device_dma teardown-evidence API and records the expected authoritative-ledger-live block with matching imported live accounting, then reports completion as deferred because no authoritative zero-live/scrubbed evidence is available for the live virtio-net ledger. It does not zero the imported record to simulate teardown, does not claim DmaMappingsRemoved, terminal Dead, or release for the live virtio-net pages, and does not scrub or free live virtio-net DMA pages. This is still a prerequisite record-accounting proof: the current pages remain kernel-owned, only bounded info-skeleton hardware cap grants are exposed to userspace, production page release hooks for live virtio-net DMA are not wired, IOMMU remapping is not programmed, and S.11.2 hostile smokes remain open.

The same smoke emits a separate device_dma scratch proof for the positive zero-live teardown-evidence path: teardown_evidence() fails closed before quiesce and scrub markers, rejects one-marker states, and reports authoritative-ledger-zero-live only after both markers are set. The scratch proof revalidates the live virtio-net ledger but does not mutate, zero, scrub, free, or claim teardown completion for real virtio-net DMA pages.

Implementation note, 2026-05-03 03:21 UTC: the zero-live device-manager DMAPool lifecycle proof now consumes that scratch zero-live evidence before final pool detach, and binds that evidence to the attached record source. The manager-owned zero-live record is labeled device-manager / zero-live-dmapool-bounce-buffer; imported live records keep the source labels from the authoritative device_dma snapshot. After the proof-scoped buffer record is actively freed and detached, a zero-live pool detach with mismatched virtio-net / kernel scratch evidence fails closed as dmapool-zero-live-evidence-invalid, and detach without authoritative evidence fails closed as dmapool-zero-live-evidence-absent. Only scratch authoritative-ledger-zero-live evidence carrying the same record source plus both quiesce and scrub markers allows the manager-owned record to detach and the revocation path to advance to DmaMappingsRemoved, Dead, and release. This remains scratch/no-real-DMA evidence: it does not tear down live virtio-net pages, program or remove IOMMU mappings, expose userspace DMAPool mapping/descriptor authority, or claim production page cleanup beyond the bounded manager-attached bounce page.

The current scratch proof set also covers stale DMA page handles without touching real virtio-net pages: reusing the same synthetic physical page bumps the DMA page generation, the old handle fails closed as stale-dma-handle, wrong-queue and wrong-label frees preserve the active page record, and duplicate free remains rejected. Production userspace DMAPool stale-handle smokes, descriptor-abuse coverage, revoke/reset races, and real quiesce/scrub/release remain open.

Implementation note, 2026-05-02 23:23 UTC: the kernel-owned virtio-net device_dma page release path now validates the DeviceDmaAllocation against the live ledger before any scrub or frame-allocator call, scrubs the frame through the HHDM mapping, removes the ledger entry only after scrub succeeds, and then returns the frame to the allocator. The frame scrub helper checks frame alignment, HHDM/allocator initialization, range, and allocated state before zeroing the page. make run-net emits a bounded device-dma: release scrub proof line using a proof-only kernel-owned page: stale generation, wrong queue, and wrong label release attempts fail before scrub, frame free, or ledger mutation; the active release path records scrub_before_frame_free=true and ledger_removed_after_scrub=true. This is still no-userspace-handle/no-real-teardown evidence for the current kernel-owned virtio-net path only; production DMAPool handles, real device-manager teardown hooks, IOMMU/bounce-buffer mapping removal, hostile stale-DMA smokes, and full page lifecycle cleanup remain open.

Implementation note, 2026-05-02 12:28 UTC: the same scratch proof family now covers stale DMA completion ordering without touching real virtio-net pages. A synthetic reused DMA page slot bumps generation, stale completion validation checks the page generation before queue-completion accounting, and the old completion fails closed as stale-dma-handle before completion counters can underflow or the reused page/submission state can mutate. This is still prerequisite evidence only: production userspace DMAPool hostile smokes, reset/revoke races with outstanding descriptors, CQ notification publication, real quiesce/scrub/release, and IOMMU or bounce-buffer teardown remain open.

Implementation note, 2026-05-02 14:03 UTC: that scratch stale-DMA-completion proof now adapts the synthetic DMA buffer slot into capos-lib::device_authority before completion accounting can mutate. The QEMU line records current-handle validation as ok, stale same-slot reuse as stale-slot-generation, and side-effect-blocked, then preserves the existing stale-dma-handle completion outcome. This remains a scratch/no-real-DMA validator adapter, not production userspace DMAPool authority or S.11.2 hostile-smoke completion.

Implementation note, 2026-05-02 15:15 UTC: the same scratch proof family now adds a paired stale-completion publication check. A synthetic reset bumps the device owner generation so an old completion fails as stale-owner-generation with side-effect-blocked before any CQ publication. The same-slot reuse path then fails the old completion as stale-dma-handle, preserves the new submission accounting, and records both cq_publication_blocked=true and new_owner_exposure_blocked=true. This is still scratch/no-real-DMA evidence; production userspace DMAPool completion notification, real hardware stale-completion injection, reset/revoke races, and IOMMU or bounce-buffer teardown remain open.

Implementation note, 2026-05-02 12:46 UTC: the device-manager interrupt handoff proof now includes a bounded stale IRQ after-detach check. After an attached interrupt route is detached, the proof delivers the old LAPIC vector through the dispatch path and requires stale_irq_delivery_after_detach to be unregistered with stale_irq_wake_blocked=true; the old route handle also continues to fail as interrupt-stale-route. This is prerequisite evidence for route teardown ordering only. It does not cover a pending hardware IRQ across reset, userspace Interrupt waiter wakeup semantics, or reassignment reuse by a new owner.

Implementation note, 2026-05-02 14:24 UTC: the same device-manager interrupt handoff proof now adapts the attached source into capos-lib::device_authority before exercising the active wait path. After revoke begins, the old handle fails the pure validator as stale-owner-generation with side-effect-blocked, then the proof preserves the existing interrupt-stale-route, detached-vector unregistered, and stale_irq_wake_blocked=true checks. This remains a proof adapter, not production userspace Interrupt handles, real waiters, reset/reassignment reuse proof, or S.11.2 hostile-smoke completion.

Implementation note, 2026-05-02 14:51 UTC: the interrupt handoff smoke now adds two bounded stale IRQ ordering points. After revoke begins and while the route is still attached, delivery to the old vector remains masked, matches the attached route generation, reports wake blocking, and leaves the old route delivery count unchanged. During the reset phase, a synthetic route-registry same-vector reuse proof re-registers and claims the route with bumped source and route generations, then shows delivery to that vector is still masked, matches the new route generation, and leaves the reused route delivery count unchanged. This is route-manager prerequisite evidence only: it is not a true pending hardware MSI/reset hostile smoke, does not involve userspace Interrupt waiters, and does not prove DMA buffer reuse race closure.

Implementation note, 2026-05-03 18:48 UTC: the interrupt handoff smoke now snapshots a bounded pending IRQ token from the old vector, source id, source generation, and route generation before revocation. Checking that token after revoke blocks as stale-pending-irq-masked with reason route-masked; after detach it blocks as stale-pending-irq-unregistered; and after reset/reuse it blocks as stale-pending-irq-generation with reason source-route-generation-mismatch. Each check records side-effect-blocked, wake blocking, and unchanged delivery counts, and the reset/reuse check records that the new route did not receive a delivery count. The same proof rejects a malformed pending token with a zero generation as stale-pending-irq-invalid-state before any delivery-count mutation. This was bounded token-generation evidence only and did not inject a real pending MSI across reset; the real-int $vector injection added at 2026-05-05 18:17 UTC (see below) closes the S.11.2.7 stale IRQ hostile-smoke gate row by exercising the production CPU exception entry path across the same revoke, detach, and reset/reuse boundaries.

Implementation note, 2026-05-09 18:12 UTC: the pending IRQ token decision path now has a pure capos-lib::device_authority validator. Host tests cover current-route acceptance and the same fail-closed label space used by the kernel/QEMU proof: stale source generation, stale route generation, both generations changed after reuse, source mismatch, route masked, route unregistered, invalid route state, invalid owner, malformed zero-generation or unassigned source identity, and unsupported vector. The kernel still snapshots the live dispatch slot and delivery count, but delegates the pending-token identity/state decision to that shared helper before returning stale-pending-irq-* labels. This is validator/adapter coverage only; it does not expose production userspace Interrupt waiters or wait/ack/mask/unmask authority.

Implementation note, 2026-05-05 18:17 UTC: the interrupt handoff smoke now fires a real INT $vector instruction at the device MSI vector at three points across the revoke/reset/reuse boundary, exercising the production IDT entry, extern "x86-interrupt" stub, record_lapic_delivery dispatch slot read, and LAPIC EOI write rather than the helper-call path the prior proofs used. The new proof scope strings drop “no-real-msi” and read stale-vector-after-detach-real-int-vector-injected, manager-attached-claimed-masked-after-revoke-real-int-vector-injected, route-registry-vector-reuse-during-reset-real-int-vector-injected, and bounded-pending-irq-token-generation-real-int-vector-cross-reset-injection. Each injection point requires the slot’s delivery count to remain zero before and after the real INT, and the post-INT outcome to match masked (after revoke, route still attached but masked), unregistered (after detach, slot cleared), and masked (after reset/reuse, slot now belongs to a freshly registered+claimed route with bumped source and route generations). The proof emits a closure summary s11_2_7_proof_scope=s11-2-7-stale-irq-after-reset-real-int-vector-cross-reset-injection-no-userspace-waiter, s11_2_7_real_irq_injected_across_reset=ok, s11_2_7_old_waiter_cannot_wake_new_owner=true, and s11_2_7_stale_ack_blocked=true. This closes the S.11.2.7 row of the hostile-smoke acceptance matrix at make run-net (which invokes tools/qemu-net-smoke.sh). It does not yet create a userspace Interrupt waiter object; the in-flight delivery is observed via the kernel-owned dispatch slot atomic state machine that the production path consumes. S.11.2.9 hostile-smoke gate-wiring closed 2026-05-05 20:49 UTC (see the implementation note below).

Implementation note, 2026-05-05 19:37 UTC: the device-manager hostile-smoke suite now closes the S.11.2.8 stale-DMA-completion-after-reset row. prove_qemu_stale_dma_completion_handoff claims a fresh probe-then-driver record on the virtio-net PCI BDF (separate from the live virtio-net driver state) and walks it through the same revoke, detach, and reset/reuse boundaries S.11.2.7 uses. At each boundary the proof allocates a real virtio-net DMA page through the production device_dma::allocate_virtio_net_page helper, frees the page through device_dma::free_virtio_net_page, reallocates so the live ledger’s page generation advances, and synthesizes a stale DeviceDmaAllocation keyed to the live phys with a decremented generation. The synthesized stale handle is then fed to the production device_dma::record_virtio_net_completion_for_allocation path – the same function the live virtio-net Virtqueue::record_used_completion_for_allocation invokes after descriptor tracking validates a hardware used-ring entry. The production validator rejects each stale injection as stale-dma-handle before any queue accounting decrement, completion side effect, CQ publication, or new-owner memory exposure. The bounded run-net proof records real_completion_inject_after_revoke_result=stale-dma-handle, real_completion_inject_after_detach_result=stale-dma-handle, real_completion_inject_after_reset_reuse_result=stale-dma-handle, all three with side-effect-blocked, queue_account_preserved=true, live_page_preserved=true, cq_publication_blocked=true, new_owner_exposure_blocked=true, freed_buffer_unchanged=true, and generation_bumped=true, plus a closure summary s11_2_8_proof_scope=s11-2-8-stale-dma-completion-after-reset-real-free-realloc-cross-revoke-detach-reset-reuse-no-userspace-dmapool, s11_2_8_real_completion_injected_across_reset=ok, s11_2_8_old_completion_cannot_publish_to_new_owner=true, s11_2_8_freed_buffer_reuse_blocked=true, and s11_2_8_accounting_underflow_blocked=true. The new shape is enforced in tools/qemu-net-smoke.sh and runs from make run-net. This is the production paired stale-DMA-completion proof showing old completions cannot publish stale CQ notifications or expose new-owner memory after real revoke, detach, and reset/reuse boundaries with real free + realloc page generation advances on the live kernel-owned ledger; S.11.2.9 hostile-smoke gate-wiring closed 2026-05-05 20:49 UTC (see the implementation note below). Userspace DMAPool handles and real device-manager page quiesce/scrub/release hooks remain open as separate follow-ups.

Implementation note, 2026-05-05 20:49 UTC: the S.11.2.9 hostile-smoke coverage row of the acceptance matrix is closed by aggregating every matrix-row proof line into the make run-net -> tools/qemu-net-smoke.sh gate. Every proof line referenced by the matrix has at least one assertion in the harness today; the assertion shape varies by row. The two driver-crash lines wired by that gate slice, the existing S.11.2.8 device-manager: dma completion handoff proof closure-summary line, and the S.11.2.7 device-manager: interrupt handoff proof closure-summary line (whose trailing anchor was added by this slice for harness-strictness consistency with S.11.2.8) all use anchored extended-regex assertions (field-by-field match plus proof_result=ok[[:cntrl:]]?$ trailing anchor); other matrix-row rows reuse the harness’s pre-existing mix of unanchored extended-regex and fixed-string grep -Fq assertions on the emitted proof lines. The two previously unasserted lines wired by this slice are device-manager: devicemmio driver crash hook proof source=devicemmio-driver-crash-hook ... trigger_path=trigger-driver-crash-for-devicemmio and device-manager: interrupt driver crash hook proof source=interrupt-driver-crash-hook ... trigger_path=trigger-driver-crash-for-interrupt. Both proofs were already emitted by the kernel on every boot (via prove_qemu_devicemmio_driver_crash_hook and prove_qemu_interrupt_driver_crash_hook in kernel/src/device_manager/proofs.rs) and exercise the explicit driver-crash teardown trigger path with a stale rerun noop, validate-live revoked cap state, and cap_release_after_crash as noop. The chosen gate strategy keeps S.11.2.9 inside make run-net rather than splitting into a separate make run-hostile-smokes target, because all six matrix rows depend on the same virtio-net device bring-up state (probe-then-driver records on the virtio-net BDF, real IDT vector injection, real DMA page free + reallocate). A separate target would duplicate the bring-up cost without adding coverage. Tightening the remaining unanchored assertions to the same anchored shape is a follow-up harness-hardening task and is not part of S.11.2.9 closure because each affected proof line is still uniquely identified by its emitted prefix and the asserted field set. Production userspace DMAPool/DeviceMmio/Interrupt handles, real device-manager page quiesce/scrub/release hooks, hardware-backed provider-driver Interrupt wait/ack dispatch beyond the current bounded route-dispatch waiter proof, durable/signed production audit consumption beyond the first volatile HardwareAuditLog.snapshot cap, and IOMMU domain programming all remain open as separate follow-ups tracked in docs/backlog/hardware-boot-storage.md and the loopyard userspace-driver-transition tasks.

Implementation note, 2026-05-08 09:44 UTC: the same make run-net gate now also asserts cap-specific DMA driver-crash proofs. DmaBufferCap routes the explicit trigger through the bounded FreeBuffer cleanup path and proves page scrub/ledger/frame-free labels before stale rerun and post-trigger cap release both return noop; DmaPoolCap routes the explicit trigger through the zero-live evidence-gated detach path and proves authoritative zero-live, quiesced, and scrubbed evidence labels before stale rerun and post-trigger cap release return noop.

Implementation note, 2026-05-03 01:05 UTC: the schema and kernel now include a result-only Interrupt.info skeleton that can wrap a manager-issued device handle plus the attached DeviceInterruptRoute. The object validates the live manager record, owner, claimed route, and attached route record through validate_interrupt_record() before returning status labels such as userspaceInterrupt=manager-issued-skeleton, managerRecord=validated-active, routeRecord=manager-attached-route, realInterruptDelivery=not-delivered, wait=admission-check-only, acknowledge=admission-check-only, mask=route-state-control, unmask=route-state-control, and bootstrapGrant=blocked. The interrupt handoff QEMU proof constructs that cap object while the route record is active, records interrupt_cap_info_result=ok, exercises the serialized CapObject::call(0, &[]) path and decodes the returned Interrupt.info Cap’n Proto result as interrupt_cap_serialized_call_result=ok, then verifies the same cap fails closed after revoke begins as interrupt_cap_stale_after_revoke_result=interrupt-stale-handle; the same stale object also fails the serialized method-0 path as interrupt_cap_serialized_stale_after_revoke_result=invoke-failed. A later manifest-grant smoke explicitly releases the granted Interrupt cap through CAP_OP_RELEASE and proves a subsequent typed Interrupt.info call fails closed from userspace. A focused grant-cycle smoke now repeats that grant, release, and stale-info proof twice in sequence and asserts the second manager-grant-source acquire preserves the source generation and receives a fresh route generation after the first release; the same smoke also decodes both acquire/release cycles through the typed volatile HardwareAuditLog.snapshot surface. The focused hardware-audit interrupt-waiter smoke also decodes recent boot-time DmaBuffer, DmaPool, and Interrupt driver-crash / reset-disable lifecycle records from the current volatile 16-record snapshot window. The same smoke now uses the startSequence cursor to decode older retained DeviceMmio lifecycle rows that the default latest 16-record tail has skipped. A 2026-05-09 19:18 UTC follow-up adds a bounded Interrupt.wait admission method to that skeleton. The method validates the same manager-attached route, snapshots a pending-token candidate, delegates to the shared capos-lib::device_authority pending-IRQ validator, and returns typed labels through capos-rt; the focused grant smoke asserts the current masked-route result stale-pending-irq-masked, reason route-masked, side-effect-blocked, matching token/current source and route generations, unchanged delivery counts, no waiter wake, and fail-closed behavior after cap release. This is bounded manager-issued skeleton evidence only: there is no blocking wait, real hardware acknowledgement, real hardware mask/unmask side effect, interrupt delivery authority, real waiter object, production Interrupt completion, durable/signed audit persistence, or concurrent sharing claim. A 2026-05-09 23:21 UTC follow-up adds bounded Interrupt.acknowledge admission to the same skeleton. It validates the manager-attached route through the existing Acknowledge authority path, returns admission-check-only, interrupt-ack-not-attempted, and side-effect-blocked, and proves delivery counts remain unchanged with no waiter wake or hardware acknowledgement. A 2026-05-09 23:52 UTC follow-up adds bounded Interrupt.mask and Interrupt.unmask admission to the same skeleton. They validate the manager-attached route through the existing Mask and Unmask authority paths, return admission-check-only, interrupt-mask-not-attempted / interrupt-unmask-not-attempted, and side-effect-blocked, and prove route state and delivery counts remain unchanged with no hardware mask/unmask, waiter wake, or IRQ delivery. A 2026-05-10 04:01 UTC follow-up promotes those methods to bounded route-state control over the manager-attached dispatch slot. Interrupt.unmask now changes claimed-masked to driver-unmasked, Interrupt.mask changes it back to claimed-masked, and both preserve delivery counts while still avoiding hardware MSI/MSI-X table programming, waiter wakeup, hardware acknowledgement, or real IRQ delivery. A 2026-05-10 22:54 UTC follow-up wires real waiter completion to the existing route-dispatch delivery counter from scheduler/poll context. The poller observes matching delivered routes by vector, source generation, and route generation without taking the waiter-table lock in the IRQ dispatch path, then revalidates the manager-attached route under the route-post exclusion before posting the deferred cap completion. The focused grant smoke proves the first unmasked manifest-granted wait completes as interrupt-delivered / waiter-completed-irq with real_interrupt_delivery=delivered and an advanced delivery count, while a second unmasked wait still remains pending until Interrupt.mask completes it through the existing interrupt-waiter-cancelled / waiter-completed-no-irq path. Stale, masked, released, reset, or reused routes do not wake as delivered IRQs. This remains a bounded route-dispatch proof; it does not program hardware MSI/MSI-X tables, acknowledge hardware, add provider-driver interrupt consumption, or claim hostile hardware isolation.

Implementation note, 2026-05-03 13:49 UTC: the result-only DMAPool.info, DMABuffer.info, DeviceMmio.info, and Interrupt.info surfaces now return numeric identity fields alongside the conservative status labels. The fields mirror the documented handle identity shape: deviceId, BDF bus/device/function, owner generation, and the relevant pool id/generation, buffer slot/generation, BAR/mapping id/generation, or interrupt source/generation/route generation. The QEMU proof logs and net smoke assert the active serialized method-0 decode for those fields, while stale method-0 calls still fail closed as invoke-failed. This remains a result-only manager-issued skeleton surface; it does not add production DMA allocation, free, map, submit, or completion authority, production MMIO mapping or doorbell authority, production interrupt wait/ack/mask/unmask authority, real DMA page cleanup/reuse, hostile hardware isolation, or S.11.2 completion.

Implementation note, 2026-05-03 16:37 UTC: the bounded interrupt route identity skeleton now carries separate source and route generations end to end. DeviceInterruptRoute, LegacyIoApicInterruptRoute, route records, diagnostics summaries, dispatch-slot metadata, and the device-manager attached interrupt bridge store both generations. Registration allocates both fields, PCI MSI-X route reassignment preserves source generation while bumping only the route generation, release/re-register allocates both generations fresh, and Interrupt.info returns the independent values. The QEMU net smoke asserts the split in PCI and legacy route logs, metadata proof logs, handoff proof logs, serialized Interrupt.info, and cap-release proof logs. This closes only the bounded identity proof gap; it does not expose production userspace Interrupt authority, create real waiters, or complete the S.11.2 hostile IRQ smoke matrix.

`DeviceMmio` Invariants

DeviceMmio is register authority, not memory authority.

Authority: A holder may map only BARs or subranges recorded in the claimed device object. It may not map PCI config space globally, another function’s BAR, RAM, ROM, or synthetic kernel pages.
Handle identity: Each call checks the claimed device id, owner generation, BAR or subrange mapping record, and mapping generation before mapping, unmapping, reading, or writing registers.
Physical range: Each mapping is bounded to the BAR’s decoded physical range, page-rounded by the kernel, and tagged as device memory with cache attributes appropriate for MMIO. Partial BAR grants must preserve page-level isolation; otherwise the grant must cover the whole page-aligned register window and be treated as that much authority.
Ownership: At most one mutable driver owner controls a device function’s MMIO at a time. Management capabilities may inspect topology, but register writes require the claimed DeviceMmio object.
No DMA implication: Mapping registers does not grant any DMA buffer, frame allocation, interrupt, or config-space authority. Doorbell writes are accepted only as effects of register access; descriptor validity is enforced by DMAPool before queues are made visible to the device.
Revocation: Revocation unmaps the driver’s register pages, marks the device object unavailable for new calls, and invalidates outstanding MMIO handles. Stale mappings or calls fail closed.
Reset: Revoking the final mutable DeviceMmio owner resets or disables the device unless a higher-level device manager explicitly transfers ownership without exposing it to an untrusted holder.

Shared PCI COMMAND / BAR-mapping fail-closed helpers

The shared kernel/src/pci.rs helpers that bring up a device function enforce the fail-closed DMA posture so it does not rely on the accident that the bus-master-enabling helpers happen to be called only from proof/fixture paths:

Bus mastering requires an explicit DMA-authorization assertion. enable_memory_space_and_bus_master and enable_io_space_and_bus_master take a BusMasterAuthorization witness. Because backend selection is a runtime, fail-closed decision that is legitimately either a direct IOMMU remapping domain or the kernel-owned bounce-buffer arrangement (see the runtime selection rule above), the precondition cannot be a single “IOMMU is established” runtime query; it is asserted at the call site through the witness. Each BusMasterAuthorization::assert_dma_domain_established() construction is the greppable point at which a caller takes responsibility for having established a DMA-authorization domain before DMA authority is enabled.
MMIO-only decode never leaves bus mastering set. enable_memory_space (BAR decode without DMA authority) actively clears bus mastering rather than leaving it untouched: the written COMMAND is (command | MEMORY_SPACE) & !BUS_MASTER, and the call fails closed if the read-back COMMAND still has bus mastering set (e.g. a sticky BM bit).
Teardown clears bus mastering. clear_bus_master clears the BM bit leaving other COMMAND bits intact. Every BM-enabling proof/fixture path calls it on its error and teardown paths so no device retains DMA authority after the proof drops it: the RX-autonomous MSI-X and legacy-select paths, the gVNIC adminq-register and raw-frame proofs (folded into their restore_pci_command teardown so bus mastering is cleared even if it had been set before the proof ran, rather than relying on restoring the pre-proof COMMAND), and the legacy datapath one-shot and ICMP-echo proofs plus the persistent legacy NIC runtime (which clears BM on init failure and on runtime end-of-lifetime teardown). A device reset (status = 0) does not clear PCI COMMAND bus mastering, so the explicit clear_bus_master is required even where the path already resets the device.
No mapping of an unprogrammed BAR. memory_bar_region and map_bar_region refuse a BAR whose base is physical address 0 (an unprogrammed BAR firmware never assigned a window). Mapping 0 + offset would decode as real low physical memory rather than the device’s MMIO window, so the region/map is rejected fail-closed. The guard lives in both the region-builder and the address-computing map step, because PciBarRegion values may be constructed directly by a caller and not only via memory_bar_region.

`Interrupt` Invariants

Interrupt is event authority for one routed source.

Authority: A holder may wait for, mask/unmask where supported, and acknowledge only its assigned vector, line, or MSI/MSI-X table entry. It may not reprogram arbitrary interrupt controllers or claim another source.
Handle identity: Each wait, mask, unmask, and acknowledge checks the claimed device id, owner generation, source id, source generation, route generation, and any live waiter generation before affecting delivery state.
Ownership: Each interrupt source has one delivery owner at a time. Shared legacy lines must be represented as a kernel-demultiplexed object with explicit device membership, not as ambient access to the whole line.
Range: The capability records the hardware source, vector, trigger mode, polarity, and target CPU/routing state. User-visible operations are checked against that record.
Revocation: Revocation masks or detaches the source, drains pending notifications for the old holder, invalidates waiters, and prevents stale acknowledgements from affecting a new owner.
Reset: If the source cannot be detached cleanly, the owning device is reset or disabled before the interrupt is reassigned.
No MMIO or DMA implication: Interrupt delivery does not grant register access, DMA buffers, or packet memory.

Revocation Ordering

Device revocation must follow a fixed order:

Stop new submissions by invalidating the driver’s user-visible handles.
Revoke MMIO write authority by write-blocking or unmapping BAR pages, or by disabling the device before any DMA teardown starts.
Mask or detach interrupts.
Quiesce virtqueues or device command queues.
Reset or disable the device if in-flight DMA cannot be accounted for.
Remove IOMMU mappings or invalidate bounce-buffer handles.
Scrub and free DMA pages.

This order prevents a stale driver from racing revocation with doorbell writes, interrupt acknowledgement, or descriptor reuse. Logical handle invalidation is not sufficient while a BAR remains mapped; register-write authority must be removed or the device must be disabled before descriptor or DMA-buffer ownership is reclaimed.

Implementation should represent the order as an explicit device-owner state machine rather than as ad hoc booleans:

#![allow(unused)]
fn main() {
enum DeviceOwnerState {
    Active,
    RevokingHandles,
    MmioRevoked,
    InterruptsDetached,
    QueuesQuiesced,
    Resetting,
    DmaMappingsRemoved,
    Dead,
}
}

No path may free or reassign DMA pages until the state has reached QueuesQuiesced with all in-flight descriptors accounted for, or Resetting has completed and the device can no longer write old buffers. Dead means all user-visible handles are invalid, interrupts are detached or masked, DMA mappings are removed, and pages have been scrubbed or transferred to a trusted owner.

Hard invariants:

DMA pages cannot be freed before QueuesQuiesced or a completed Resetting transition proves old DMA writes are stopped.
MMIO write authority must be revoked before DMA ownership teardown.
Interrupt reassignment cannot happen before old pending notifications are drained or generation-invalidated.
Device reset is mandatory if in-flight DMA cannot be proven stopped.

Future Userspace-Driver Transition Criteria

Moving NIC or block drivers out of the kernel is gated by Security Verification Track S.11.2. The gate is only open when all rows below are implemented and demonstrated. The S.11.2.N labels are local checklist row IDs for this gate.

The completed Device Driver Foundation selected milestone used this track as the prerequisite for the DMAPool, accounting, and hostile-smoke sub-gate. Future DDF follow-ups still use these rows as the userspace-driver transition gate: generic MSI/MSI-X dispatch and second-device reuse may land first, but userspace DeviceMmio and Interrupt exposure stays blocked until these rows pass.

Production `DMAPool` Ledger Prerequisite

Before userspace NIC or block drivers receive DeviceMmio, Interrupt, or DMAPool handles, the device manager must own one ledger of record for each claimed device. That ledger is the authoritative source for every device-visible hold, not a diagnostic mirror of separate subsystems.

The ledger records:

DMA pool bytes reserved and live;
DMA buffer count, slot generation, and owner generation;
mapped userspace DMA VMAs, quiesce state, scrub state, and release eligibility for each attached DMA pool;
descriptor and ring depth limits, including live in-flight submissions and completions;
page-rounded MMIO mappings and their owning DeviceMmio generations;
interrupt holds, waiter generations, and routed-source generations;
budget and OOM policy for allocation, queue growth, mapping, and interrupt attachment;
teardown state in the device-owner state machine.

Every operation that creates, consumes, or releases device-visible authority must update this ledger as part of the same ownership transaction that changes device-manager state. That includes DMA buffer allocation/free, descriptor submission, completion accounting, BAR mapping/unmapping, interrupt attach/detach, reset, revoke, process exit, and capability release.

Implementation note, 2026-05-03 13:18 UTC: the QEMU virtio-rng metadata path now runs a bounded teardown-trigger proof for cap-release, process-exit, driver-crash, reset-disable, interrupt-waiter, future-devicemmio, and future-dmapool. Each trigger row sequentially claims and transfers the same PCI function, begins revocation, walks the existing device-owner state machine to Dead, releases only after Dead, and proves generation bumps, stale handle rejection, direct state-skip rejection, pre-Dead release rejection, and per-trigger coverage without duplicates. The cap-release row attaches a bounded manager-owned DeviceMmio record to the active driver handle, removes a DeviceMmioCap from a cap table, runs the CapOpRelease hook, and records cap-table removal plus detached/stale manager validation before normal revocation. The process-exit row attaches the same bounded DeviceMmio record shape to a real proof Process, runs Process::release_caps_for_exit(), and records cap-table removal plus detached/stale manager validation before normal revocation. The driver-crash, reset-disable, and interrupt-waiter rows register and claim bounded PCI MSI-X lifecycle-probe routes, attach them to the device manager, prove InterruptsDetached is blocked as interrupts-attached, detach and release the routes while still in MmioRevoked, and then advance normally. The future-devicemmio row attaches a bounded manager-owned DeviceMmio record from the first decoded PCI memory BAR, proves MmioRevoked is blocked as devicemmio-attached, detaches while still in RevokingHandles, and then advances normally. The future-dmapool row attaches a bounded zero-live DMAPool record, proves DmaMappingsRemoved is blocked as dmapool-attached, detaches while still in Resetting, and then advances normally. The generic teardown-trigger summary reports no label-only rows and seven object-backed rows, while the route-aware interrupt handoff smoke also labels the claimed MSI-X route as bounded interrupt-waiter blocker evidence: interrupt_waiter_object=interrupt-route-record, interrupt_waiter_block_state=InterruptsDetached, interrupt_waiter_block_result=interrupts-attached, interrupt_waiter_detach_result=ok, and interrupt_waiter_route_generation_preserved=true. This bounded route-record evidence is contract proof for the shared ownership transaction only: it does not expose production userspace authority handles, real MMIO, real DMA, a userspace waiter, or production crash/reset observers. Separate first DeviceMmioCap, InterruptCap, DmaPoolCap, and DmaBufferCap release-hook proofs now exercise both the production ring CAP_OP_RELEASE dispatch path and a real Process::release_caps_for_exit() path for those cap objects, validating cap-table removal plus exact manager-owned DeviceMmio detach, manager-attached interrupt-route release, bounded zero-live DMAPool detach, or proof-owned DMA-buffer record cleanup. The generic route-record trigger rows and remaining DMA production work do not yet implement production observers, production interrupt-waiter objects, userspace DeviceMmio, production userspace DMAPool/DMABuffer authority, full device authority, or true pending hardware MSI/reset-hostile route teardown.

Implementation note, 2026-05-08 10:08 UTC: the first cap-specific reset/disable trigger entry points now exist for DeviceMmioCap and InterruptCap. trigger_reset_disable_for_devicemmio and trigger_reset_disable_for_interrupt route through the same idempotent stale-safe detach helpers as cap release and driver-crash cleanup, emit one cap-audit: ... event=reset-disable detach=ok line on the first successful trigger, and keep stale reruns silent. This is still bounded trigger plumbing: the reset/disable observer, non-proof DMA cleanup integration, userspace MMIO/interrupt operations, and IOMMU-backed remapping work remain future requirements.

Implementation note, 2026-05-08 10:39 UTC: the DMA caps now have the matching cap-specific reset/disable trigger plumbing. DmaPoolCap::on_reset_disable uses the same authoritative zero-live/quiesced/scrubbed evidence-gated detach as cap release and driver-crash cleanup. DmaBufferCap::on_reset_disable reuses the bounded FreeBuffer authority validation and page-scrub/frame-free cleanup path, then leaves the parent pool attached until staged zero-live cleanup. make run-net asserts the dmabuffer-reset-disable-hook and dmapool-reset-disable-hook proof lines, stale rerun noop, revoked cap validation, post-trigger release noop, and exact-one cap-audit: cap={dmabuffer,dmapool} event=reset-disable lines. This is still proof-owned no-real-DMA cleanup; production userspace DMAPool/DMABuffer authority and non-proof page lifecycle integration remain future work.

Budget or OOM failure is closed before the driver can observe a new handle, program a descriptor, map MMIO, attach an interrupt, or ring a doorbell. A failed submission must leave no live descriptor hold behind, or must leave an explicit in-flight record that teardown can drain or reset. A completed teardown must reconcile the ledger to zero live DMA buffers, zero live MMIO mappings, zero interrupt holds, and no in-flight descriptor submissions for the released device generation.

Implementation note, 2026-05-02 06:59 UTC, updated 2026-05-11 06:10 UTC: the current kernel-owned virtio-net ledger now proves the closed budget/OOM cases above with a scratch ledger and the live ledger validation described earlier. Imported live device-manager DMAPool records still preserve the device_dma:virtio-net source policy and prove imported live accounting stays within its aggregate in-flight budget while preserving that policy’s per-queue queue/submission depth limits. The manifest-granted manager-owned bounce-buffer DMAPool path now attaches its own device-manager budget policy to userspace DMAPool.allocateBuffer handle creation and the current fixed-slot DMAPool/DMABuffer transfer, release, pending-release, drop, rollback, teardown-detach, page-release, and descriptor-completion cleanup paths. The full eight-slot pool fails as dmapool-budget-exceeded / over-buffer-budget before allocation, cap minting, or ledger mutation, and the selected release paths revalidate current or next accounting before advancing manager-owned state. Production userspace DMAPool records must still attach budget checks to broader provider-driver transfer/revoke/reset transactions, IOMMU or direct-DMA mapping state, and non-fixed-slot allocation before this row can be treated as the complete userspace-driver transition gate.

Implementation note, 2026-05-02 08:33 UTC: the QEMU virtio-rng metadata path now runs a bounded DMAPool record lifecycle proof on the device-manager teardown state. The first slice keeps the record zero-live: it records a pool slot, pool generation, and owner generation, rejects stale and owner-mismatched attach attempts, rejects duplicate attachment, and proves that begin_revocation invalidates the user-visible pool handle by bumping the device owner generation. The ordered teardown path now fails closed with dmapool-attached if it tries to enter DmaMappingsRemoved while the pool record remains attached. The current continuation also proves that the revoke handle cannot detach the zero-live pool without scratch authoritative zero-live, quiesced, and scrubbed evidence bound to that record’s source; a mismatched scratch source is rejected before detach. With matching proof-scoped evidence, the record detaches after queues are quiesced/reset and before DmaMappingsRemoved. Later bounded manifest grants expose conservative DMAPool, DeviceMmio, and Interrupt surfaces; the current DMAPool grant can mint only eight fixed manager-attached proof DMABuffer result caps. The remaining gap is production userspace authority, allocation beyond those eight fixed slots, real device-visible page allocation through the device manager, non-proof DMA page lifecycle integration, IOMMU remapping, and the S.11.2 hostile smoke matrix.

Current QEMU evidence: the QEMU virtio-net path now adds the corresponding imported live-accounting prerequisite proof. A device-manager DMAPool record is attached with accounting derived from the live device_dma ledger: live buffer/page count, live bytes, current in-flight submissions, committed/resident/unswappable residency flags, and scrub-before-release policy. DmaMappingsRemoved fails closed with dmapool-attached while the record remains attached, direct teardown detach fails closed with dmapool-live while the authoritative ledger remains live, and the live proof consumes the device_dma teardown-evidence API, observes authoritative-ledger-live with matching imported live accounting, and explicitly defers completion with no real DMA teardown attempted. The same proof path now validates the imported DMAPool record through capos-lib::device_authority for the active handle and stale-after-revoke failure labels. This does not create production userspace handles, real page-release hooks, IOMMU mapping invalidation, scrubbed release, terminal Dead, or hostile-smoke coverage for the live virtio-net record. The companion scratch-ledger proof covers the positive zero-live teardown-evidence result without claiming that the live virtio-net record has been torn down. The manager-owned zero-live lifecycle proof consumes matching-source device-manager teardown evidence for the positive detach/DmaMappingsRemoved path and separately proves mismatched-source and missing-evidence detach attempts fail closed. The manifest-granted bounded DMAPool path now keeps mapped userspace VMA count, in-flight descriptor holds, residency, quiesce/scrub state, and release eligibility in that manager record. Borrowed or device-visible pages remain committed, resident, unswappable, generation-bound, and unavailable for reuse until the manager record is zero-live, unmapped, quiesced, and scrubbed. Descriptor submission is refused while a buffer is borrowed to userspace, and release consumes manager-owned teardown evidence instead of proof-only device_dma zero-live evidence. The corresponding DMAPool.info ABI reports mapped VMA count, quiesce state, scrub state, and release eligibility for QEMU proof assertions. This is still bounded bounce-buffer lifecycle authority only: direct DMA, host physical or IOVA exposure, IOMMU/remapping, production provider-driver consumption, durable audit, and broader transfer/revoke policy remain future work.

Gate item	Required state	Must-have proof
S.11.2.0 DMA-owned buffers	`DMAPool` owns every driver-visible DMA mapping.	A driver receives opaque buffer handles or IOVA-only values; no path hands out raw host physical addresses.
S.11.2.1 Bound checks	Allocation, descriptor chain length, alignment, segment length, and ring depth are bounded and constant-time validated before ring submission.	Ring submissions fail closed on overflow, wrap, stale-handle, and freed-handle reuse attempts.
S.11.2.2 Explicit remap/ownership	`DeviceMmio` can only grant claimed BAR pages; cache attributes and write policy are enforced.	Driver cannot access unclaimed BARs, ROM, RAM pages, config-space globals, or stale mappings after revoke.
S.11.2.3 Interrupt correctness	`Interrupt` owns exactly one logical source at a time and drains/waits only for that source.	Reassigning an owner invalidates old waiters and masks or detaches the source first.
S.11.2.4 Quiesce + reset contract	Device manager can force reset/disable on failed revoke or teardown.	No in-flight descriptor may continue touching freed buffers after driver removal.
S.11.2.5 Process lifecycle	Capability release, process exit, and process-spawn cleanup paths cannot leak DMA pages/MMIO/intr ownership.	Crash-path teardown removes holds and invalidates user-visible handles before page free.
S.11.2.6 Isolation and accounting	Security Verification Track S.9 quota and authority ledgers include DMA, MMIO, and interrupt hold edges.	A malicious or buggy driver cannot consume more than its allocated authority budget.
S.11.2.7 Stale IRQ ordering	Stale interrupt delivery after revoke cannot wake, acknowledge, or signal a new owner.	Interrupt generation mismatch is ignored, or the source is masked/detached/reset before reassignment. Hostile smoke revokes a driver while an interrupt is pending, reassigns the source, and proves the old waiter cannot wake against the new owner. Closed `2026-05-05 18:17 UTC` by `make run-net`’s `device-manager: interrupt handoff proof` line: real `INT $vector` injection across revoke, detach, and reset/reuse exercises the production IDT entry/handler/EOI path, asserts `s11_2_7_real_irq_injected_across_reset=ok`, `s11_2_7_old_waiter_cannot_wake_new_owner=true`, and `s11_2_7_stale_ack_blocked=true`, and is enforced by `tools/qemu-net-smoke.sh`. Userspace `Interrupt` waiter objects remain a future requirement for a full production-driver path.
S.11.2.8 Stale DMA completion ordering	Stale DMA completion after revoke cannot cause freed buffer reuse, stale CQ notification, or new-owner memory exposure.	Closed `2026-05-05 19:37 UTC` by `make run-net`’s `device-manager: dma completion handoff proof` line: real virtio-net DMA page free + reallocate cycle bumps the live page generation, then the production `device_dma::record_virtio_net_completion_for_allocation` path (the same function the live `Virtqueue::record_used_completion_for_allocation` invokes) is fed a stale `DeviceDmaAllocation` keyed to the live phys with a decremented generation, at three boundaries (after revoke, after detach, after reset/reuse). All three reject as `stale-dma-handle` with `side-effect-blocked`, queue accounting unchanged, live new-owner page preserved, no CQ publication, no new-owner exposure, and the freed-buffer slot remaining unchanged. The closure summary asserts `s11_2_8_real_completion_injected_across_reset=ok`, `s11_2_8_old_completion_cannot_publish_to_new_owner=true`, `s11_2_8_freed_buffer_reuse_blocked=true`, and `s11_2_8_accounting_underflow_blocked=true`, and is enforced by `tools/qemu-net-smoke.sh`. Prior acceptance text: in-flight DMA is accounted for, or device reset/disable completes before buffer reuse; hostile smoke covers revoke/reset with outstanding descriptors and proves no old completion can publish new-owner memory. S.11.2.9 hostile-smoke gate-wiring also closed `2026-05-05 20:49 UTC` (see the row below). Userspace `DMAPool` handles and real device-manager page quiesce/scrub/release hooks remain open as separate follow-ups.
S.11.2.9 Hostile-smoke coverage	QEMU/CI smokes cover stale handles, descriptor abuse, revoke races, stale IRQ after reset, stale DMA completion after reset, and exit-under-dma.	Smoke output has explicit closed-case proof lines for each above failure mode. Closed `2026-05-05 20:49 UTC` by aggregating the existing per-row proof lines into the `make run-net` -> `tools/qemu-net-smoke.sh` gate. Every matrix-row proof line has at least one assertion in the harness; the original two driver-crash assertions, the existing S.11.2.8 `device-manager: dma completion handoff proof` closure-summary assertion, and the S.11.2.7 `device-manager: interrupt handoff proof` closure-summary assertion (whose trailing anchor was added by this slice for harness-strictness consistency with S.11.2.8) all use the anchored extended-regex shape (field-by-field match plus `proof_result=ok[[:cntrl:]]?$` trailing anchor), and the other matrix-row rows reuse the harness’s pre-existing mix of unanchored extended-regex and fixed-string `grep -Fq` assertions. A `2026-05-08 09:44 UTC` follow-up adds anchored assertions for the cap-specific `dmabuffer-driver-crash-hook` and `dmapool-driver-crash-hook` proof lines; a `2026-05-08 10:08 UTC` follow-up adds anchored assertions and exact-one audit counts for the first cap-specific `devicemmio-reset-disable-hook` and `interrupt-reset-disable-hook` proof lines; a `2026-05-08 10:39 UTC` follow-up does the same for `dmabuffer-reset-disable-hook` and `dmapool-reset-disable-hook`; a `2026-05-08 13:42 UTC` follow-up (`aeef8b41`) adds the cap-specific `device-manager: interrupt waiter hook proof source=interrupt-waiter-hook ... trigger_path=trigger-interrupt-waiter-for-interrupt` assertion plus an exact-one `cap-audit: cap=interrupt event=interrupt-waiter` count. Per-row coverage: stale DMA handle (`device-dma: stale dma handle proof`, `device-dma: live stale dma completion accounting proof`); descriptor abuse (`virtio-net: software descriptor generation model proof`, `virtio-net: invalid used descriptor id software-token proof`, `virtio-net: descriptor generation guard proof ok`, `virtio-net: invalid used descriptor id live software-token proof ok`, plus `device-dma: budget oom proof`); revoke/reset race (`device-manager: ownership proof`, the seven `device-manager: teardown trigger proof trigger=...` variants plus the final aggregate, `device-manager: dma completion handoff proof` for S.11.2.8, `device-manager: interrupt handoff proof` for S.11.2.7, the `device-manager: devicemmio driver crash hook proof source=devicemmio-driver-crash-hook ... trigger_path=trigger-driver-crash-for-devicemmio`, `device-manager: interrupt driver crash hook proof source=interrupt-driver-crash-hook ... trigger_path=trigger-driver-crash-for-interrupt`, `device-manager: dmabuffer driver crash hook proof source=dmabuffer-driver-crash-hook ... trigger_path=trigger-driver-crash-for-dmabuffer`, `device-manager: dmapool driver crash hook proof source=dmapool-driver-crash-hook ... trigger_path=trigger-driver-crash-for-dmapool`, `device-manager: devicemmio reset disable hook proof source=devicemmio-reset-disable-hook ... trigger_path=trigger-reset-disable-for-devicemmio`, `device-manager: interrupt reset disable hook proof source=interrupt-reset-disable-hook ... trigger_path=trigger-reset-disable-for-interrupt`, `device-manager: dmabuffer reset disable hook proof source=dmabuffer-reset-disable-hook ... trigger_path=trigger-reset-disable-for-dmabuffer`, `device-manager: dmapool reset disable hook proof source=dmapool-reset-disable-hook ... trigger_path=trigger-reset-disable-for-dmapool`, and `device-manager: interrupt waiter hook proof source=interrupt-waiter-hook ... trigger_path=trigger-interrupt-waiter-for-interrupt` lines, all requiring first-trigger `ok`, stale rerun `noop`, cap `validate_live=revoked`, post-trigger release `noop`, and `proof_result=ok` with cap-specific cleanup/evidence labels); stale IRQ after reset (S.11.2.7 closure summary, see row above); stale DMA completion after reset (S.11.2.8 closure summary, see row above); exit-under-DMA (`device-manager: teardown trigger proof trigger=process-exit owner=virtio-rng`, the teardown-trigger aggregate `triggers=cap-release,process-exit,driver-crash,reset-disable,interrupt-waiter,future-devicemmio,future-dmapool` line, the four cap-release-hook proofs each containing `process_exit_path=process-release-caps-for-exit`, plus `hardware-cap-release: ... reason=process-exit` count assertions). A `2026-05-23 21:34 UTC` follow-up adds the IOMMU production `DMAPool` hostile proof over the active mapped ledger, covering stale IOVA after revoke/reset, descriptor abuse, revoke/reset race ordering, stale completion after reset, teardown-under-DMA ordering, cross-domain stale-handle attempts, and the fail-closed teardown branch proof; process-exit/exit-under-DMA remains the existing run-net bounce-buffer evidence. Production userspace `DeviceMmio`/`Interrupt` handles, broader non-proof device-manager page quiesce/scrub/release hooks outside the selected IOMMU smoke, hardware-backed provider-driver `Interrupt` wait/ack dispatch beyond the bounded route-dispatch waiter proof, and durable/signed production audit consumption beyond the first volatile `HardwareAuditLog.snapshot` cap remain open as separate follow-ups.

For each row, the transition requires an owner, implementation notes, and a CI-backed verification path. Until all rows pass, Phase 4.2 NIC/block drivers remain in-kernel for functionality, and only kernel-mapped bounce-buffer mode is allowed for prototype DMA.

Hostile-Smoke Acceptance Matrix

These smokes are the acceptance requirements for the userspace driver transition. The S.11.2.7, S.11.2.8, and S.11.2.9 rows are now backed by current make run-net QEMU evidence enforced by tools/qemu-net-smoke.sh (see the per-row “Closed” notes for closure timestamps and the proof-line shapes). The other matrix rows remain acceptance requirements for future implementation work; their proof lines are emitted by the kernel today and asserted by the same harness, but the production userspace handles, real device-manager page quiesce/scrub/release hooks, real userspace Interrupt waiter objects, IOMMU domain programming, and durable/signed production audit consumption beyond the volatile HardwareAuditLog.snapshot cap that complete each row’s full closure remain open as separate follow-ups.

Hostile case	Required setup	Closed-case proof expectation
Stale DMA handle	Allocate a DMA buffer, revoke or free it, advance the slot or pool generation, then attempt descriptor submission or buffer reuse through the old handle.	The operation fails closed on generation mismatch; no descriptor is made visible to the device, no DMA byte or buffer hold is restored, and any reused slot remains owned only by the new generation.
Descriptor abuse	Submit chains with out-of-pool addresses, stale or freed buffer slots, arithmetic wrap, misalignment, overlong segments, excessive chain length, or ring-depth overflow.	Validation rejects the chain before any doorbell write; the ledger shows no leaked descriptor hold, no in-flight increment without an owning buffer, and no access outside the pool range.
Revoke/reset race	Race revoke, reset, or process teardown against a driver that is submitting descriptors or ringing the device doorbell.	Revocation first invalidates handles and MMIO write authority; later submissions fail closed, existing in-flight records are either completed under the old generation or reset/disabled before page reuse, and teardown cannot skip to `DmaMappingsRemoved` while the ledger has live submissions.
Stale IRQ after reset	Leave an interrupt pending or a waiter blocked, reset or reassign the device/source, then deliver or acknowledge using the old generation.	The old waiter cannot wake against the new owner, stale acknowledgements do not affect the reassigned source, and the source is masked, detached, or generation-invalidated before reassignment. Closed `2026-05-05 18:17 UTC`: `make run-net` injects a real `INT $vector` through the IDT/handler/EOI path at three points across revoke, detach, and reset/reuse and records `s11_2_7_real_irq_injected_across_reset=ok`, `s11_2_7_old_waiter_cannot_wake_new_owner=true`, `s11_2_7_stale_ack_blocked=true`, plus matching `real_irq_inject_after_revoke_result=masked`, `real_irq_inject_after_detach_result=unregistered`, `real_irq_inject_after_reset_reuse_result=masked` on the kernel proof line.
Stale DMA completion after reset	Reset with outstanding descriptors, reuse or prepare to reuse pool slots, then inject or observe a completion from the old device generation.	The stale completion cannot publish a CQE to a new owner, cannot expose new-owner memory, cannot underflow accounting, and cannot make a freed buffer eligible for reuse unless reset/disable has proven old DMA stopped. Closed `2026-05-05 19:37 UTC`: `make run-net` walks a fresh device-manager record on the virtio-net BDF through the `Active>RevokingHandles>MmioRevoked>InterruptsDetached>QueuesQuiesced>Resetting>DmaMappingsRemoved>Dead` revocation path, exercises a real virtio-net DMA page free + reallocate cycle at three boundaries (after revoke, after detach, after reset/reuse), and feeds a synthesized stale `DeviceDmaAllocation` (live phys, decremented generation) to the production `device_dma::record_virtio_net_completion_for_allocation` path. Each boundary records `real_completion_inject_after_*_result=stale-dma-handle`, `_side_effect=side-effect-blocked`, `_queue_account_preserved=true`, `_live_page_preserved=true`, `_cq_publication_blocked=true`, `_new_owner_exposure_blocked=true`, `_freed_buffer_unchanged=true`, and `_generation_bumped=true`, plus a closure summary `s11_2_8_real_completion_injected_across_reset=ok`, `s11_2_8_old_completion_cannot_publish_to_new_owner=true`, `s11_2_8_freed_buffer_reuse_blocked=true`, `s11_2_8_accounting_underflow_blocked=true`.
Exit-under-DMA	Terminate or crash a driver process while it holds DMA buffers, MMIO mappings, interrupt waiters, and in-flight descriptors.	Process exit enters the device-manager teardown path, invalidates all user-visible handles, revokes MMIO, detaches interrupts, quiesces or resets queues, scrubs DMA pages before release, and reports a terminal ledger with no live holds for the old owner generation.

Security Verification Track S.11.2 Decision Record

Security Verification Track S.11.2 is backend-scoped. The current brokered-bounce userspace-provider path has enough reviewed evidence to close the retained DDF production-authority finding, but that closeout is not a general direct-DMA, hostile-hardware, or device-autonomous interrupt claim.

Current status: the brokered-bounce transition path is represented by done task evidence for DMAPool, DeviceMmio, and Interrupt lifecycle ownership, provider virtio-net/NVMe chains, and hardware-audit consumption of abort-held DMA mappings. The broader S.11.2 matrix remains the canonical gate for future direct-remapping/vIOMMU, trusted-sharing-group, hostile-hardware-isolation, or provider-written-address work. This document fixes the production handle epoch invariants, DMAPool ledger of record, and hostile-smoke acceptance criteria used by the completed Device Driver Foundation documentation gate. The current QEMU virtio-net path has a kernel-owned DMA pool ledger for page, descriptor, MMIO mapping, and interrupt-hold accounting proof coverage plus static IOMMU attachment-policy reporting for retained DMA-capable PCI functions and the bounded teardown trigger contract proof, bounded kernel-owned budget/OOM proof, manager-bound DMAPool budget-profile proof plus bounded budget-policy tamper and accounting-over-budget fail-closed proofs, bounded manager-owned DeviceMmio proof adapter bound to decoded PCI memory-BAR metadata plus future cache/write-policy metadata, bounded zero-live device-manager DMAPool record lifecycle proof, and imported live-accounting block/defer proof plus zero-live teardown-evidence scratch proof, stale DMA handle scratch proof, stale DMA completion scratch proof, paired scratch CQ-publication/new-owner-exposure proof, live software descriptor-generation guard proof, bounded invalid used-descriptor-id proof, and bounded stale IRQ after-detach, counter-backed after-revoke, counter-backed route-registry reset-reuse, and pending IRQ token checks described above. The bounded pure capos-lib::device_authority validator and host tests cover the documented identity, state, side-effect-blocking, non-wrapping epoch cases, and every current operation variant’s exact blocked side-effect label for stale owner/subrecord, freed, revoked, and retired failures. The zero-live device-manager DMAPool lifecycle proof now validates a proof-scoped tampered budget-policy record through the manager policy helper and records fail-closed, no fake allocation, no ledger mutation, no teardown advancement, and side-effect blocking while preserving the positive budget_policy_result=ok path. The positive zero-live and imported-live budget-accounting labels now go through the manager-owned active-record helper, and synthetic over-budget attached-accounting candidates fail closed with exact reasons while preserving the active manager record and blocking allocation, ledger, teardown, and side effects; an over-budget attach candidate fails before pool generation allocation. It also records a bounded manager-attached DMA buffer handle under the attached pool, validates active SubmitDescriptor and manager-record CompleteDescriptor through the pure DMA-buffer validator, and records stale-after-revoke, freed-buffer, and reused-slot rejection with exact reasons and side-effect-blocked; it now also blocks pool teardown as dmapool-buffer-attached, rejects a stale same-slot proof-scoped FreeBuffer as dmabuffer-stale-handle with stale-slot-generation and side-effect-blocked, rejects wrong-owner-generation, wrong-pool, wrong-pool generation, and wrong-buffer-slot FreeBuffer attempts with exact pure validator reasons and side-effect-blocked, preserves that manager-owned buffer record after each failed free, and clears the record only after a proof-scoped active FreeBuffer validation, proof-page scrub/free, and manager-owned buffer-record detach. The completion proof does not publish a CQ entry, complete a real descriptor, grant userspace authority, or clean up or reuse production userspace DMA pages. The live virtio-net queue-completion path now gates completion accounting on the completed descriptor’s DeviceDmaAllocation rather than the queue id alone: callers must validate the used descriptor id, recover the matching DmaPage, and pass its physical address, queue, label, and generation to the kernel-owned ledger before in-flight accounting is decremented. The paired run-net proof records that a stale generation for a live kernel-owned page fails as stale-dma-handle, leaves queue accounting and the live page unchanged, and blocks CQ publication plus new-owner exposure. This closes a live accounting prerequisite only; it does not inject a real post-reset device completion or expose userspace DMA authority. The live virtio-net used-ring path also carries bounded software descriptor generations: submissions reject invalid or already-active descriptor ids before accounting, completions must consume the active software token exactly once, and the run-net proof records side-effect blocking for active reuse, double completion, and an old software token after descriptor-id reuse. That guard does not make a stale hardware used-ring id distinguishable after deliberate id reuse because virtio used entries carry no device generation. The same gate now also covers invalid used-descriptor ids without corrupting the hardware ring: an out-of-range id fails as descriptor-id-out-of-range before completion observation, completion accounting, used_seen_idx, CQ publication, or new-owner exposure can change. This is still a software-token and constructed-token prerequisite, not a real malformed-device or post-reset completion injection. The same zero-live proof now also constructs the result-only DMAPool.info cap skeleton from the manager-issued DmaPoolHandle, validates the active manager record before returning conservative status labels plus numeric device/BDF/owner/pool identity fields, proves the serialized cap call path decodes to those labels and identity fields with host physical exposure off and direct DMA blocked, and proves the cap’s info path fails closed as dmapool-stale-handle after revoke begins. It also exercises DMAPool.allocateBuffer through call_with_table() on a real cap-table entry, returns zero-indexed DMABuffer result caps for eight fixed manager-owned bounce-buffer slots, validates those result caps’ DMABuffer.info, and proves a ninth allocation fails through the manager-owned budget policy as dmapool-budget-exceeded / over-buffer-budget before publishing another result cap or corrupting live slot state; full-pool allocation also preserves manager generation counters. Stale-after-revoke allocations still fail closed without publishing another result cap. The same zero-live proof constructs the result-only DMABuffer.info cap skeleton from the manager-attached DmaBufferHandle, validates the active manager-owned buffer record through the pure DMA-buffer validator before returning conservative no-authority labels plus numeric device/BDF, owner/pool/slot identity fields, proves the serialized cap call path decodes to those labels and identity fields with host physical exposure off and direct DMA blocked, and proves the cap’s info path fails closed as dmabuffer-stale-handle after revoke begins; the same stale cap’s serialized method-0 path fails as invoke-failed. The first DmaBufferCap release hook now reuses the bounded FreeBuffer validation shape to clear only the manager-attached proof_buffer record during cap-table removal, production ring CAP_OP_RELEASE, and real Process::release_caps_for_exit() paths. It proves stale same-slot release is side-effect-blocked, proves the parent DMAPool remains attached after buffer release, proves the bounded manifest grant can allocate the slot again after explicit freeBuffer with a fresh slot generation, and still requires staged zero-live evidence before the parent pool can detach. The selected provider-TX path now adds a bounded exception to the default manager-accounting descriptor contract: queue 1 submits may publish the selected eight-entry TX queue depth, descriptors 0..7, into the existing kernel-owned virtio-net TX ring before the first completion, ring one selected notify doorbell per accepted provider descriptor, and then complete each descriptor only after DMABuffer.completeDescriptor observes the matching used-ring entry for the stored software descriptor generation. Those handoffs clear the matching manager in-flight records, record bounded provider CQ completion and acknowledgement counts, and can deliver ordered bounded completion events to live tx_interrupt.wait calls for the same selected route. The selected provider-TX path also proves a teardown-only drain when one descriptor has completed and seven provider-published descriptors remain incomplete: direct DMABuffer.freeBuffer remains blocked while in flight, release explicitly drains only the incomplete matching used-ring entries and retires those allocation-backed TX DMA queue ledgers without DMABuffer.completeDescriptor results, no provider CQ/IRQ event is published for the quiesced descriptors, release retires seven delivered-but-unacked completion events, and later slot reuse requires a fresh generation plus normal completion. Wrong-queue, stale-buffer, stale-notify, inflight-publication, wrong-descriptor, duplicate-completion, and stale-tx_interrupt issue paths remain side-effect-blocked before their guarded effects. This does not grant direct DMA, arbitrary doorbells, arbitrary CQ ownership outside the selected TX route, full virtio-net ownership, production NIC/storage migration, IOMMU programming, hardware IRQ ownership, hardware acknowledgement, or broad interrupt ownership beyond the bounded selected TX MSI-X mask/unmask proof. The bounded DeviceMmio proof also records the manager-attached policy metadata listed above, fails closed on a tampered cache/write-policy record before creating any mapping, and validates active hostile handle identities for wrong owner generation, wrong mapping generation, wrong mapping id, wrong BAR, and wrong BDF/device with exact pure-validator reasons while preserving the attached record and blocking mapping/doorbell side effects. Its serialized cap call path also decodes to the direct DeviceMmio.info no-authority labels plus numeric device/BDF, owner, BAR, mapping id, and mapping generation identity fields with host physical exposure off and direct MMIO blocked, and its stale serialized method-0 path fails as invoke-failed. The DMAPool.info skeleton has the same kernel-side serialized stale failure evidence. The interrupt handoff proof now also constructs a result-only Interrupt.info cap skeleton from the manager-issued device handle and attached route record, records active info success, proves the serialized cap call path decodes to the direct no-authority labels plus numeric device/BDF, owner, source, source generation, and route generation identity fields, proves those source and route generations are distinct in the bounded route record, and proves stale-after-revoke info fails closed as interrupt-stale-handle plus stale serialized method-0 failure as invoke-failed before any acknowledgement, mask, unmask, blocking wait, or delivery authority exists. The manifest-granted skeleton now also exposes an admission-only Interrupt.wait method that returns the pending-token validator’s fail-closed labels without waking a waiter or changing delivery counts, and an admission-only Interrupt.acknowledge method that validates the active route while blocking hardware acknowledgement and preserving delivery counts. It also exposes route-state-control Interrupt.mask and Interrupt.unmask methods that validate the active route before changing the manager-attached dispatch slot between claimed-masked and driver-unmasked, while preserving delivery counts. A bounded Interrupt.wait call observed after unmask installs a fixed-table userspace waiter object for the current manager-granted route; the existing route-dispatch delivery counter can now complete that waiter as interrupt-delivered / waiter-completed-irq with real_interrupt_delivery=delivered and an advanced delivery count. The same focused smoke then submits a second unmasked wait, observes it remains pending, calls Interrupt.mask, and finishes that wait as interrupt-waiter-cancelled / route-masked / waiter-completed-no-irq with wake_blocked=false, preserved source/route generations, and unchanged delivery counts. The selected provider TX tx_interrupt cap can now observe the bounded used-ring completion event described above and account the already observed selected TX dispatch token paired with that delivered provider CQ event, but hardware MSI/MSI-X programming beyond the selected vector-control proof, full hardware IRQ ownership, deferred EOI, LAPIC/MSI-X acknowledgement, and broader production interrupt dispatch remain blocked. Provider TX MSI-X mask/unmask is limited to the selected-route vector-control proof described earlier. Provider RX MSI-X mask/unmask remains bounded to the selected RX route as well; release while masked restores that selected vector-control bit and route state before clearing the live issue gate. RX unmask admits the route transition before exposing the MSI-X vector-control bit, and the focused QEMU proof shows a failed route unmask leaves the selected vector masked with the route ledger preserved. Cleanup failure still leaves the issue uncleared so future RX cap issuance stays blocked on uncertain route state. RX wait/ack is now bounded to one selected-route zero-CQ dispatch token; RX descriptors and CQ ownership remain blocked. This is manager-record skeleton/no-production-DMA, no-real-MMIO-mapping, and bounded route-dispatch interrupt-waiter prerequisite evidence only. Production DMAPool, DeviceMmio, and Interrupt capability handles, production userspace DMAPool buffer handles, real DeviceMmio BAR mapping objects, real cache attributes/write policy enforcement, production kernel device-path wiring beyond the current proof adapters, real device-manager page quiesce/scrub/release hooks and real page cleanup/reuse beyond the bounded kernel-owned proof pages, production handle-attached budget/OOM enforcement beyond the current manager-owned DMAPool.allocateBuffer budget slice, IOMMU remapping domains, production handle-attached host tests, QEMU stale-handle smokes, broader userspace exposure, production NIC/storage migration, cloud readiness, and S.11.2 hostile smokes remain open.

Do not weaken the short-term virtio-net bounce-buffer path until DMAPool, DeviceMmio, Interrupt, device-manager ownership transactions, lifecycle teardown, accounting, and hostile smokes all exist.

Keyboard shortcuts

capOS Documentation