DMA Isolation Design
Security Verification Track S.11 gates PCI, virtio, and later userspace device-driver work on an explicit DMA authority model. The immediate goal is narrow: let the kernel bring up a QEMU virtio-net smoke without creating a user-visible raw physical-memory escape hatch.
Short-Term Decision
Use kernel-owned bounce buffers for the first in-kernel QEMU virtio-net smoke.
The first virtio-net smoke stays on this conservative path:
- kernel-owned DMA pages
- kernel-owned virtqueue descriptor tables
- kernel-owned packet buffers
- kernel-programmed physical addresses
- copied packet bytes delivered to the network stack
- no DMA buffer capability exposed to userspace
- no physical address exposed to userspace
- no virtqueue pointer exposed to userspace
- no BAR mapping exposed to userspace
The kernel allocates DMA-capable pages from its own frame allocator, owns the virtqueue descriptor tables and packet buffers, programs the device with the corresponding physical addresses, and copies packet payloads between those buffers and the networking stack.
This is deliberately conservative:
- It works before ACPI/DMAR or AMD-Vi parsing, IOMMU page-table management, MSI/MSI-X routing, and userspace driver lifecycle supervision exist.
- It keeps all physical-address programming inside the kernel, where the same code that allocates the frames also bounds the descriptors that reference them.
- It does not make the current
FrameAllocatororMemoryObjectcapability part of the DMA path.FrameAllocatorno longer exposes raw physical addresses, but DMA still needs device-owned buffer objects with IOVA and reset/revoke semantics rather than repurposed general memory caps. - It gives the smoke a disposable implementation path. When NIC or block
drivers move to userspace, bounce-buffer authority becomes a typed
DMAPoolobject instead of an ad hoc physical-address grant.
An IOMMU-backed DMA-domain model remains the target for direct device access from mutually untrusted userspace drivers, but it is not a prerequisite for the first QEMU smoke. Without an IOMMU, a malicious bus-mastering device can still DMA to arbitrary RAM at the hardware level; the short-term smoke assumes QEMU-provided virtio hardware and protects against confused or untrusted userspace, not hostile hardware.
IOMMU Staging
IOMMU support is a deferred-with-known-dependency prerequisite for production hardware claims and for moving direct DMA-capable NIC or block drivers into userspace. capOS now discovers bounded ACPI IOMMU table summaries for Intel DMAR and AMD-Vi/IVRS and records static DMAR DRHD include-all or single-hop PCI endpoint device-scope coverage for retained DMA-capable PCI diagnostics functions. Bridge and multi-hop scopes are retained for diagnostics but do not prove endpoint attachment until PCI topology traversal exists, and include-all fallback fails closed when retained DRHD units or scopes are capped.
The selected QEMU Intel remapping path now programs VT-d root/context and
second-level tables for manager-owned DMAPool pages, reports bounded fault
state, exports only domain-scoped IOVAs, and proves two claimed DMA-capable
functions receive distinct per-device domains and second-level roots. It also
asserts the production-path S.11.2 hostile-smoke matrix over the active
DMAPool / DMABuffer ledger. The decomposed integration umbrella for this
path closed 2026-05-23 23:35 UTC
(ddf-iommu-remapping-production-closeout).
This is still QEMU-only evidence for the
selected path, not a general production
hardware-isolation claim: trusted sharing groups, AMD-Vi programming, and
production NIC/block userspace driver authority remain future work, and VM
shapes without usable remapping hardware remain on the explicit bounce-buffer
fallback.
The discovery parser is intentionally shallow and follows the static-table
formats documented by the Intel VT-d architecture specification, the AMD IOMMU
specification, and QEMU’s q35-only -device intel-iommu emulation:
Future real remapping work is grounded by the primary-source IOMMU remapping research note, which records Intel VT-d, AMD-Vi, and QEMU sections relevant to table programming, invalidation, fault/status diagnostics, and QEMU-only smoke tests. That note is source grounding only; it does not make the current diagnostics path a real remapping implementation.
- Intel VT-d architecture specification: https://www.intel.com/content/www/us/en/content-details/671081/intel-virtualization-technology-for-directed-i-o-architecture-specification.html
- AMD IOMMU specification: https://docs.amd.com/v/u/en-US/48882_IOMMU
- QEMU manpage: https://www.qemu.org/docs/master/system/qemu-manpage.html?highlight=numa
The staged implementation order is:
- Discover firmware IOMMU topology from ACPI static tables and fail closed if the tables are malformed, unsupported, or inconsistent with the PCI root complex being used. This first bounded table-discovery step is implemented for DMAR/IVRS summaries only; domain attachment is still planned.
- Record each DMA-capable PCI function’s attachment to an IOMMU unit, or explicitly keep the function on the prototype bounce-buffer-required policy when no trusted IOMMU domain can be created. This reporting step is implemented for retained PCI diagnostics functions when DMAR DRHD include-all or single-hop PCI endpoint device-scope metadata proves PCI segment/BDF coverage. Bridge and multi-hop scopes are not treated as attachment proof until PCI topology traversal exists, and include-all fallback fails closed when retained DMAR coverage metadata is capped; trusted domain creation is still planned.
- Define and prove the claimed-device domain policy: one device-manager-owned DMA domain per claimed device or trusted sharing group, with all exported device addresses represented as IOVAs scoped to that domain rather than host physical addresses. The selected QEMU Intel path now implements the per-device form for two claimed DMA-capable functions; trusted sharing groups remain disabled and out of scope.
- Attach
DMAPoolallocation, descriptor validation, MMIO ownership, interrupt ownership, and revocation state to the same device-manager ledger before any doorbell write can make a descriptor visible to hardware. - On revoke, reset, or driver death, stop new submissions, remove or invalidate IOMMU mappings before page reuse, and flush the relevant IOTLB state where the hardware model requires it.
Until those gates exist, direct DMA and userspace driver handoff remain blocked. Devices that cannot be placed in a trusted IOMMU domain must stay on kernel-owned bounce buffers or remain unsupported for production claims. This also affects the hostile-smoke gate: S.11.2 smokes must prove that stale DMA handles, stale completions, reset races, and teardown ordering fail closed for IOMMU-backed IOVA mappings, while the process-exit / exit-under-DMA rows remain covered by the selected backend evidence before a cloud or hardware driver can be treated as isolated from the rest of memory.
Fallback Policy For No Usable IOMMU Exposure
Some providers or VM shapes may not expose remapping hardware that capOS can trust. That includes absent, malformed, unsupported, capped, or incomplete DMAR/IVRS metadata; scopes that require PCI topology traversal capOS has not implemented yet; and platforms where remapping hardware is unavailable or cannot be programmed safely. Those shapes use a fail-closed fallback policy:
- Direct device DMA remains blocked.
direct_dma_trusted_domainsstays zero andremapping_tablesstaysnot-programmed. - Prototype devices that remain enabled use kernel-owned bounce buffers only.
The kernel or device manager owns the pages, descriptor validation,
physical-address programming, and packet or block-data copies between
device-visible memory and non-device memory. General
FrameAllocatorandMemoryObjectcapabilities are not DMA authorities. - capOS does not expose direct hardware authority for userspace
DMAPool,DMABuffer,DeviceMmio, orInterruptin the fallback shape. Result-only.infoskeletons and bounded manifest grants may report conservative status. The currentDMAPoolmanifest grant may allocate and free eight fixed manager-attached, kernel-owned, single-page bounce-bufferDMABufferresult caps, with backing pages scrubbed before frame release and no host physical address or IOVA exposed. That narrow fixed-slot allocation/free authority does not map DMA, program device-visible addresses, publish arbitrary CQ entries, program IOMMU/remapping tables, access arbitrary BAR registers or doorbells, or own hardware interrupt acknowledgement, mask, or unmask. The selected provider-TX proof is the current bounded exception: after the same manager-ownedDMABufferauthority and bounce-scrub gates, queue1may publish the full selected TX queue-depth descriptor/avail window into the existing kernel-owned virtio-net TX ring before the first completion, ring one selected notify doorbell per accepted provider descriptor through the live no-writenotify_mmiopolicy, and hand those bounded completions back through descriptor/generation-matchedDMABuffer.completeDescriptorplus livetx_interrupt.waitcompletion events. The same selected path can also usetx_interrupt.mask/unmaskto toggle only the selected TX MSI-X table vector-control bit and matching route state after live issue-id and route validation, and can retire one deferred LAPIC EOI for each delivered selected TX used-ring completion event, withInterrupt.acknowledgereturning ABI-visible provider CQ/ack ledger fields plus hardware dispatch ack count, delta, token, and mutation flag for that bounded pairing. Full-queue QEMU bursts that coalesce selected TX MSI-X delivery use a boundedINT $vectorproof hook only while the virtio TX completion path has an active full-window coalescing budget, so the selected IDT handler and deferred-EOI path remain observable without claiming full production IRQ ownership. Successful selected queue1DMABuffer.completeDescriptor,tx_interrupt.wait, andtx_interrupt.acknowledgeresults also carry bounded CQ event identity: sequence, queue, descriptor id, slot, slot generation, software descriptor generation, completion length, provider issue id, source id/generation, and route generation. Pre-event, duplicate ack, masked-route ack, wrong-order completion, teardown-drain, stale issue after release/regrant, reset, and stale-after-release paths keep that identity empty and do not mutate the bounded identity queue. Provider TX release also retires delivered but unacknowledged bounded CQ events for the live issue before clearing that issue: the stale post-release ack path is revoked, and the release proof records seven pending provider completion acks and their deferred EOIs as release-retired. The same selected path also has a bounded teardown-only drain for seven incomplete provider-published TX descriptors while one completed descriptor remains live: release may explicitly drain only the incomplete matching used-ring entries, retire those allocation-backed device-DMA TX queue ledgers, and free only after manager in-flight state is drained, without publishing provider CQ/IRQ events or issuingDMABuffer.completeDescriptorresults. The paired provider RX bootstrap grant can now validate the live RX issue and selected virtio-net RX route before toggling only the selected RX MSI-X table vector-control bit and route state, and it can complete one selected-route RXInterrupt.waitafter a delivered RX MSI-X/LAPIC dispatch. The pairedInterrupt.acknowledgeaccounts exactly one RX dispatch token and retires one deferred LAPIC EOI for that delivered zero-CQ RX event; pre-event, masked-route, duplicate, and stale-after-release paths fail closed without mutating delivery or acknowledgement state. RX descriptor accounting and RX CQ ownership remain bounded to the synthetic proof path, and full hardware IRQ ownership remains blocked. These exceptions do not transfer full virtio-net ownership, direct DMA, IOMMU authority, arbitrary doorbells, production NIC/storage authority, or cloud readiness. - capOS does not claim hostile-hardware isolation for those shapes. A malicious or compromised bus-mastering device without a trusted remapping domain can still write arbitrary RAM at the hardware level. The fallback is acceptable only for prototype devices and trusted emulator or provider shapes where that hardware threat is outside the claim; otherwise the device remains unsupported.
- Before any userspace driver path can rely on DMA or IRQ authority, S.11.2 hostile smokes must pass for the selected backend. That includes stale DMA handles, stale completions, descriptor abuse, revoke/reset races, stale IRQs, teardown-under-DMA for IOMMU-backed IOVA mappings, and exit-under-DMA for the fallback bounce-buffer path when the fallback is used.
This fallback policy is separate from current diagnostics-only IOMMU metadata
coverage and from future real remapping-domain integration. Diagnostics can
report static firmware-table coverage for a PCI function, but unless capOS
creates a device-manager-owned remapping domain and programs mappings, the
active direct-DMA policy remains blocked. Future real integration must attach
DMAPool, DeviceMmio, Interrupt, ledger teardown, mapping removal or
invalidation, and required IOTLB flushes to the same ownership transaction
before a direct-DMA trusted-domain count can become nonzero.
DMA Assurance Model Checked Evidence And Cloud Backend Inputs
The DMA assurance model records the claim boundary and checked bounded evidence
for DMA authority; the cloud backend contract it feeds is authoritative and
lives in the “Cloud DMA Backend” section below:
DMA Assurance Model
and models/dma/. It is a design/evidence scaffold, not a new production
hardware gate by itself. The checked gates are make model-dma-tla,
make model-dma-alloy, make kani-dma-authority, and
make model-dma-deferred-completion-loom; make dma-assurance-model-check
aggregates them locally, while GitHub CI runs the Alloy/TLA+/Loom gates in
dma-assurance-models and the Kani gate in kani-proofs. The
operationalization track that reconciled the skeletons against landed DMA code
is tracked in
Security and Verification
(“DMA Assurance Model Operationalization”).
Cloud NIC/storage work must use the model as the checklist for backend selection. Backend selection is a runtime, fail-closed decision the kernel makes on each boot, with an optional operator override declared in the system manifest; it is not a per-VM-shape safety assertion that a person signs off. The authoritative selection rule and the manifest override contract are defined in the “Cloud DMA Backend” section below.
Cloud backend evidence must separate provider-side DMA isolation from guest-controlled remapping authority. SR-IOV, virtual NIC, GPU, accelerator, or local NVMe support can identify a DMA-capable surface, but it is not enough to claim direct-DMA isolation. A direct-remapping backend needs guest-visible IOMMU or equivalent translation authority that capOS can discover and program. The cloud evidence matrix must record provider API or documentation sources, retrieval date, region or zone, instance type, image and kernel, live guest PCI/device probes, IOMMU table/group observations, and maintenance or device revocation behavior as the support-policy record for advertised targets. The runtime probe, not this matrix, makes the binding per-boot selection.
The matrix does not replace runtime selection. capOS must choose the safest
backend on each boot from what it can actually observe and validate. Direct
remapping is enabled only when guest-programmable remapping authority is
present and passes the selected self-tests. A provider-remapped or bounce path
is selected only when direct DMA remains blocked and device-visible memory can
stay manager-owned. Ambiguous, contradictory, or unvalidated observations select
Unsupported.
The backend candidates are:
- Direct remapping domain. The provider shape must expose guest-programmable remapping hardware; capOS must discover and program a device-manager-owned domain for the target device; descriptor publication must be ordered after mapping; and teardown must remove mappings, observe required invalidation completion, and scrub before page reuse. The selected path must carry stale-handle, stale-completion, descriptor-abuse, revoke/reset-race, teardown-under-DMA, no-host-physical-exposure, and cross-domain alias evidence.
- Labeled bounce-buffer fallback. Direct DMA stays blocked, device-visible
memory remains manager-owned bounce pages, host physical addresses and
generic
MemoryObjectauthority stay hidden from the driver, and stale handle/completion/teardown evidence covers the selected fallback. This path must keephostile_hardware_isolation=not-claimedunless separate per-domain remapping evidence justifies a stronger provider-specific claim. - Unsupported. Devices whose DMA behavior cannot satisfy either candidate stay unbound or disabled. A serial boot result or PCI enumeration line is not enough to claim cloud NIC/storage readiness.
Downstream cloud driver preflights must declare the candidate backend and map their evidence to the assurance model’s invariants: no host-physical exposure, mapping before publication, no page reuse before teardown, stale-handle and stale-completion fail-closed behavior, domain-scoped aliasing only, bounded fail-closed holds, and explicit backend evidence. The evidence matrix is a support-policy record of advertised targets; the runtime probe, not the matrix, selects the backend on each boot.
Cloud DMA Backend
This section is the authoritative contract for how capOS selects a DMA backend for cloud NIC/storage devices. Selection is a runtime, fail-closed decision the kernel makes on each boot from what it can actually probe and validate, with an optional declarative override in the system manifest. There is no human sign-off in the selection path: the runtime probe decides by default, and the manifest override is config that an operator sets for a deployment, not a doc-signing ritual gated on any specific person. Downstream cloud NIC/storage driver slices consume this contract directly as their DMA-backend authority.
The preceding “DMA Assurance Model Checked Evidence And Cloud Backend Inputs” section defines the three backend candidates; this section adds the per-candidate trade-off analysis, the runtime selection rule, the manifest override field, and the downstream-contract scaffolding that a cloud NIC/storage driver declares. The research substrate is the provider evidence inventory Cloud DMA Provider Evidence Inventory, and the invariants and tool mapping are in DMA Assurance Model.
Provider-Written Addresses And No-IOMMU Brokered Bounce
Two DMA-address ownership models can be valid, but they do not apply to the same backend.
- Provider-written, kernel-validated addresses (the NVMe Model B validator) are valid only when the provider’s device-visible address is not a host physical address: a verified direct-remapping/vIOMMU domain-scoped IOVA, or a future synthetic software address namespace that the manager translates before hardware sees it.
- Brokered address publication is the no-IOMMU bounce-buffer model. The provider may own protocol state and buffer capabilities, but the kernel or device manager writes device-visible queue-base, PRP/SGL, or virtqueue address fields because those values are host physical or bus addresses on current no-IOMMU hardware.
Correction recorded 2026-05-27: the earlier reconciliation that treated a
no-IOMMU bounce window as a provider-visible, non-host-physical device address
space is not valid for the current implementation. On the run-pci-nvme
no-IOMMU shape, DeviceDmaAllocation carries host physical pages and the
reviewed IOVA export discipline keeps userspace IOVA/host-physical export
disabled. Therefore a provider-written NVMe queue base or PRP on that gate would
export a host physical address, violating the no-host-physical-exposure
invariant. A bounce buffer protects data ownership and copy discipline; it does
not create an untrusted-driver-safe IOVA namespace by itself.
The kernel on-notify DMA validator
(kernel/src/cap/nvme_doorbell_validator.rs, validate_doorbell_scan) remains
useful evidence for the provider-written model. On a queue-arm/CC.EN write and
on an SQ tail doorbell it scans the device-visible addresses the provider
published (queue bases; PRP1/PRP2 and one level of PRP-list indirection) and
fails closed before the doorbell takes effect on any address that is not wholly
within a window granted to the doorbell claim’s owner at the live generation:
out-of-window, host-physical, cross-owner-alias, region-overrun, unaligned,
deeper-than-one-level PRP chain, or stale generation (ScanReject). Owner
identity and live generation come from the grant ledger, never from
provider-supplied metadata. A completion whose submission scan was never
validated, or was validated under a now-retired generation, does not wake a
waiter (completion_wakes_waiter), matching the stale-completion gate on the
virtio-net path. That mechanism is the right fit for the QEMU direct-remapping
lane and any future cloud shape that exposes guest-programmable remapping.
For the current GCP/no-IOMMU target, the storage path must use brokered bounce:
userspace supplies typed commands, queue ownership intent, and live
DMABuffer/buffer-cap handles; the manager materializes the actual
device-visible queue-base and PRP/SGL fields and orders publication, teardown,
copy, and scrub. That still leaves protocol-specific NVMe logic in userspace,
but it does not let userspace author raw device addresses.
The brokered admin-queue enable landed 2026-05-27
(nvme-no-iommu-brokered-controller-enable,
device_manager::nvme_brokered_admin_queue_enable). The provider allocates the
admin submission/completion queue pages through its DMAPool cap and requests
enable through the CC selected-write claim (CC.EN set); the manager resolves
those pages from the live ledger (proof_buffers slots 0/1), validates the
authored bases through validate_doorbell_scan (ScanKind::QueueArm), and
authors AQA/ASQ/ACQ plus the CC.EN write itself. The provider never
receives the host physical / device-visible queue-base address; CSTS.RDY=1 is
observed only through brokered reads. This is the brokered model applied to the
admin queue-arm; the steady-state SQ-tail doorbell over provider-written PRPs
still needs the direct-remapping/synthetic-address lane above. Proof
make run-pci-nvme; provenance docs/devices/nvme.md §6.
Provider-Side Isolation Versus Guest-Programmable Remapping
The decisive distinction for backend selection is between a DMA-capable surface and guest-programmable remapping authority:
- SR-IOV (AWS ENA, Azure Accelerated Networking VF), a virtual NIC (gVNIC, virtio-net), a GPU/accelerator, or local NVMe identifies a device that does or could bus-master. This is a DMA-capable surface, not a safety property.
- A direct-remapping classification requires a usable Intel VT-d, AMD-Vi, or Arm SMMU unit that the guest can discover, program, and validate, with translation/fault/invalidation behavior matching IOMMU Remapping Grounding. A DMA-capable surface alone never implies this.
- Provider-side isolation facts (host-enforced VPC isolation, Nitro/host data-path bypass, hypervisor-side IOMMU) are support-policy assumptions a guest cannot prove from inside, not evidence that capOS can safely program direct DMA.
Runtime probing is authoritative for selecting the safe backend on a particular
boot: capOS chooses from the device inventory, the remapping authority it can
actually program, driver self-tests, and fail-closed probe results, and unknown
or contradictory observations select the labeled bounce-buffer path or
Unsupported. The cloud VM evidence matrix is the separate support-policy
record for advertised targets and provider assumptions a guest cannot fully
prove by itself; it does not override the boot-time runtime selection.
Candidate Trade-Off Analysis
| Dimension | Direct remapping domain | Labeled bounce-buffer fallback | Unsupported |
|---|---|---|---|
| IOMMU coverage requirement | Requires a guest-programmable VT-d/AMD-Vi/SMMU unit capOS can program and validate per device. | None: used precisely when no usable guest IOMMU is exposed. | N/A: device stays unbound. |
| Cloud VM shape coverage (per inventory) | No probed GCE shape exposes a guest-programmable IOMMU; AWS/Azure shapes not yet probed. So no probed shape currently qualifies. | Indicated for shapes with a DMA-capable surface but no guest IOMMU (the probed GCE rows); fail-closed default for unproven shapes with a manager-ownable surface. | Ambiguous, contradictory, or unvalidated observations. |
| Per-operation cost | Translation only; no data copy. IOTLB/context-cache invalidation on teardown. | Copy between device-visible bounce pages and non-device memory on every transfer; Confidential VMs force this in hardware regardless. | None. |
| Hostile-smoke coverage today | Bounded QEMU Intel path only (make run-iommu-remapping, ddf-iommu-remapping-production-closeout); no cloud guest-IOMMU evidence. | S.11.2.7/8/9 rows enforced by the make run-net gate (tools/qemu-net-smoke.sh), with bounce-buffer virtio-net provider evidence in ddf-provider-virtio-net-driver-closeout; bounce-buffer DMAPool lifecycle by make run-dmapool-grant. The GCP-shape local binding precursor (cloud-gcp-virtio-net-local-qemu-binding) asserts, in both make run-net and make run-ddf-provider-consumer, that the enumerated/bound function matches the documented GCP 1st/2nd-gen virtio-net device surface (standard virtio-net, vendor 0x1af4) and that the resolved backend is this labeled bounce-buffer path; it does not claim live GCP enumeration. | N/A. |
| Hostile-hardware isolation claim | Claimable only with per-domain remapping evidence and the IOMMU hostile smokes; not yet established for any cloud shape. | not-claimed: a malicious bus-mastering device without a trusted remapping domain can still write arbitrary RAM. | N/A. |
The GCE live-probe rows in the evidence inventory record that every probed GCE
shape (1st-gen n1, 2nd-gen e2, 3rd-gen Intel c3, and AMD-SEV Confidential
n2d) boots with intel_iommu=off, DMAR: IOMMU disabled, SWIOTLB software
bounce buffering, empty /sys/kernel/iommu_groups, and no DMAR/IVRS/IORT
table; the Confidential shape forces bounce buffering as a memory-encryption
invariant. These rows are support-policy expectations, not a hardcoded
selection table: they describe what capOS’s runtime probe should expect to find
on those shapes today, and they confirm that the fail-closed default lands on
the labeled bounce-buffer path there. The boot-time probe, not this matrix,
makes the binding selection on each boot, so a shape whose IOMMU exposure
changes is handled by the probe re-evaluating rather than by editing this text.
AWS and Azure shapes carry no live-probe evidence yet; the probe treats them
the same as any other unproven platform and defaults to the bounce-buffer
path.
Runtime Selection Rule (Fail-Closed Default)
On each boot, capOS probes the platform for guest-programmable remapping authority — IOMMU presence, programmability, and coverage for the DMA-capable functions it intends to bind — and selects the backend fail-closed:
- Probe the platform. Discover DMA-capable functions, then test whether a usable Intel VT-d, AMD-Vi, or Arm SMMU unit is present, discoverable, and programmable, and whether its translation/fault/invalidation behavior passes the self-tests in IOMMU Remapping Grounding.
- Select fail-closed. Select the direct remapping domain backend for a device only when the probe positively verifies a usable+safe IOMMU for that device. If the probe cannot verify it — IOMMU absent, not programmable, coverage unproven, self-test failed, or observations ambiguous — select the labeled bounce-buffer fallback for any DMA-capable surface the manager can keep manager-owned, or Unsupported when even that cannot hold. This probe-gated rule is the default: on an unproven platform the probe cannot verify, so direct DMA is not used and the bounce-buffer path is chosen.
- There is no human in this loop. The machine decides per boot; the only external authority is the optional manifest override below.
This is the boot-time authority. The cloud VM evidence matrix above is the support-policy expectation of what the probe should find, not the decision itself.
Manifest Override Field (Operator Authority Lever)
An operator can override the runtime default for a deployment through one
declarative, auditable enum field in the system manifest’s kernel parameters:
the dmaBackendPolicy field of the SystemConfig struct in
schema/capos.capnp. It is config, not a doc-signing ritual, and is not gated
on any specific person. The field is absent by default, and the absent default
applies the probe-gated runtime selection rule above. The enum values and their
interaction with the probe result are:
| Value | Probe verifies usable+safe IOMMU | Probe cannot verify | Notes |
|---|---|---|---|
| (field absent) | direct remapping domain | bounce-buffer fallback | Default: the probe-gated runtime selection rule. Direct-DMA when the probe verifies a usable+safe IOMMU, bounce-buffer fallback otherwise (fail-closed). Identical to enable-if-verified. |
enable-if-verified | direct remapping domain | bounce-buffer fallback | The explicit, auditable form of the default. Probe-gated direct-DMA with fail-closed bounce-buffer fallback. Redundant with the absent default but kept for explicit configuration. |
enable-unsafe | direct remapping domain | direct remapping domain | Force direct-DMA even when the probe cannot verify it. The operator takes responsibility for the platform’s DMA isolation; the value name carries the warning. Use only on a platform whose isolation is known-good out of band. |
bounce-buffer | bounce-buffer fallback | bounce-buffer fallback | Pin the labeled bounce-buffer path and disable direct-DMA entirely, even where the probe would verify a usable IOMMU. The most conservative value. |
Selection rules that hold for every value:
- The absent default and
enable-if-verifiedselect direct DMA only when the probe verifies a usable+safe IOMMU, and otherwise fall back to bounce-buffer. enable-unsafeis the sole value that can pick direct DMA without probe verification. The value name is the acknowledgement; there is no separate per-shape “I-accept-unverified” ceremony.bounce-buffernever selects direct DMA, even where the probe would verify a usable IOMMU.- When the selected backend is
Unsupportedfor a device (no manager-ownable DMA-capable surface at all), the device stays unbound regardless of the override value. The override governs direct-vs-bounce, not whether an unbindable device is forced online.
This selection mechanism is implemented. The dmaBackendPolicy capnp enum
encodes an absent field as ordinal 0 (unspecified), which decodes identically
to enable-if-verified; an unrecognized ordinal decodes fail-closed to the
bounce-buffer path (never direct DMA) rather than failing the manifest parse or
honoring the probe-gated default. The kernel resolves the backend on each boot
from the IOMMU probe verdict and this override and emits a boot proof line of
the form dma: backend selection dma_backend=<direct-remapping|bounce-buffer> dma_backend_override=<absent|enable-if-verified|enable-unsafe|bounce-buffer> probe_verified_usable_iommu=<bool>. The bounded QEMU shapes prove the
probe-gated default end-to-end: make run-iommu-remapping (verifiable Intel
VT-d shape) records dma_backend=direct-remapping dma_backend_override=absent,
and make run-dmapool-grant (no usable IOMMU) records
dma_backend=bounce-buffer dma_backend_override=absent. The override values and
the unknown-ordinal fail-closed decode are covered by cargo test-config over
the shared selection rule, capnp round-trip, and CUE decode.
The production cloud (non-qemu) build emits the same backend selection on
the cloudboot harness’s serial-port-1 path as
cloudboot-evidence: dma-backend bounce_buffer (in the harness token
namespace), and on the bounce-buffer verdict drives the production-path
bounce-buffer DMAPool + DMABuffer grant proof
(kernel/src/cap/dmapool_bounce_buffer_grant_proof.rs, called from
kernel::run_init): stage a parked manager-attached DMAPool over one
DMA-capable PCI function from the inventory through
device_manager::stage_bounce_buffer_dmapool_record
(kernel/src/device_manager/stub.rs), allocate one bounded bounce-buffer
DMABuffer through
device_manager::issue_manager_attached_dmabuffer_handle_with_request
(which calls
device_dma::allocate_manager_attached_dmapool_bounce_buffer_page),
assert the cap-info labels (userspace_dmapool=manager-issued-bounce-buffer,
allocation=single-bounce-buffer-page, real_dma=not-attempted,
direct_dma=blocked, host_physical_user_visible=0,
iova_export=disabled-future-only), quiesce-before-release
(release_dmapool_record_for_cap_release returns pending-buffer-release
while the buffer is live), scrub-before-reuse (the released bounce-buffer
frame is zeroed in place between scrub and frame-free), and
stale-handle-after-detach, then emit
cloudboot-evidence: dma-pool-grant <token> with shape
<seg>.<bus>.<dev>.<fn>-pool.<slot>.gen.<gen>-phys.<hex> (every character
inside the harness grammar [A-Za-z0-9._-]+). The proof stages a pool,
allocates one bounce-buffer page, asserts the invariants, and emits a marker:
it does not program real DMA, attach a queue, program interrupts, claim a
device for sustained ownership beyond the grant, or emit
provider-nic-bound / storage-bound. A future direct-remapping verdict
skips the proof rather than aliasing direct-DMA onto the bounce-buffer
assertion shape.
Implementation note, 2026-05-29 11:50 UTC: the production-cloud bounce-buffer stub
now implements the cap-side DMABuffer.map(R+W) / DMABuffer.unmap
admission and state-machine entry points
(validate_dmabuffer_map_admission, record_dmabuffer_user_mapping,
begin_dmabuffer_user_mapping_unmap, restore_dmabuffer_user_mapping,
clear_dmabuffer_user_mapping in kernel/src/device_manager/stub.rs) so a
userspace holder of a manager-issued DMABuffer cap can map the single
bounce-buffer page read/write, write payload bytes through that mapping,
unmap it, and observe DMAPool.info.mapped_vmas reflect the live
mapping (make run-cloud-dmapool-grant). The same Absent -> Mapped -> Unmapping -> Absent state machine the QEMU side enforces governs the
parked slot: a duplicate map fails closed with the DmaPoolLive shape,
a clear before the cap-side aspace unmap returns
DmaPoolTeardownEvidenceInvalid, an identity-mismatched begin/clear
fails closed, and a post-freeBuffer map fails closed with the
DmaBufferStaleHandle shape. The manager remains the single owner of the
bounce-buffer page’s device-visible host-physical address and IOVA: the
mapping does not expose either to userspace, real DMA stays
not-attempted, direct DMA stays blocked, and IOVA export stays
disabled-future-only. This is a local-QEMU proof of the userspace mapping
path only; it does not unlock a live cloud NIC bind, IOMMU programming, or
production direct-DMA authority.
Implementation note, 2026-06-03: Phase C slice 2
(cloud_virtio_net_userspace_ownable_vring_proof,
make run-cloud-prod-nic-driver-userspace-ownable-vring) wires this landed
bounce-buffer authority to a userspace virtio-net driver’s own vring without
adding a new isolation backend. The driver allocates its descriptor / available /
used ring pages through a granted DMAPool, and the kernel programs the
virtio queue-address registers (queue_desc / queue_driver / queue_device)
with the manager-owned bounce host-physical address. The brokered-address-
publication model (above) holds: the driver never authors a device address. It
writes an opaque per-buffer device-usable handle (exported through
DMABuffer.info.deviceIova with scope bounce-handle, a deterministic
non-address encoding of the buffer’s manager identity), and the kernel resolves
that handle against the live grant ledger
(device_manager::stub::resolve_virtio_vring_device_address) to the real
host-physical address before any MMIO write. The no-host-physical-exposure
invariant is preserved end-to-end: host_physical_user_visible=0,
iova_export=disabled-future-only, and reads of the queue-address base
registers are refused so the resolved address is never read back into userspace.
Out-of-grant, host-physical-looking, and stale-generation handle writes fail
closed, mirroring the NVMe doorbell-scan reject classes. queue_enable /
DRIVER_OK stay fail-closed (slice 3); this is a local-QEMU proof of the
userspace-ownable-vring path only, not a live cloud NIC bind.
Downstream-Contract Scaffolding
A cloud NIC/storage driver declares its chosen backend through the device-manager policy fields that already label the local paths in this design. The contract below scaffolds the values each candidate requires; it is the shape a driver preflight declares, not an authorization to enable any value.
| Policy field | Direct remapping domain | Labeled bounce-buffer fallback | Unsupported |
|---|---|---|---|
direct_dma | enabled for the programmed per-device domain | blocked | blocked |
trusted_domain | manager-owned domain id for the device | none | none |
bounce_buffer | not required (mapped IOVA path) | required | N/A (device unbound) |
remapping_tables | programmed | not-programmed | not-programmed |
hostile_hardware_isolation | claimed only with per-domain remapping evidence + IOMMU hostile smokes | not-claimed | N/A |
exported_device_addresses | iova-only, domain-labeled | none (no host physical or IOVA exposed) | none |
Each candidate must satisfy the following gates, mapped to the assurance-model invariants in DMA Assurance Model:
- Direct remapping domain — mapping before publication; no page reuse
before teardown, with mapping removal and observed invalidation completion
ordered strictly before scrub/reuse; stale-handle and stale-completion
fail-closed; domain-scoped aliasing only (per-device context entries where a
peer domain cannot resolve another domain’s IOVA); no host-physical exposure
(export only the domain-scoped IOVA, labeled as meaningless outside the
domain); backend evidence explicit. The only landed remapping-domain evidence
is the bounded QEMU Intel path in
ddf-iommu-remapping-production-closeout, exercised bymake run-iommu-remapping. A cloud shape needs its own guest-programmable remapping evidence before this candidate applies to it. - Labeled bounce-buffer fallback —
direct_dma=blocked; all device-visible memory is manager-owned bounce pages; no host physical address and no genericMemoryObject/FrameAllocatorauthority is exposed to the driver; stale-handle, stale-completion, and exit-under-DMA teardown fail closed;hostile_hardware_isolation=not-claimedstays explicit. The landed evidence is the S.11.2.7/8/9 hostile-smoke rows enforced by themake run-netgate (tools/qemu-net-smoke.sh), with the bounce-buffer virtio-net provider evidence inddf-provider-virtio-net-driver-closeout, plus the bounce-bufferDMAPoolgrant lifecycle inmake run-dmapool-grant. As ofddf-real-dma-s112-hostile-smokes, the S.11.2.7 stale-IRQ-after-reset and S.11.2.8 stale-DMA-completion-after-reset closure summaries are asserted on the dedicatedmake run-dmapool-grantgate as well, so the DMA-grant gate fails closed on a hostile-row regression without depending onmake run-net; exit-under-DMA (in-flight drain, scrub, page free) is enforced bymake run-dmapool-grant-exit. See the “Fallback Policy For No Usable IOMMU Exposure” and “Hostile-Smoke Acceptance Matrix” sections in this document for the full gate text. - Unsupported — the device stays unbound or disabled; no driver-visible DMA, MMIO doorbell, interrupt ownership, or storage/network readiness claim is made. A serial boot line or PCI enumeration line is not readiness evidence.
The models/dma/ TLA+ and Alloy files, the extracted Kani core, and the focused
Loom model are checked bounded evidence for these invariants. They supplement
the QEMU/host evidence above; they do not satisfy a candidate hardware or cloud
backend gate by the mere presence of model files. Each checked result records
its tool version, command, bounds, and output in
../models/dma/README.md, and the CI placement is
tracked in Security and Verification.
First QEMU Intel Remapping Smoke Acceptance Gate
Decision recorded 2026-05-14 09:07 UTC. The first slice that programs real
Intel VT-d remapping state under QEMU has an explicit, bounded acceptance gate.
This unblocks the
ddf-iommu-qemu-intel-remapping-smoke
task, whose ## Acceptance section carries the full gate; the summary here is
the design-level decision.
The first slice programs the minimum Intel VT-d path for exactly one selected
DMA-capable function under a pinned QEMU q35 -device intel-iommu,aw-bits=39
shape: one root entry, one context entry bound to a device-manager-owned domain
ID and a 39-bit second-level page-table root, and a single second-level mapping
from one device-visible IOVA page to one kernel-owned DMAPool page, plus the
root-table-address register write and the global-command/global-status
translation-enable handshake. Acceptance requires observable proof that the
mapped IOVA was translated and that an out-of-domain IOVA faults closed in the
fault-status/fault-recording registers.
Invalidation is part of the gate, not a follow-up. On revoke, device reset,
driver death, and DMAPool page release, the slice must remove the
second-level mapping and invalidate the relevant context-cache and IOTLB state
through the selected invalidation interface, observe invalidation completion,
and order page scrub/reuse strictly after that completion. This sits at the
existing “remove IOMMU mappings” and “scrub and free DMA pages” steps of the
DeviceOwnerState revocation order and must not be reachable before
QueuesQuiesced or a completed Resetting transition.
IOVA export discipline: host physical addresses stay hidden from userspace in
every result cap, diagnostic line, and audit record. The selected QEMU Intel
IOMMU-backed DMABuffer.info path may export only the domain-scoped IOVA for
the live mapped buffer generation, explicitly labeled as meaningless outside
that domain; fallback bounce-buffer paths keep IOVA export disabled.
Per-device domain granularity: the selected QEMU Intel path programs two distinct per-device context entries and second-level roots for two claimed DMA-capable functions under the same DRHD. Both domains may use the same IOVA, but the peer domain’s second-level walk is proven not to resolve that IOVA to the primary page; stale and wrong-owner domain assignment fail closed, and trusted multi-device sharing groups remain disabled.
The kernel-owned bounce-buffer fallback stays the path for VM shapes without
usable remapping hardware and must remain explicitly labeled
(fallback_policy=kernel-owned-bounce-buffer-only,
remapping_tables=not-programmed,
hostile_hardware_isolation=not-claimed); it must never be silently
reinterpreted as direct-DMA or hostile-hardware isolation. The IOMMU-backed
path adds stale-DMA-handle, stale-completion, descriptor-abuse,
revoke/reset-race, teardown-under-DMA, cross-domain stale-handle, and
fail-closed teardown branch hostile smokes while the existing bounce-buffer
exit-under-DMA and stale-DMA evidence (the device-dma stale-handle and
stale-completion proofs and the S.11.2.7/S.11.2.8 closure summaries) is
preserved unchanged.
Explicitly out of scope for this first slice: AMD-Vi table programming;
trusted multi-device sharing groups; scalable-mode context entries; interrupt
remapping and device-IOTLB options; 48-bit IOVA space / 4-level tables;
production NIC or storage driver ownership; userspace DMAPool direct-DMA
authority; and moving the live virtio-net path off bounce buffers. The
acceptance evidence is QEMU-only emulator evidence, not a hardware-isolation
claim. The smoke adds or selects a focused make run-iommu-remapping gate
asserted by tools/qemu-iommu-remapping-smoke.sh.
Implementation note, 2026-05-02 04:58 UTC: the ACPI discovery path recognizes DMAR and
IVRS in the root table walk, reports absent/valid/malformed/unsupported state,
records bounded table length/header facts, DMAR host address width and flags,
IVRS IVinfo/flags, and bounded remapping-structure type counts. Malformed DMAR
or IVRS structure lengths stop parsing, and unsupported shapes such as parser
scan-cap overflow leave direct_dma=blocked with bounce_buffer=required.
Implementation note, 2026-05-02 05:31 UTC: the attachment-policy slice also
retains DMAR DRHD include-all and bounded PCI endpoint device-scope metadata,
including segment, single-hop BDF, and remapping-hardware register base, and
reports each retained DMA-capable PCI function as IOMMU-attached/covered when
that static table metadata covers its segment/BDF. Bridge and multi-hop scopes
are diagnostic-only until PCI topology traversal can resolve them, and
include-all fallback fails closed when retained DRHD units or scopes are
capped. Functions without trusted static coverage are reported as uncovered;
covered functions are reported as attached/covered, but both paths keep
dma_policy=prototype-bounce-buffer-only, bounce_buffer_required, and
blocked_direct_dma_devices because remapping domains are unsupported. The
direct-DMA trusted-domain count remains zero, and userspace DMAPool,
DeviceMmio, and Interrupt authority remain unavailable.
Implementation note, 2026-05-02 07:27 UTC: the domain-policy staging slice adds
a pci: dma-domain policy proof line and a diagnostics mirror. The proof
reports the future domain owner as the device manager, the domain granularity
as per-device or trusted-sharing-group, exported device addresses as IOVA-only,
host_physical_user_visible=0, direct_dma_trusted_domains=0,
claimed_device_domains_ready=0, remapping_tables=not-programmed,
remapping_domains=not-started, userspace DMAPool/DeviceMmio/Interrupt
as not-started, and prototype devices as kernel-owned bounce-buffer-only.
Malformed, unsupported, absent, or retained-capped metadata leaves direct DMA
blocked; proof_result=ok is only evidence for that conservative
blocked-direct-DMA policy.
Implementation note, 2026-05-09 18:47 UTC: the blocked-direct-DMA admission
decision now lives in the pure capos-lib::device_authority validator next to
the DMA/MMIO/IRQ handle validators. Host tests cover the current all-prototype
bounce-buffer shape, fail-closed results if any direct trusted domain is
claimed before the policy is ready, fail-closed results if the prototype
bounce-buffer count does not cover every DMA-capable function, and the absent,
malformed, unsupported, and retained-capped metadata labels. The kernel PCI
proof line and diagnostics mirrors consume that pure decision while preserving
the existing direct_dma=blocked, remapping_tables=not-programmed,
domain_activation=not-started, and policy=blocked-direct-dma labels. This
is IOMMU/remapping groundwork only; it does not program remapping tables,
create trusted domains, expose host physical addresses, or enable production
userspace DMA authority.
Implementation note, 2026-05-02 15:29 UTC: the COM1 devices diagnostics
command now prints the same bounded DMA-domain policy facts without naming an
owner identity. The line explains that all current DMA-capable prototype
functions remain on direct_dma=blocked, bounce_buffer=required,
direct_dma_trusted_domains=0, claimed_device_domains_ready=0,
remapping_tables=not-programmed, exported_device_addresses=iova-only,
host_physical_user_visible=0, and
prototype_devices=kernel-owned-bounce-buffer-only. This is a diagnostics
mirror for the current conservative policy, not evidence that IOMMU remapping
domains or userspace DMA authority exist.
Implementation note, 2026-05-02 15:45 UTC: attached device-manager DMAPool
records now store the current explicit bounce-buffer policy and the QEMU
device-manager proofs read it back through the active device record plus the
matching DmaPoolHandle. The logged policy scope is
device-manager-attached-dmapool-bounce-buffer-policy, with
direct_dma=blocked, bounce_buffer=required, trusted_domain=none,
remapping_tables=not-programmed, remapping_domain=not-started,
userspace_dmapool=not-started, host_physical_user_visible=0, and
policy_bound_to_manager=true. This binds the conservative policy to current
manager state; it still does not program remapping domains, expose userspace
DMAPool, or perform real DMA mapping teardown.
Implementation note, 2026-05-11 00:00 UTC: attached device-manager DMAPool
policy records now also carry an explicit manager-owned remapping-domain
ledger staging record. The lifecycle and imported-live proofs report
remapping_domain_ledger_scope=device-manager-attached-dmapool-remapping-domain-ledger,
static_iommu_coverage=acpi-pci-diagnostic-only,
remapping_domain_owner=device-manager,
remapping_domain_granularity=per-device-or-trusted-sharing-group,
remapping_domain_ledger=manager-owned-staging-record,
remapping_domain_ready=false, and
iova_export=disabled-future-only, while preserving
direct_dma=blocked, remapping_tables=not-programmed, and
host_physical_user_visible=0. This is a software ledger/readiness record
only: capOS still does not program Intel VT-d/AMD-Vi/QEMU remapping tables,
create a trusted direct-DMA domain, expose host physical addresses or IOVAs,
or claim production hostile-hardware DMA isolation.
Implementation note, 2026-05-11 17:07 UTC: the same manager-owned
remapping-domain staging record is now an explicit activation gate tied to the
active DMAPool record and matching handle. The device manager validates that
gate before current DMAPool policy/accounting and buffer issue paths proceed.
The gate reports domain_ownership=manager-owned-active-dmapool, but keeps
direct_dma=blocked because remapping_table_programming=not-programmed,
iova_export=disabled-future-only,
remapping_invalidation_policy=not-installed,
remapping_iotlb_flush_policy=not-installed, and
remapping_stale_mapping_cleanup=not-installed; the selected fallback remains
remapping_fallback_policy=kernel-owned-bounce-buffer-only. The activation
result is blocked-remapping-prerequisites-missing with
remapping_activation_gate=fail-closed,
remapping_activation_blocker=remapping-tables-not-programmed, and
remapping_activation_side_effect=side-effect-blocked. This is a software
policy gate and proof surface only: capOS still does not program Intel
VT-d/AMD-Vi/QEMU remapping tables, create a trusted direct-DMA domain, export
IOVAs or host physical addresses, remove real IOMMU mappings, flush IOTLB
state, or prove IOMMU-backed hostile stale-DMA behavior.
Implementation note, 2026-05-12 18:49 UTC: the manager-owned staging record
now includes a concrete per-device remapping-domain identity for the active
DMAPool handle: claimed-device domain identity, staged single-device sharing
group, BDF-derived device id, pool slot, pool generation, and owner generation.
The activation preflight treats that identity binding as a prerequisite before
direct DMA could be considered, and the QEMU lifecycle/imported-live proofs emit
a dmapool remapping domain identity proof line. The direct-DMA blocker remains
unchanged: remapping tables are still not-programmed, IOVA export is still
disabled-future-only, invalidation, IOTLB flush, and stale mapping cleanup are
still not-installed, direct DMA remains blocked, and the fallback remains
kernel-owned-bounce-buffer-only.
Implementation note, 2026-05-12 21:19 UTC: the same manager-owned
remapping-domain ledger now carries a separate mapping-lifecycle preflight
record. The record is bound to the active DMAPool handle and claimed-device
domain identity, and the existing device-manager policy gate validates it
before accepting the current bounce-buffer attach/accounting/buffer-issue
paths. Its direct-DMA result remains fail-closed with explicit blockers:
IOVA space, mapping install, removal before page reuse, invalidation policy,
IOTLB flush policy, and stale mapping cleanup are all not-installed. This
is still an in-repo software preflight only; it does not program remapping
tables, expose IOVAs or host physical addresses, enable direct DMA, remove
real IOMMU mappings, flush IOTLB state, or prove IOMMU-backed hostile
stale-DMA behavior.
Implementation note, 2026-05-12 22:26 UTC: capOS now has an Intel/QEMU
remapping table scaffold that can represent a DRHD identity field, PCI
segment/BDF/source ID, domain ID, QEMU Intel address-width choice, disabled
root/context entries, and a second-level page-table-root placeholder. PCI
diagnostics can bind that scaffold to discovered DRHD/segment metadata when
ACPI/PCI discovery provides it. The disabled backend registry’s only accepted
active state is disabled. The proof labels distinguish representability
from programming: root-table pointer, context-entry programming, invalidation
registers, fault registers, protected-memory registers, and invalidation
queue remain not-written; remapping tables remain not-programmed;
hardware programming remains not-attempted; direct DMA remains blocked;
IOVA export remains disabled-future-only; and host physical addresses remain
hidden from userspace. This is still not Intel VT-d, AMD-Vi, or QEMU IOMMU
programming.
Implementation note, 2026-05-23 18:06 UTC: the first production DMAPool ledger
integration for the QEMU Intel remapping path now maps the selected
virtio-rng request-buffer IOVA to an active manager-owned DMABuffer page.
Mapping install is admitted through the matching active DmaPoolHandle and
DmaBufferHandle generations, stale pool and buffer generations fail closed,
and wrong-owner mapping attempts are side-effect blocked. On teardown, the
target second-level leaf is removed, the context-cache and IOTLB completion
polls finish, and only then is the DMABuffer released through the production
device-manager ledger. The proof keeps the IOVA internal, keeps
host_physical_user_visible=0, keeps userspace IOVA export disabled for this
slice, and leaves the no-remapping fallback policy as
kernel-owned-bounce-buffer-only.
Implementation note, 2026-05-23 19:18 UTC: QEMU Intel remapping fault
reporting now decodes VT-d FSTS plus FRCD[0] into a bounded kernel record
for the faulting IOVA, reason, requester source ID, and DMA read/write type.
The unmapped-IOVA, stale-handle, and stale-completion proofs record the fault,
clear it with write-1-to-clear semantics, verify the clear-after-record state,
and report source/IOVA match status without exposing host physical addresses.
The COM1 devices diagnostics path now prints an IOMMU fault summary that
reserves fault_summary=clean for a successful clear fault-status read and
labels unavailable fault-status reads as unavailable/fail-closed. Owner
identity, DRHD register bases, and host physical addresses stay hidden. The
optional audit route is explicitly not wired for this slice
(volatile_audit=not-routed).
Implementation note, 2026-05-13 15:29 UTC: active manager-owned DMAPool
remapping preflight records now consume the same retained DMAR DRHD/requester
metadata when PCI coverage is complete. The active disabled Intel/QEMU
scaffold records the retained DRHD identity and requester segment/BDF/source
ID for the bound pool handle, but absent, malformed, capped, unsupported, or
uncovered metadata still leaves the scaffold not-bound and disabled. This is
metadata-only binding: capOS still does not program root/context tables,
install or remove mappings, invalidate remapping caches, flush IOTLB state,
export IOVAs, enable direct DMA, or expose host physical addresses.
Implementation note, 2026-05-12 23:07 UTC: PCI diagnostics now include a
separate Intel/QEMU remapping MMIO-status proof for the selected DMA-capable
function. When complete retained DMAR DRHD metadata covers that function and
the register base is page-aligned, capOS maps only the selected
remapping-register page for bounded volatile diagnostic reads of the version,
capability, global-status, root-table-address, and fault-status registers. The
mapped label describes the diagnostic access pattern, not page-table write
protection. When the default diagnostics shape has no DRHD, metadata is capped,
or the retained DRHD base is invalid, the same proof reports
mmio_window=not-mapped, mmio_read=not-attempted, unavailable
capability/status/fault reads, and a fail-closed reason. The labels preserve
remapping_tables=not-programmed, direct_dma=blocked,
fallback_policy=kernel-owned-bounce-buffer-only, and
hostile_hardware_isolation=not-claimed. This is not remapping-domain
activation: capOS still does not write VT-d, AMD-Vi, or QEMU remapping
registers, install root/context tables or invalidation queues, export IOVAs,
or claim hostile-hardware DMA isolation.
Implementation note, 2026-05-13 01:20 UTC: active manager-owned DMAPool
records now also carry a generic disabled IOVA ledger under the
remapping-domain record. The ledger binds a domain-scoped reservation identity,
hidden internal range metadata, and reservation generation to the active owner
and pool generation. The same device-manager accounting path validates that
ledger before current bounce-buffer allocation, descriptor submission,
completion accounting, buffer free/page release, and pool release checks
proceed. Pure tests and QEMU proof labels reject stale reservation generations
and wrong owner generations as disabled-iova-ledger-stale with side effects
blocked. The active state remains disabled: proof output keeps the internal
range hidden and reports iova_base_user_visible=0,
host_physical_user_visible=0, iova_export=disabled-future-only,
direct_dma=blocked, mapping_install=not-installed,
mapping_remove_before_page_reuse=not-installed,
invalidation_policy=not-installed, iotlb_flush_policy=not-installed, and
stale_mapping_cleanup=not-installed. This still does not program remapping
tables, export IOVAs, expose host physical addresses, install or remove real
IOMMU mappings, flush IOTLB state, or claim hostile-hardware isolation.
Implementation note, 2026-05-11 18:44 UTC: the selected userspace virtio-net
TX provider smoke now grants a runtime-visible DeviceMmio notify BAR cap
named notify_mmio, but keeps the active DMA posture unchanged. The provider
still uses manager-owned bounce buffers, direct_dma=blocked, and
host_physical_user_visible=false; the notify cap is a no-write MMIO
admission boundary over the selected virtio-net TX notify offset, not a direct
DMA or descriptor-ring ownership transfer. The selected submit path validates
descriptor authority and scrubs the bounce page before consuming the
grant-derived notify policy, and it proves wrong value, wrong offset, stale
handle, and stale generation block before any doorbell. This does not program
IOMMU/remapping tables, export IOVA or host physical addresses, mutate the
real virtio-net descriptor ring, or claim production NIC isolation.
Implementation note, 2026-05-11 19:05 UTC: the notify_mmio grant remains
runtime-visible but is now explicitly no-direct-MMIO as well as no-write.
DeviceMmio.map and DeviceMmio.read32 for that provider notify cap return
typed blocked results before any user VMA or register read, DeviceMmio.info
validates the live mapping generation before accepting the provider-notify
record, and notify_mmio detach clears the submit-path notify policy so a
later selected submit reports stale-handle blocking with no doorbell write.
The submit path also invalidates the cached notify policy on owner-generation
transitions and stale/missing cap-release detach boundaries before accepted
no-write authority can be reported.
Implementation note, 2026-05-11 19:56 UTC: the same provider smoke now grants
a runtime-visible Interrupt cap named tx_interrupt for the selected
virtio-net TX MSI-X route snapshot. This extends the grantable authority
boundary without changing the DMA posture: the provider still uses
manager-owned bounce buffers, direct DMA and IOMMU remapping remain blocked,
and the interrupt cap only proves generation-checked admission for
info/ack/unmask/wait/mask plus waiter cancellation and stale-after-release
blocking. Later bounded follow-ups add selected TX MSI-X vector-control
mask/unmask; this first grant slice did not deliver provider IRQs to userspace,
acknowledge or mask/unmask hardware, ring a doorbell, or mutate real
virtio-net descriptor rings.
Implementation note, 2026-05-11 20:09 UTC: provider TX waiters are now
separate no-delivery waiter-table entries rather than generic delivery
waiters. A pending tx_interrupt.wait remains pending across TX route
delivery-count advancement and only completes through the explicit
mask/release cancellation path. The staged provider TX grant source also
tracks a live-issued cap and refuses another live tx_interrupt alias for the
same route snapshot.
Implementation note, 2026-05-12 09:13 UTC: the selected userspace virtio-net
TX provider path now performs one bounded real descriptor/avail publication on
queue 1 while keeping the DMA posture conservative. The descriptor points at
the manager-owned bounce buffer already governed by the DMABuffer record;
the submit path validates live buffer identity, scrubs before publication,
requires the live no-write notify_mmio policy, asserts that the descriptor
page is ledgered to the virtio-net TX queue, and blocks wrong queue, stale
notify policy, and a real stale DMABuffer.submitDescriptor(queue=1) attempt
at the stale capability/liveness boundary before touching the real ring. The
proof logs descriptor/avail/used ring physical addresses for kernel evidence,
but those addresses are not returned to userspace. Because this slice does not
claim real used-ring/CQ completion, the
published page remains pinned in the manager in-flight record; userspace
completeDescriptor, freeBuffer, post-publication remap, and cap-release
drain do not retire, remap, or release it. A follow-up rings one selected TX
virtqueue notify doorbell after the same descriptor authority, submit-scrub,
live notify_mmio policy, submit-effect write, and publication gates, while
wrong-queue and stale-notify or stale-DMABuffer paths remain not-written.
Readback-mismatch publication failures do not write the immediate doorbell and
are treated as possibly-published ring state that quiesces later TX
notification and keeps the manager buffer pinned rather than claiming rollback.
Pre-publication bounce-page metadata remains doorbell_write=not-written.
Any immediate used-ring or IRQ effect from that doorbell is recorded only as
an out-of-scope hardware side effect. Later 2026-05-12 follow-ups advanced the
selected path to a bounded used-ring completion handoff:
DMABuffer.completeDescriptor validates the live manager-attached buffer and
in-flight descriptor id, observes the real TX used ring for the stored
software descriptor generation, consumes that entry through the existing
descriptor tracker and DMA ledger, and only then clears the manager in-flight
record. As of commit e248d42b (2026-05-23 13:36 UTC), kernel TX helpers
stay quiesced after provider ownership starts while the provider path can
publish the full selected TX queue-depth window of eight descriptors before the
first completion; the smoke records live_inflight_after_submits=1/2/3/4/5/6/7/8
(the ninth allocation rejected dmapool-budget-exceeded), blocks map/free/reuse
while any buffer is in flight, and proves wrong-order descriptor 7 used-ring
handling preserves the observed descriptor 0 completion for its matching
generation.
The
provider-facing
tx_interrupt waiter is a
runtime-visible completion-event consumer for the same selected route;
delivery validates the expected TX source id, source generation, route
generation, owner, driver-unmasked state, and live issue id before completing
each bounded event. A 2026-05-13 follow-up adds the bounded
incomplete-descriptor teardown drain: when one descriptor has completed and
seven remain incomplete, release retires only the incomplete descriptors’
allocation-backed TX DMA ledgers and clears only the selected virtqueue
descriptor/used-ring tracking needed for those releases, while CQ publication
and provider IRQ delivery stay blocked and the pending waiter remains
undelivered until the smoke explicitly cancels it through the existing
mask/cancel path. Commit e248d42b (2026-05-23 13:36 UTC) extends that drain
to the full selected TX queue-depth window and keeps the completed descriptor’s
buffer retained until it is explicitly freed. A later 2026-05-13 remediation
binds each provider TX
in-flight descriptor to the submission-time provider issue/source/route
generation. If that old descriptor completes after tx_interrupt
release/regrant, DMABuffer.completeDescriptor fails closed as
dmabuffer-provider-tx-stale-issue before consuming the used ring, publishing
provider CQ/IRQ state, or advancing provider acknowledgements; later cap
release may still drain the descriptor as teardown-only.
tx_interrupt.wait posting is serialized with provider release, mask, and
delivery, and stale issue ids fail closed at admission and insertion. A later
2026-05-13 follow-up lets tx_interrupt.acknowledge account exactly one
already observed selected TX dispatch token paired with one delivered provider
CQ event; the smoke proves pre-event, duplicate, teardown-drain, masked-route,
reset/regrant, stale-after-release, and stale issue acknowledgements fail closed
before delivery-count, route-state, CQ, ack-ledger, or hardware-dispatch-ack
mutation. This is still bounded selected-route evidence: provider IRQ
ownership, deferred EOI, LAPIC/MSI-X acknowledgement, direct DMA, IOMMU mapping,
full virtio-net ownership, production NIC/storage migration, and cloud
readiness remain open. Commit e248d42b (2026-05-23 13:36 UTC) adds
release-time retirement for delivered but unacknowledged bounded provider TX CQ
events: the release proof now records seven pending provider completion acks
retired from the ledger in one live issue while preserving the separate
stale-bound in-flight descriptor proof, with stale post-release ack revoked and
no hardware ack claimed. A 2026-05-13 follow-up adds bounded selected TX MSI-X
mask/unmask only: live provider tx_interrupt.mask and unmask toggle the
selected TX vector-control bit plus route state, preserve generations and
delivery counts, and block stale issues before side effects.
Implementation note, 2026-05-02 19:43 UTC: the bounded zero-live
device-manager DMAPool lifecycle proof now treats its manager-attached DMA
buffer record as teardown-blocking metadata. The pool detach path still checks
authoritative live accounting first, then rejects zero-live detach while the
proof buffer is attached as dmapool-buffer-attached. Before the active free
path, the proof validates stale same-slot and wrong-identity FreeBuffer
operations through
capos-lib::device_authority::validate_dma_buffer_operation. The wrong
identity cases cover wrong owner generation, wrong pool slot, wrong pool
generation, and wrong buffer slot; each records dmabuffer-stale-handle, the
exact validator reason (stale-owner-generation, wrong-pool,
stale-pool-generation, or wrong-slot), side-effect-blocked, and a
preserved manager-owned buffer record. The stale same-slot case continues to
record stale-slot-generation and buffer_stale_free_preserved=true, then
the proof observes that pool detach still fails as
dmapool-buffer-attached. The proof clears the gate only after validating a
proof-scoped active FreeBuffer operation, scrubbing and freeing the
kernel-owned proof frame, and detaching the manager-owned buffer record. This
remains lifecycle evidence only: no userspace DMAPool or DMA-buffer
authority is exposed, no physical address or IOVA is exposed, and S.11.2
hostile stale-DMA smokes remain open.
Implementation note, 2026-05-03 02:31 UTC: the same zero-live
device-manager DMAPool lifecycle proof now validates manager-record
CompleteDescriptor authority for the attached DmaBufferHandle. Active
completion validation records buffer_active_complete_result=ok; freed-buffer,
reused-slot generation, and stale-after-revoke completion attempts fail closed
as dmabuffer-stale-handle with exact pure-validator reasons and
side-effect-blocked. This is manager-record validation evidence only: it does
not complete a real descriptor, publish a completion queue entry, grant a
userspace DMABuffer, run real DMA, or clean up or reuse production
userspace DMA pages.
Implementation note, 2026-05-14 14:05 UTC (DDF IOMMU remapping Slice A1): the
first slice that programs real Intel VT-d remapping state under QEMU has
landed. The ## First QEMU Intel Remapping Smoke Acceptance Gate above defines
the full bounded gate; that gate is being delivered as a sequenced A1/A2/B/C
split (the slice was correctly scoped as bigger than one reviewable unit). This
note records Slice A1.
Pinned QEMU shape: qemu-system-x86_64 8.2.2, -machine q35,
-device intel-iommu,aw-bits=39 (3-level second-level page tables, 39-bit IOVA
space). The kernel iommu module (kernel/src/iommu.rs, cfg(qemu)-only)
selects one DMA-capable function that is not the live virtio-net
bounce-buffer path (virtio-rng under the default smoke shape), allocates a root
table, one context table, a 3-level second-level page table, and one mapped
DMA page; encodes and writes one root entry, one context entry (binding the
requester source id to a domain id, aw-bits=39, and the second-level root),
and the second-level table-pointer / leaf entries through the HHDM; writes the
Root Table Address Register; and runs the global-command/global-status
SRTP-then-TE handshake, polling the status register for each step. The
capability-register extended-capability IRO field (IOTLB register offset) is
decoded and reported for Slice B’s benefit. MMIO ordering invariants are
enforced with SeqCst fences: between the last in-memory table-entry write and
the RTAR write, between the RTAR write and GCMD.SRTP, and between the latched
root pointer and GCMD.TE.
Slice A1 proves translation with kernel-side structural evidence only, which
the gate’s IOVA-export-discipline clause explicitly permits. Hardware confirms
translation-enabled (GSTS.TES + GSTS.RTPS polled set), the written entry
words are read-back-verified through the HHDM, the pure
capos_lib::device_authority validator accepts the layout, and the unmapped
IOVA’s 3-level walk structurally terminates at a non-present entry. The proof
labels are scrupulously honest: mapped_iova_translated=structural (not
hardware-dma), unmapped_iova_fault=structural-not-present (not observed),
proof_evidence=kernel-side-structural. A real hardware-DMA translation and a
real fault-status fault require driving a device virtqueue through the IOMMU;
that is deferred to follow-on task A2 (a virtio-rng virtqueue driver as the
DMA proof vehicle). Invalidation + IOTLB flush with completion polling (Slice B)
and the IOMMU-backed stale-handle / stale-completion hostile smokes (Slice C)
are also follow-on slices; at A1 their proof lines emit
proof_result=deferred-next-slice. A2, B, and C have since landed — see the
implementation notes below.
The table pages are recorded in a bounded ledger modeled on the device-manager
DMAPool page-accounting discipline (allocate-record, scrub-before-free on the
fail-closed path); mapping removal with IOTLB-flush-ordered scrub/free is Slice
B. IOVA export stays disabled (iova_export=disabled-this-slice), no host
physical address is user-visible (host_physical_user_visible=0), and no
hostile-hardware isolation is claimed (hostile_hardware_isolation=not-claimed).
The kernel-owned bounce-buffer fallback is unchanged for QEMU shapes without
usable intel-iommu hardware and is emitted with the explicit
fallback_policy=kernel-owned-bounce-buffer-only /
remapping_tables=not-programmed labels. The new make run-iommu-remapping
gate is asserted by tools/qemu-iommu-remapping-smoke.sh; make run-net,
make run-dmapool-grant, and make run-diagnostics continue to prove the
fallback path unchanged.
Implementation note, 2026-05-14 15:19 UTC (DDF IOMMU remapping Slice A2): the device-DMA proof vehicle has landed, upgrading the A1 structural proof to a real hardware-DMA proof and closing the literal hardware-DMA text of gate part
- After the VT-d tables are programmed and
GCMD.TEis set, a minimal virtio-rng virtqueue driver — split into a mapped-DMA phase (crate::virtio::prove_iommu_rng_mapped_dma) and an unmapped-DMA phase (crate::virtio::prove_iommu_rng_unmapped_dma) — drives the device QEMU exposes on theintel-iommushape. The second-level table now installs four leaf entries inside one shared L1 page: the request buffer plus the three virtqueue ring pages (descriptor table, available ring, used ring). The driver programs the device’sQUEUE_DESC/QUEUE_DRIVER/QUEUE_DEVICEregisters and the request descriptor’saddrfield with the programmed IOVAs, never the host-physical page addresses. Because VT-d translation is global per requester onceGCMD.TEis set, every DMA the device issues — every ring access and the entropy write — must walk the second-level table. A used-ring completion plus a non-zero buffer reading therefore proves a real hardware DMA reached the kernel page through the programmed IOVA translation:mapped_iova_translated=hardware-dma,proof_evidence=virtio-rng-hardware-dma.
The driver then publishes a second descriptor whose addr is the
deliberately-unmapped IOVA and kicks the device. The device’s DMA to that IOVA
raises a real VT-d translation fault; the kernel reads it back out of the Fault
Status Register (FSTS.PPF), the first Fault Recording Register’s fault bit
(FRCD[0].F at the decoded CAP.FRO offset), and the faulting page address
recorded in FRCD[0] — which must equal the unmapped IOVA the device was
pointed at — and reports unmapped_iova_fault=observed with the
fault_recording_reason code. The fault gate is purely the VT-d register
surface: whether QEMU’s virtio-rng still pushes the faulting descriptor onto
the used ring afterward (it does, with the entropy write dropped) is QEMU
device behavior, reported as the unmapped_descriptor_uncompleted diagnostic
field but deliberately not a gate condition. The fault registers are cleared
(write-1-to-clear) before the device DMA and again after the observed-fault
read so no stale fault is mistaken for the proof and no fault is left for a
later VT-d consumer. The MMIO discipline reuses A1’s NO_CACHE mapping and the
descriptor/available-ring writes are SeqCst-fenced before the notify
doorbell. The two phases are deliberately split so the kernel reads the fault
registers strictly between them — the unmapped-IOVA descriptor is never in
flight while the mapped-DMA result is judged. The virtio-rng device negotiates
VIRTIO_F_ACCESS_PLATFORM, which is what makes QEMU route its DMA through the
platform IOMMU rather than treating the ring registers as host-physical
addresses; the run-iommu-remapping make target therefore creates the
virtio-rng device with iommu_platform=on (a target-scoped override of the
shared QEMU_SECOND_DEVICE, which no other run target needs because none of
them drives virtio-rng DMA). The IOMMU-backed hostile smokes (Slice C) were a
follow-on at A2 (proof_result=deferred-next-slice) and have since landed —
see the Slice C implementation note below; IOVA export stays disabled and no
host-physical address is user-visible.
Implementation note, 2026-05-14 17:19 UTC (DDF IOMMU remapping Slice B): the
invalidation + IOTLB flush + invalidation-ordered scrub/free has landed,
closing gate part 2. After the A2 hardware-DMA proof, kernel/src/iommu.rs
runs a revocation cycle (run_invalidation_revocation_cycle) that models the
device-manager DeviceOwnerState revocation FSM at the
QueuesQuiesced -> Resetting -> DmaMappingsRemoved -> Dead steps. The cycle
removes the four second-level leaf entries the A2 layout installed (request
buffer + the three virtqueue ring pages), SeqCst-fences so the in-memory
removal is visible to the IOMMU before the flush, then issues two
register-based invalidations: a context-cache invalidation through the
Context Command register (CCMD_REG at offset 0x28, CCMD.ICC set with
CCMD.CIRG global granularity, polling CCMD.ICC clear for completion) and a
domain-selective IOTLB invalidation through the IOTLB register at the
A1-decoded CAP.IRO offset + 8 (IOTLB.IVT set with IOTLB.IIRG
domain-selective granularity and the domain id in IOTLB.DID, polling
IOTLB.IVT clear and reading IOTLB.IAIG back non-zero to confirm the
request was serviced). Both completion polls are bounded by the same
VTD_STATUS_POLL_BUDGET the A1 status handshakes use.
The hard ordering invariant — the whole point of the slice — is that the eight
VT-d ledger-owned table/ring/used pages and the separate production
DMAPool-owned request-buffer page are scrubbed and returned to their ledgers
strictly after both completion polls return. A SeqCst fence sits between
the completion reads and the scrub/free so the ordering is explicit in program
order. A poll that exhausts its bounded budget fails closed:
invalidation_completed is false, the pages are deliberately not freed
(a page reused while hardware may still hold a cached translation through it is
a stale-DMA hole), the ledgers keep them accounted rather than
leaked-and-forgotten, and the proof line reports
proof_result=fail-closed. Slice B uses register-based invalidation only:
no GCMD.QIE queued-invalidation bit is set, so the A1 single-bit-GCMD
discipline (correct only by minimalism — no other persistent GCMD bit set)
still holds and the GCMD-reconstruct boundary is not crossed. The
production DMAPool programming-abort path follows the same rule: if VT-d
programming fails before root/translation state can expose the DMAPool page to
hardware, the prepared DMABuffer/DMAPool records are detached; if the mapping
may already be hardware-visible, the partial VT-d ledger is carried through
the same leaf-removal and invalidation teardown before any VT-d table/ring page
or production DMAPool page can be reused. The
make run-iommu-remapping smoke and tools/qemu-iommu-remapping-smoke.sh now
assert the invalidation proof line as proof_result=ok with
mapping_removed=true context_cache_invalidated=true iotlb_flushed=true
iotlb_actual_granularity_nonzero=true invalidation_completed=true
page_reuse_ordered_after_invalidation=true table_pages_live_after=0
invalidation_interface=register-based-ccmd+iotlb, and forbid both a
regression to the Slice-A deferred label and any invalidation_interface=queued
value. The IOMMU-backed stale-handle / stale-completion hostile smokes (Slice
C) were the deferred follow-on; they have since landed (see the note below),
and as part of that work the single-phase run_invalidation_revocation_cycle
was refactored into a two-phase run_target_revocation_phase +
complete_revocation_teardown so the hostile re-drive can sit between the
phases — the Slice B contract (every page freed strictly after its mappings
are invalidated) is unchanged, and the combined invalidation proof line
still asserts proof_result=ok for the complete teardown. IOVA export stays
disabled and no host-physical address is user-visible.
Implementation note, 2026-05-14 19:13 UTC (DDF IOMMU remapping Slice C): the
IOMMU-backed hostile stale-DMA smokes have landed, closing gate part 5 and the
parent IOMMU remapping task. Closing the slice required refactoring the Slice
B revocation into two phases so the hostile re-drive can run against a
partially revoked remapping — the original single-phase cycle freed every
page (including the virtio-rng descriptor table and available ring) before any
hostile re-drive could observe a fault, so the re-driven device read an
all-zero descriptor and issued no DMA at all. The kernel/src/iommu.rs ledger
now classes each page by revocation-phase role: Target (request buffer +
used ring — what the device’s DMA lands on), RingInfra (descriptor table +
available ring — what the device reads to issue a DMA), and Table
(root/context/second-level tables). run_target_revocation_phase removes the
Target second-level leaves, invalidates the context-cache + IOTLB, and frees
the Target pages — while deliberately keeping the RingInfra + Table
pages mapped and live. run_hostile_stale_dma_cycle then re-drives the
same still-live old-generation virtio-rng device through the new
crate::virtio::prove_iommu_rng_stale_dma (each re-drive uses a fresh
available-ring index past the A2 phases so the device sees a genuinely new
descriptor): because the ring-infra pages are still mapped, the device reads a
valid descriptor whose addr is a revoked target IOVA, so the DMA faults
in the IOMMU (FSTS.PPF + FRCD[0].F, recorded faulting address is the stale
IOVA) instead of reaching memory. A stale mapping-install attempt is refused
(attempt_stale_mapping_install — the RevokedRemapping token is a
dead-domain receipt with no live table handle, not install authority), and the
freed Target page reads back as the scrubbed zeros. A second re-drive at the
revoked used-ring IOVA faults too, publishes no device-written used-ring CQ
entry into the freed page, exposes no memory to a would-be new owner, and
makes no freed page eligible for reuse. complete_revocation_teardown then
finishes the Slice B teardown by revoking + freeing the RingInfra + Table
pages; the combined gate-part-2 invalidation proof line
(invalidation_phases=target-then-ringinfra) still asserts proof_result=ok
for the full two-phase teardown, with the same hard ordering invariant in each
phase (pages freed strictly after that phase’s invalidation completion polls
return; a poll that exhausts its budget fails closed and the phase’s pages are
not freed). The load-bearing observation is the revoked translation state
— not device cooperation, not a software ledger drop — blocking the stale DMA,
confirmed by the VT-d fault registers plus a freed-page-stays-scrubbed
read-back through the HHDM. Existing bounce-buffer stale-DMA evidence (the
device-dma S.11.2.7/S.11.2.8 proofs) is preserved unchanged; the
IOMMU-backed hostile smokes are strictly additive. The
make run-iommu-remapping smoke and tools/qemu-iommu-remapping-smoke.sh now
assert both hostile proof lines as proof_result=ok and forbid regression
to the deferred, not-reached, or fail-closed labels. IOVA export stays
disabled (iova_export=disabled-this-slice), no host-physical address is
user-visible (host_physical_user_visible=0), and no hostile-hardware
isolation is claimed (hostile_hardware_isolation=not-claimed).
Implementation note, 2026-05-23 (domain-scoped IOVA export): the selected QEMU
Intel production DMAPool path now exposes the mapped request-buffer IOVA
through DMABuffer.info only while the matching active DmaBufferHandle
generation is live. The schema fields are deviceIova,
deviceIovaScope=domain-scoped-iova,
deviceIovaMeaning=meaningless-outside-domain, and
iovaExport=domain-scoped-only; the production remapping proof asserts that
deviceIova=0x200000 matches the installed second-level mapping. After the
buffer is freed and the pool is released, an export attempt on the same handle
fails closed with side-effect-blocked. The bounce-buffer grant path still
reports deviceIova=0, deviceIovaScope=none,
deviceIovaMeaning=not-exported, and iovaExport=disabled-future-only.
Implementation note, 2026-05-23 21:34 UTC (production-path hostile smokes):
the selected QEMU Intel path now emits and asserts iommu-remapping: production dmapool hostile proof over the active manager-owned DMAPool /
DMABuffer ledger. The proof ties the raw VT-d stale-handle and
stale-completion faults to the production mapped IOVA, synthetic stale
pool/buffer generation mismatch candidates, post-teardown stale-handle export
failure, and per-device cross-domain boundary. It covers stale IOVA after
revoke/reset, descriptor abuse, revoke/reset race ordering, stale completion
after reset, teardown-under-DMA ordering, and cross-domain stale-handle
attempts; no second-level entry is installed for stale authority, no CQ entry
is published, no new-owner memory is exposed, and page reuse stays ordered
after invalidation completion. It does not claim a process-exit trigger for
the IOMMU path; the existing make run-net bounce-buffer evidence remains the
exit-under-DMA source. The same smoke asserts the
complete_iommu_dmapool_mapping_teardown prerequisite-false return and the
hold_iommu_dmapool_mapping_ledger_after_abort path as fail-closed branch
evidence. Existing bounce-buffer S.11.2 evidence from make run-net and
make run-dmapool-grant is preserved unchanged.
Implementation note, 2026-05-26 05:55 UTC (direct-DMA posture transition for
the selected QEMU Intel path): the closeout slices above landed the full
mechanism — real hardware DMA over a manager-owned DMAPool DMABuffer page
mapped through the per-device IOMMU domain, domain-scoped IOVA export, per-device
domains, and the production hostile matrix — but deliberately deferred the
headline direct_dma=enabled claim behind iova_export=disabled-this-slice.
The selected QEMU Intel path now emits iommu-remapping: direct-dma posture real_dma=attempted direct_dma=enabled remapping_tables=programmed trusted_domain=<domain-id> descriptors_reference=domain-scoped-iova mapped_page_source=manager-owned-dmabuffer mapping_installed_before_doorbell=true invalidated_before_page_reuse=true bounce_buffer=not-required exported_device_addresses=iova-only host_physical_user_visible=0 hostile_hardware_isolation=not-claimed proof_result=ok. Every field is computed
from the real proof facts, not asserted as a constant: remapping_tables=programmed
requires the root/context/second-level entries written plus the SRTP/TES
handshakes; real_dma=attempted requires the virtio-rng device’s mapped DMA to
have completed through the programmed IOVA (hardware_dma_translation_proven);
direct_dma=enabled additionally requires the manager-owned DMAPool mapping to
have been installed before the device doorbell; and invalidated_before_page_reuse
folds in the two-phase revocation’s page_reuse_ordered_after_invalidation and
invalidation_completed results. This is bounded QEMU-emulator evidence, so
hostile_hardware_isolation stays not-claimed (real hostile-hardware isolation
needs real hardware, not QEMU). The no-IOMMU run-net / run-dmapool-grant
bounce-buffer fallback is untouched: it keeps direct_dma=blocked with no IOVA
export, and make run-iommu-remapping now forbids this path from regressing to
the bounce-buffer fallback proof or to a blocked/not-attempted posture. The
contract table in “Downstream-Contract Scaffolding” (direct-remapping domain:
direct_dma=enabled, remapping_tables=programmed,
exported_device_addresses=iova-only) is now backed by an emitted, asserted
posture on the selected path.
Authority Model
Device authority is split into three independent capabilities:
DMAPool: authority to allocate, expose, and revoke device-visible memory within a kernel-owned physical range or IOMMU domain.DeviceMmio: authority to map and access one device’s register windows.Interrupt: authority to wait for and acknowledge one interrupt source.
Holding one of these capabilities never implies the others. A driver needs all three for a normal device, but the kernel and init can grant, revoke, and audit them separately.
Production Handle Epoch Invariants
All three object families use opaque handles whose identity is checked against kernel-owned records before every operation. A raw object id is never enough to authorize DMA, MMIO, interrupt waits, acknowledgements, descriptor submission, or teardown. A handle is accepted only when all of these facts match in the same ownership transaction:
- the object id resolves to a live record of the expected type;
- the handle’s device owner generation matches the current device-manager owner record;
- the handle’s pool, mapping, slot, source, or route generation matches the current reusable subrecord;
- the record state permits the requested operation.
The exact ABI shape may change when the capability surface is implemented, but production handles must carry the equivalent identity:
#![allow(unused)]
fn main() {
struct DmaPoolHandle {
device_id: u32,
owner_generation: u64,
pool_id: u32,
pool_generation: u64,
}
struct DmaBufferHandle {
device_id: u32,
owner_generation: u64,
pool_id: u32,
pool_generation: u64,
slot: u32,
slot_generation: u64,
}
struct DeviceMmioHandle {
device_id: u32,
owner_generation: u64,
bar: u8,
mapping_id: u32,
mapping_generation: u64,
}
struct InterruptHandle {
device_id: u32,
owner_generation: u64,
source_id: u32,
source_generation: u64,
route_generation: u64,
}
}
Object identity fields have distinct jobs:
DMAPoolhandles name the claimed device, the device owner generation, and the pool record generation. Buffer handles issued by the pool repeat the device-owner and pool identity and additionally name a buffer slot and slot generation. The pool identity prevents a handle from crossing devices or owners; the slot identity prevents a freed or reused buffer slot from accepting an old descriptor, free, or completion.DeviceMmiohandles name the claimed device, owner generation, BAR or subrange mapping record, and mapping generation. The physical range, cache attributes, and access policy remain in the kernel record and are not user-editable handle fields.Interrupthandles name the claimed device, owner generation, source record, source generation, and route generation. Waiter records may carry their own waiter generation internally, but they must be invalidated whenever the source or route generation changes.
Owner generations and subrecord generations are intentionally separate. The
device owner generation belongs to the device-manager ownership record and
invalidates every DMAPool, DeviceMmio, and Interrupt handle for the old
owner when ownership is revoked, transferred, reset, or reassigned. Pool,
buffer-slot, MMIO-mapping, interrupt-source, and route generations belong to
records that may be reused below the device owner. They prevent stale buffer,
mapping, route, waiter, and completion handles from matching a newly allocated
subrecord even when the device id or pool id is reused.
Every epoch is non-wrapping for authority purposes. Implementations must use an epoch width that cannot wrap during the object’s lifetime, or permanently retire the exhausted device, pool, slot, mapping, source, or route record. Epoch exhaustion is a closed allocation or reassignment failure; it must never wrap back to a value that could match an old handle.
Generation mismatch, wrong object type, wrong device owner, freed slot, detached source, revoked mapping, and wrong device-owner state are hard closed results. The failed operation must not program a descriptor, ring a doorbell, perform an MMIO write, unmask or acknowledge an interrupt, wake a waiter, publish a CQE, decrement completion accounting, free a page for reuse, or mutate the device ledger except for bounded failure accounting or audit metadata.
Transfer, revoke, reset, and reassignment are ordered around those epoch checks:
- Transfer: The old owner leaves
Active, the owner generation advances, and old handles become invalid before a new owner receives handles. A transfer may preserve hardware state only after old interrupt notifications, MMIO write authority, and DMA submissions are either quiesced or represented by old-generation ledger entries that the new owner cannot consume. - Revoke: The device manager invalidates user-visible handles first, then follows the revocation order below: MMIO write authority removed, interrupts masked or detached, queues quiesced or reset, mappings removed, and pages scrubbed before release.
- Reset: Reset or disable advances the owner generation and any affected source, route, pool, mapping, and buffer-slot generations before new handles can be issued. If old DMA writes cannot be proven stopped, buffer slots stay unavailable until reset completion and mapping invalidation prove reuse is safe.
- Reassignment: Interrupt sources, MMIO mappings, and DMA pool records are detached or unmapped, their subrecord generations advance, pending waiters or completions are drained or marked stale, and only then can a new owner receive authority for the reused source, mapping, or slot.
Handle reuse rules:
- stale handles fail closed;
- freed-handle reuse fails closed;
- reallocated slots must not restore authority to old handles;
- old interrupt waiters must not observe or acknowledge a new owner’s interrupt source;
- old DMA handles must not reference a newly allocated buffer in the same slot.
Production proof obligations are split between host tests and QEMU smokes. Host tests must cover the pure validator and state-machine cases: stale owner generation, stale pool or mapping generation, stale buffer slot, stale interrupt source or route, wrong owner, wrong device, wrong object type, freed object, wrong state, epoch exhaustion/retirement, and no side effects on failure. QEMU smokes must prove the hardware-facing ordering: stale DMA handles after free/reuse cannot submit descriptors, stale DMA completions after revoke/reset cannot publish CQEs or mutate reused buffers, stale MMIO handles cannot ring doorbells after revoke, stale interrupt waiters or acknowledgements cannot wake or affect a new owner, and process-exit or driver-crash teardown reaches a zero-live ledger before pages are reused. These production handles and proofs remain open; the current QEMU scratch proofs are prerequisite evidence for this contract, not completion of it.
Implementation note, 2026-05-02 13:18 UTC: capos-lib::device_authority now
implements the bounded pure host-testable validator prerequisite for these
handle epoch invariants. The module models the documented handle and record
identity fields for DMAPool, DMA buffer, DeviceMmio, and Interrupt,
separates device-owner generations from pool, slot, mapping, source, and route
generations, returns explicit fail-closed error labels, blocks the relevant
side-effect class on validation failure, and refuses epoch wrap or retired
epoch reuse. This does not expose production userspace handles, wire kernel
device paths, attach budget/OOM policy to real handle creation, or complete
the QEMU stale-handle or S.11.2 hostile-smoke gates.
Implementation note, 2026-05-03: the pure host-test operation matrix now
enumerates every current validator operation variant:
DMAPool::{AllocateBuffer,IssueBufferHandle},
DMABuffer::{SubmitDescriptor,CompleteDescriptor,FreeBuffer},
DeviceMmio::{Map,Read,Write,RingDoorbell,Unmap}, and
Interrupt::{Wait,Acknowledge,Mask,Unmask}. Each row asserts active
acceptance plus stale owner/subrecord, freed, revoked, and retired failures
with the exact blocked side-effect class for that operation. This remains
ABI-independent host-test evidence only; it does not create production
userspace handles or replace the QEMU stale-handle and S.11.2 hostile-smoke
gates.
Implementation note, 2026-05-02 13:43 UTC: the current kernel device-manager
DMAPool lifecycle and imported-live accounting proofs now adapt their BDF,
owner generation, pool slot, and pool generation into
capos-lib::device_authority records. The QEMU proof records active validator
success, stale-after-revoke failure as dmapool-stale-handle, the validator
reason stale-owner-generation, and side-effect-blocked. This is still a
bounded kernel-proof adapter, not production userspace handle exposure,
DeviceMmio/Interrupt handle wiring, production page cleanup, or S.11.2
hostile smoke completion.
Implementation note, 2026-05-02 17:04 UTC: the current kernel device-manager
also has a bounded manager-owned DeviceMmio record proof adapter. The record
carries BAR, mapping id, mapping generation, and owner generation fields, and
the QEMU virtio-rng device-manager path validates a RingDoorbell operation
through capos-lib::device_authority. After revoke begins, the old handle
fails through the pure validator as stale-owner-generation, records
devicemmio-stale-handle plus side-effect-blocked, and no doorbell write is
attempted. The lifecycle proof blocks RevokingHandles -> MmioRevoked while
the record is attached, then allows the transition after bounded detach. This
does not expose production userspace DeviceMmio authority, program real BAR
mappings, create mapping objects, or complete hostile stale-MMIO smokes.
Implementation note, 2026-05-02 17:29 UTC: the bounded DeviceMmio adapter now
derives the proof mapping from the first decoded PCI memory BAR on the tested
PciDevice through the shared BAR-region validator. The attached manager-owned
record carries that BDF/BAR/base/length metadata and validates that it is the
same BDF, a memory BAR, nonzero length, and the same BAR named by the handle
before constructing the pure capos-lib::device_authority record. The QEMU
smoke asserts region_source=pci-decoded-memory-bar,
region_bound_to_manager=true, bar_present=true, bar_memory=true,
bar_base, and bar_length. This is still prerequisite evidence only: it does
not create userspace DeviceMmio handles, program real MMIO mappings, enforce
cache attributes or write policy, or write a real doorbell.
Implementation note, 2026-05-02 17:54 UTC: the same bounded DeviceMmio
adapter now records fail-closed malformed-region evidence before the positive
attach. Wrong-BDF metadata, wrong BAR/handle mismatch, and zero-length region
metadata all report devicemmio-region-invalid, with
region_invalid_mapping=not-created and
region_negative_side_effect=side-effect-blocked; the proof still records
real_mmio_mapping=not-programmed and real_doorbell=not-written. This is
bounded manager-proof evidence only. It does not create userspace
DeviceMmio handles, map real BAR pages, enforce cache attributes or write
policy, complete hostile stale-MMIO smokes, or perform a real doorbell write.
Implementation note, 2026-05-02 20:14 UTC: the bounded DeviceMmio adapter now
stores future mapping policy metadata on the attached manager-owned record and
reads it back through the active record plus matching DeviceMmioHandle. The
proof line records policy_scope=manager-attached-devicemmio-cache-write-policy,
cache_policy=device-uncacheable,
page_table_protection=capability-scoped-device-nx,
write_policy=claimed-registers-and-doorbells-only, executable=blocked,
userspace_devicemmio=not-started, host_physical_user_visible=0,
policy_bound_to_manager=true, and policy_result=ok. A tampered cache/write
policy record fails closed with policy_tamper_result=fail-closed,
policy_tamper_mapping=not-created, and
policy_tamper_side_effect=side-effect-blocked. This is still metadata proof
only: no PAT/MTRR or page-table programming is performed, no userspace
DeviceMmio handle is created, no real BAR mapping object exists, and no
doorbell is written.
Implementation note, 2026-05-02 20:45 UTC: while the same bounded
manager-owned DeviceMmio record is still active, the proof now validates
hostile RingDoorbell handles through a proof-scoped adapter that uses the
already-attached record rather than manager lookup short-circuits. Wrong owner
generation, wrong mapping generation, wrong mapping id, wrong BAR, and wrong
BDF/device fail closed with exact pure-validator reasons
stale-owner-generation, stale-mapping-generation, wrong-mapping,
wrong-bar, and wrong-device. Each records side-effect-blocked; the proof
also records that the attached manager record is preserved, no fake mapping is
created, and no doorbell is written. This remains bounded proof evidence only:
production userspace handles, real MMIO mappings, real cache/write-policy
enforcement, and hostile stale-MMIO smokes remain open.
Implementation note, 2026-05-03 00:36 UTC: the schema and kernel now include a
result-only DeviceMmio.info skeleton that can wrap a manager-issued
DeviceMmioHandle. The object validates the live device-manager record
through validate_devicemmio_record() before returning status labels such as
userspaceDeviceMmio=manager-issued-skeleton,
managerRecord=validated-active, realMmioMapping=not-programmed,
realDoorbell=not-written, hostPhysicalUserVisible=false,
directMmio=blocked, registerRead=blocked, registerWrite=blocked, and
bootstrapGrant=blocked. The QEMU device-manager lifecycle proof constructs
that cap object while the attached record is active, records
devicemmio_cap_info_result=ok, exercises the serialized
CapObject::call(0, &[]) path and decodes the returned DeviceMmio.info
Cap’n Proto result as devicemmio_cap_serialized_call_result=ok, then
verifies the same cap fails closed after revoke begins as
devicemmio_cap_stale_after_revoke_result=devicemmio-stale-handle; the same
stale object also fails the serialized method-0 path as
devicemmio_cap_serialized_stale_after_revoke_result=invoke-failed. A later
manifest-grant smoke explicitly releases the granted DeviceMmio cap through
CAP_OP_RELEASE and proves a subsequent typed DeviceMmio.info call fails
closed from userspace. A focused grant-cycle smoke now repeats that grant,
release, and stale-info proof twice in sequence and asserts the second
manager-grant-source acquire receives a fresh mapping generation after the
first release; the same smoke also decodes both acquire/release cycles through
the typed volatile HardwareAuditLog.snapshot surface. The focused
hardware-audit interrupt-waiter smoke also decodes recent boot-time
DmaBuffer, DmaPool, and Interrupt driver-crash / reset-disable
lifecycle records from the current volatile 16-record snapshot window. The same
smoke now uses the startSequence cursor to decode older retained
DeviceMmio lifecycle rows that the default latest 16-record tail has skipped.
A 2026-05-10 15:33 UTC manifest-grant follow-up turns
DeviceMmio.map from admission-only into a read-only userspace VMA over the
boot-preseeded BAR page already used by brokered read32. The typed smoke
validates the active DeviceMmioOperation::Map authority check, rejects
writable, executable, unknown, zero-size, unaligned, out-of-BAR, and overflow
requests with typed no-side-effect results, reads the same QEMU BAR value
through the returned userspace address and DeviceMmio.read32, rejects a
duplicate active map, explicitly calls DeviceMmio.unmap, proves a second
unmap is a typed no-op, remaps, and proves stale unmap fails closed after cap
release. Release/drop/driver-crash/reset-disable cleanup revokes any borrowed
user VMA before detaching the manager record. This is read-only BAR VMA
evidence only: it does not add writable MMIO, volatile register writes,
doorbells, host physical/IOVA exposure, post-userspace kernel MMIO mappings,
IOMMU programming, durable/signed audit persistence, concurrent sharing
semantics, or a production provider-driver consumer. A 2026-05-10 20:06 UTC
follow-up promotes DeviceMmio.write32 to one bounded brokered volatile dword
write through that same boot-preseeded kernel MMIO cache after active
manager-attached handle, owner/state, policy/region, pure Write authority,
dword alignment, decoded-BAR range validation, and a single provider-scoped
claim derived from PCI MSI-X metadata, including BDF, BAR, BAR base, offset,
and value. The focused proof writes the claimed virtio-rng MSI-X entry-0
vector-control mask dword,
reads it back through both brokered read32 and the read-only userspace VMA,
then proves an unclaimed
message-address dword write leaves the original value unchanged. Invalid range
and unclaimed calls remain typed no-write results, while stale or released
handles fail closed before any write and do not return a write32 result
payload. This does not add writable userspace BAR mappings, arbitrary register
writes, doorbells, host physical/IOVA exposure, post-userspace arbitrary
remaps, IOMMU programming, or a production provider-driver consumer.
A 2026-05-26 07:32 UTC follow-up
(ddf-userspace-writable-devicemmio-interrupt) proves the cross-authority
non-implication required by the gate (“holding one authority must not imply
either of the others”). Each grant smoke’s granted CapSet is its sole
authority source: the DeviceMmio smoke holds only console + device_mmio
and the Interrupt smoke only console + interrupt. The smokes assert that
the other DDF grants (interrupt/device_mmio and dmapool) are absent from
the CapSet, and that the held cap cannot be reinterpreted as another
interface because the kernel-delivered interface id is fixed at grant time
(negative-authority ... result=ok lines, with the kernel “spawned … 2 caps”
structural counterpart). This is a non-implication proof over the
already-landed authorities; it adds no new kernel MMIO/IRQ/DMA surface. Real
userspace wait/acknowledge over the live route with deferred LAPIC EOI is
proven separately by the provider tx_interrupt/rx_interrupt consumer
(make run-ddf-provider-consumer).
DMAPool Invariants
DMAPool is the only future userspace-facing authority that may cause a
device-visible DMA address to exist.
- Authority: A holder may allocate buffers only from the pool object it was granted. It may not request arbitrary physical frames, import caller virtual memory by address, or derive another pool.
- Handle identity: A pool operation checks the claimed device id, owner generation, pool id, and pool generation before changing pool state. Buffer operations additionally check the buffer slot and slot generation before descriptor validation, completion accounting, free, scrub, or reuse.
- Physical range: Every exported device address must resolve to pages owned by the pool. The kernel records the allowed host-physical page set and validates every descriptor mapping against that set before a device can use it. If an IOMMU domain backs the pool, the exported address is an IOVA, not raw host physical memory.
- Ownership: Each DMA buffer has one pool owner, one device-domain owner, and explicit CPU mappings. Sharing a buffer with another process requires a later typed memory-object transfer; copying packet data is the default until that object exists.
- No raw grants: Userspace never receives an unrestricted host-physical
address. A driver may receive an opaque DMA handle or an IOVA meaningful
only to its
DMAPool/device domain. It cannot turn that value into access to unrelated RAM. - Residency: DMA pages are committed before exposure to the device, resident for the entire device-visible lifetime, unswappable, and scrubbed before reuse by another owner.
- Bounds: Buffer length, alignment, segment count, and queue depth are bounded by the pool. Descriptor chains that point outside an allocated buffer, wrap arithmetic, exceed device limits, or reference freed buffers fail closed before doorbell writes.
- Revocation: Revoking the pool first quiesces the device path using it, prevents new descriptors, waits for or cancels in-flight descriptors, then removes IOMMU mappings or invalidates bounce-buffer handles before freeing pages.
- Reset: If in-flight DMA cannot be proven stopped, revocation escalates to device reset through the owning device object before pages are reused.
- Residual state: Pages returned from a pool are zeroed or otherwise scrubbed before reuse by a different owner. Receive buffers are treated as device-written untrusted input until validated by the driver or stack.
Device-visible memory authority is not ordinary MemoryObject authority.
FrameAllocator and MemoryObject must not become raw physical-address escape
hatches. A future shared-buffer transfer may share CPU-visible packet bytes
after validation, but it does not by itself grant IOVA creation, descriptor
programming, or device write authority.
For the in-kernel QEMU smoke, the kernel is the only DMAPool holder. The
same invariants apply internally even though no userspace capability object is
exposed yet.
Implementation note, 2026-04-24: the initial virtio-net transport uses kernel-owned frame-allocator pages for RX/TX split-virtqueue descriptor, available, and used rings plus the one-shot TX descriptor proof buffer, ARP TX buffer, ICMP TX buffer, smoltcp adapter/TCP TX buffers, and posted RX packet buffers. The smoltcp adapter copies completed RX frame bytes out of those device-written pages before handing them to the stack. Those pages are programmed into the device only by kernel code after modern PCI transport discovery and feature negotiation; no userspace process receives a DMA buffer, physical address, or BAR mapping.
Implementation note, 2026-05-02: the current QEMU virtio-net DMA path routes
those kernel-owned pages through a bounded device_dma pool ledger. The net
smoke proves live pool bytes, page counts, page-rounded MMIO mapping bytes,
config/RX/TX interrupt holds, RX/TX ring depths, and RX/TX
submission/completion and in-flight descriptor accounting. This is the first
kernel-owned DMAPool accounting proof; it does not expose userspace DMA,
MMIO, or interrupt handles and does not complete the production S.11.2
hostile-smoke gate.
Implementation note, 2026-05-02 06:59 UTC: the kernel-owned device_dma
ledger now has an explicit bounded budget/OOM policy for the current
virtio-net proof path: 32 DMA pages, 131072 DMA bytes, queue depth 8,
submission depth 8, four page-rounded MMIO mapping holds, 16384 MMIO mapped
bytes, and three interrupt holds. make run-net emits a scratch-ledger
device-dma: budget oom proof ... proof_result=ok line proving page and byte
allocation over budget, overlarge queue depth, duplicate and over-budget MMIO
holds, MMIO byte over budget, duplicate and over-budget interrupt holds, and
descriptor submission beyond queue depth all fail closed without mutating the
live virtio-net ledger; the proof also revalidates the live ledger. This is a
bounded prerequisite for the production DMAPool contract. It still does not
expose userspace DMAPool, DeviceMmio, or Interrupt handles, wire real
lifecycle hooks, program IOMMU remapping domains, or close the S.11.2 hostile
smoke matrix.
Implementation note, 2026-05-02 16:31 UTC: attached device-manager DMAPool
records now also carry that budget profile. The lifecycle and imported
live-accounting proofs read page, byte, queue-depth, submission-depth, MMIO
mapping, MMIO byte, and interrupt-hold budgets through the active
device-manager record plus the matching DmaPoolHandle. Queue and submission
depth remain per-queue limits; the manager proof records queue_count=2 plus
derived aggregate in-flight/submission budgets and checks imported virtio-net
accounting against those aggregate totals. This keeps the budget policy tied
to manager state but still does not create userspace DMAPool handles or
enforce budgets at userspace handle creation, transfer, or revoke.
Implementation note, 2026-05-02 21:13 UTC: the zero-live device-manager
DMAPool lifecycle proof now validates a proof-scoped tampered
AttachedDmaPoolBudgetPolicyRecord through the manager budget-policy helper
while the attached pool record is active. The tampered record uses the wrong
policy scope/source/label plus stricter page, byte, queue, in-flight, MMIO,
interrupt, and submission budgets, and it fails closed before it can be
treated as a usable policy. The QEMU proof records
budget_policy_tamper_result=fail-closed,
budget_policy_tamper_allocation=not-created,
budget_policy_tamper_ledger=not-mutated,
budget_policy_tamper_teardown=not-advanced, and
budget_policy_tamper_side_effect=side-effect-blocked. This is still
bounded metadata proof only: no userspace DMAPool handle is exposed, no
production userspace DMA page is allocated, freed, or reused, and no real DMA
teardown is claimed.
Implementation note, 2026-05-02 22:16 UTC: the manager-owned DMAPool
budget-accounting proof now fails closed on accounting over the attached
budget instead of only logging passive booleans. The positive zero-live and
imported-live budget_*_within_policy=true labels call a helper that
revalidates the active attached record, matching DmaPoolHandle, owner, active
state, and attached budget policy before accepting the record accounting.
While the zero-live record remains active, synthetic attached-accounting
candidates exceed buffer count, page count, byte count, and the current
aliased in-flight/submission total; each candidate fails closed before it can
be treated as usable manager state. The QEMU proof records exact overrun
reasons, no fake allocation, no ledger mutation, no teardown advancement, and
side-effect blocking. A proof-scoped over-budget attach candidate now fails
before pool generation allocation and records preserved generation state. This
remains bounded manager-record evidence only; later grant slices add the first
single-page bounce-buffer allocation/free authority, but multi-buffer
allocation, DMA mapping, descriptor execution, IOMMU programming, production
driver consumption, and S.11.2 hostile smokes remain open.
Implementation note, 2026-05-02 23:59 UTC, updated 2026-05-03: the schema and
kernel now include a result-only DMAPool.info skeleton that can wrap a
manager-issued DmaPoolHandle. The object validates the live device-manager
record through validate_dmapool_record() before returning status labels such as
userspaceDmaPool=manager-issued-bounce-buffer, realDma=not-attempted,
hostPhysicalUserVisible=false, and directDma=blocked. The QEMU zero-live
device-manager lifecycle proof constructs that cap object while the attached
record is active, records dmapool_cap_info_result=ok, exercises the
serialized CapObject::call(0, &[]) path and decodes the returned
DMAPool.info Cap’n Proto result as
dmapool_cap_serialized_call_result=ok, then verifies the same cap fails
closed after revoke begins as
dmapool_cap_stale_after_revoke_result=dmapool-stale-handle; the same stale
object also fails the serialized method-0 path as
dmapool_cap_serialized_stale_after_revoke_result=invoke-failed. The same
proof now exercises DMAPool.allocateBuffer through call_with_table() on a
real DmaPoolCap in a CapTable: it decodes bufferIndex=0, verifies
CAP_CQE_TRANSFER_RESULT_CAPS, cap_count=1, the transfer-result record’s
DMABUFFER_INTERFACE_ID, the non-transferable same-session result-cap hold,
and DMABuffer.info through the returned result cap. Duplicate allocation and
stale-after-revoke allocation both fail closed without adding result caps. The
duplicate-active valid-size path now reports a structured schema result with
result=dmapool-already-attached, reason=active-buffer-attached,
sideEffect=side-effect-blocked, and bufferPresent=false; the duplicate path
also preserves the manager generation counter. Invalid-size requests use the
same no-result-cap response shape with
result=dmapool-allocation-request-invalid, the exact request reason,
sideEffect=side-effect-blocked, and bufferPresent=false.
As of the 2026-05-08 DMAPool grant-source follow-up, the same bounded path is
also available through KernelCapSource::DmaPool: the grant source attaches a
fresh zero-live manager-owned pool record, stages matching zero-live release
evidence, and lets the child mint one DMABuffer result cap. A 2026-05-09
release-order follow-up has the smoke explicitly release the parent DMAPool
before the result DMABuffer; the parent detach remains pending until the
DMABuffer frees the page and completes the staged zero-live pool detach. A
later 2026-05-09 follow-up adds a typed DMABuffer.freeBuffer method for that
bounded result cap: the method reuses the same FreeBuffer authority validation
and page scrub/ledger/frame-free cleanup path as cap release, emits a
free-buffer audit event, invalidates later DMABuffer.info, and makes the
later cap release a no-op detach. A second bounded follow-up keeps the parent
DMAPool live after that first explicit free, reallocates the same slot with a
fresh slotGeneration, and then repeats the parent-first release proof on the
second buffer. The focused read-side HardwareAuditLog.snapshot smoke also
decodes both slot generations, both typed free-buffer records, the parent
DmaPool release, and both no-op release-after-free records through the
volatile audit cap. The run-net
DMABuffer
driver-crash and reset-disable proofs also cover the same pending-parent
completion path so successful buffer cleanup cannot orphan the staged parent
release. A later admission follow-up adds typed
DMABuffer.submitDescriptor to the same manifest-granted bounce-buffer path:
the method validates the active manager-attached buffer epoch through the
existing DmaBufferOperation::SubmitDescriptor authority validator, echoes
the queue/descriptor/length and generation identity, and proves the same call
fails closed after freeBuffer revokes the old cap. A later symmetric
follow-up adds typed DMABuffer.completeDescriptor to the same bounded path.
The 2026-05-10 request-shaping follow-up routes both typed descriptor calls
through a shared pure bounded descriptor validator: valid bounce-buffer
requests return ok request labels plus queue_count=4,
descriptor_count=8, and buffer_bytes=4096, while out-of-range
queues/descriptors, zero submit lengths, submit lengths beyond the bounce
buffer, and completion lengths beyond the bounce buffer fail closed as
dmabuffer-descriptor-request-invalid with side-effect-blocked before any
descriptor side effect. A later manager-accounting follow-up records only the
bounded manager counter: submit returns manager-inflight-recorded and
raises DMAPool.info live_inflight to 1, completion returns
manager-inflight-completed and restores it to 0, and a valid completion
with no outstanding submission returns dmabuffer-no-inflight-submission.
Too-small descriptor result buffers are rejected before accounting mutation,
and cap-table release drains bounded in-flight accounting before detaching the
bounce buffer. The 2026-05-10 06:37 UTC follow-up makes this
allocateBuffer/freeBuffer page lifecycle the first production-labeled
single-page bounce-buffer allocation/free authority. The typed surfaces report
userspaceDmaPool=manager-issued-bounce-buffer,
allocation=single-bounce-buffer-page,
recordPool=userspace-bounce-buffer-live,
zero-live-dmapool-bounce-buffer, and freeBuffer=bounce-buffer-page; the
underlying device_dma ledger uses a manager-attached bounce-buffer helper and
scrubs before frame free. A 2026-05-10 11:44 UTC follow-up extends that bounded
manifest-granted path to two fixed manager-owned slots; a 2026-05-10
12:49 UTC follow-up extends the same path to three fixed manager-owned slots:
slot 0, slot 1, and slot 2 can be live together, DMAPool.info reports three
live buffers/pages while all are attached, a fourth allocation fails closed as
dmapool-already-attached, and slot 0 can be freed and reused with a fresh
generation while slots 1 and 2 remain live. There is still no allocation
beyond those three fixed slots, real device-visible DMA mapping, host physical
address or IOVA exposure, BAR mapping, production descriptor-ring mutation, CQ
publication, IOMMU programming, or production driver consumer.
Stale allocation attempts preserve the live backing, and page allocation
failure occurs before buffer-generation allocation so it does not burn a
generation.
The 2026-05-10 13:45 UTC follow-up (3bbeb3d4) adds explicit typed
DMABuffer.unmap for the mapped bounce-buffer userspace VMA. The method
validates the live DMABuffer handle before reporting success or no-op,
removes only the borrowed VMA owned by that mapping for the calling process,
and publishes the mapping as absent only after the borrowed-range ownership
check, page-table unmap, TLB wait, and waiter cleanup succeed. While teardown
is in progress, concurrent map/free/release paths fail closed against an
in-progress mapping state. A second unmap returns dmabuffer-mapping-absent /
no-user-mapping with no side effect. This is userspace VMA cleanup only: it
does not free or scrub the bounce page, detach the buffer record, change
DMAPool.info live buffer/page/in-flight counts, program or remove real DMA or
IOMMU mappings, expose host physical/IOVA addresses, mutate descriptor rings,
publish CQ entries, or add a production driver consumer.
The 2026-05-10 14:12 UTC follow-up moves bounded descriptor accounting from a
single pool-global descriptor identity to per-slot state on each live
manager-owned DMABuffer record. DMAPool.info live_inflight remains the
aggregate sum across live slots. A valid submit on slot 0 and a valid submit
on slot 1 can coexist; duplicate submit on either same slot still fails
closed; mismatched completion preserves that slot’s descriptor without
touching other slots; matching completion of slot 0 decrements the aggregate
while slot 1 remains in flight; explicit freeBuffer of an in-flight slot
fails closed; and cap-release/process-exit cleanup drains only the releasing
slot’s descriptor before detach. This is still bounded manager accounting and
does not mutate descriptor rings, publish CQ entries, expose host physical or
IOVA addresses, attempt direct DMA, program IOMMU state, or add a production
driver consumer.
The 2026-05-10 18:11 UTC follow-up makes the single manager-owned
bounce-buffer page exclusive between userspace borrowed-VMA ownership and
manager in-flight descriptor ownership for each DMABuffer cap. A valid
submit while the same cap still has a live mapping fails closed as
dmabuffer-mapping-live / user-mapping-live before manager in-flight
accounting changes; explicit unmap restores submit. A valid map while the
slot has an in-flight descriptor fails closed as dmabuffer-inflight-submission
/ in-flight-submission, returns addr=0, and does not publish a borrowed
VMA; matching completion restores map for that slot. The lock order remains
cap mapping state before manager validation, with no address-space lock held
across manager state mutation.
The 2026-05-10 19:29 UTC follow-up adds bounded completion data on that same
manager-owned bounce-buffer page. The successful matching
DMABuffer.completeDescriptor path keeps the existing
manager-inflight-completed result labels, validates the active owner,
pool/slot generation, queue/descriptor identity, and submitted length, then
writes a deterministic byte pattern into only the accepted completionLength
bytes before clearing the in-flight record. Because submit is blocked while a
cap-owned borrowed VMA is live and map is blocked while the slot is in flight,
the write happens while no live user VMA exists for that slot; a later
successful map lets userspace read the pattern. Invalid requests, stale caps,
no-inflight completions, descriptor mismatches, length-exceeded completions,
mapped-live cases, and after-free calls do not write. This is
userspace-visible bounce-buffer completion data, not device DMA completion:
there is still no descriptor-ring mutation, CQ publication, direct DMA,
host physical/IOVA exposure, IOMMU programming, or production driver consumer.
Implementation note, 2026-05-10 04:40 UTC: duplicate-active valid-size and
invalid-size DMAPool.allocateBuffer requests now use schema result data for
domain rejection instead of an application-exception label. The no-result-cap
response reports either result=dmapool-already-attached /
reason=active-buffer-attached or
result=dmapool-allocation-request-invalid with the exact request reason, plus
sideEffect=side-effect-blocked and bufferPresent=false, before any
allocation side effect.
The 2026-05-10 live-accounting follow-up also carries that bounded frame into
the attached DMAPool record exposed by typed DMAPool.info: the manager
record starts as zero-live-dmapool-bounce-buffer, changes to
userspace-bounce-buffer-live with one, two, or three live 4096-byte pages while
manager-attached DMABuffer result caps exist, and returns to zero-live after
typed DMABuffer.freeBuffer or cap release scrubs/releases every live bounce
page. The same lifecycle proof validates that active manager-record accounting
against the attached budget policy before treating allocation as usable state.
This still does not expose a device-visible DMA address, IOVA, host physical
address, production descriptor side effect, DMA mapping, or production driver
consumer.
Implementation note, 2026-05-11 03:00 UTC: the manifest-granted
manager-owned fixed bounce-buffer DMAPool path now has its own
device-manager budget policy instead of importing the live virtio-net
device_dma policy. The policy covers three live buffers/pages, 12288 bytes,
four queues, eight descriptors per queue, one in-flight descriptor per live
slot, zero MMIO mappings/bytes, and zero interrupt holds.
DMAPool.allocateBuffer validates the next-live manager accounting against
that policy before slot selection, frame allocation, generation allocation,
result-cap minting, or manager ledger mutation. With all three fixed slots
live, a fourth valid-size allocation returns no result cap and reports
result=dmapool-budget-exceeded, reason=over-buffer-budget,
sideEffect=side-effect-blocked, and bufferPresent=false. Imported live
virtio-net proof records continue to use the device_dma:virtio-net budget
policy. This remains the bounded manager-owned bounce-buffer path: direct DMA
is blocked, host physical addresses and IOVAs stay hidden, descriptor rings
and completion queues are not mutated, and IOMMU/remapping plus production
driver consumption remain out of scope.
Implementation note, 2026-05-11 06:10 UTC: the same fixed-slot
manager-owned DMAPool family now revalidates current budget accounting
before publishing an allocated DMABuffer result cap, before acquire-audit
publication, before parent DMAPool release intent or detach, before
grant-rollback/drop/teardown detach, before pending-parent release completion,
before DMABuffer page release, and before descriptor-completion state
advancement. The focused grant smoke labels the full-pool budget rejection as
no leaked result cap, no generation burn, no ledger mutation, and no stale
authority publication. DeviceMmio and Interrupt budget propagation is not
changed by this slice.
Implementation note, 2026-05-03 02:05 UTC: the schema and kernel now include a
result-only DMABuffer.info skeleton that can wrap the manager-attached
DmaBufferHandle record already issued inside the zero-live DMAPool
lifecycle proof. The object validates the live manager-owned buffer record
through validate_dmabuffer_record() and the existing pure DMA-buffer
validator before returning status labels such as
userspaceDmaBuffer=manager-issued-bounce-buffer,
managerRecord=validated-active, bufferRecord=manager-attached-buffer,
realDma=not-attempted, hostPhysicalUserVisible=false,
directDma=blocked, descriptorSubmit=manager-inflight-accounting,
descriptorComplete=manager-inflight-accounting, freeBuffer=bounce-buffer-page, and
bootstrapGrant=blocked. The QEMU zero-live lifecycle proof constructs that
cap object while the buffer record is active, records
dmabuffer_cap_info_result=ok, exercises the serialized
CapObject::call(0, &[]) path and decodes the returned DmaBuffer.info
Cap’n Proto result as dmabuffer_cap_serialized_call_result=ok, then verifies
the same cap fails closed after revoke begins as
dmabuffer_cap_stale_after_revoke_result=dmabuffer-stale-handle; the same
stale object also fails the serialized method-0 path as
dmabuffer_cap_serialized_stale_after_revoke_result=invoke-failed. Later
bounded grant slices expose this result cap through the DMAPool
manifest-grant path, add typed DMABuffer.freeBuffer, and add bounded
userspace bounce-buffer DMABuffer.map / DMABuffer.unmap plus
manager-accounted request-shaped DMABuffer.submitDescriptor /
DMABuffer.completeDescriptor; the path still exposes no real DMA,
descriptor-ring mutation, CQ publication, host physical address or IOVA export,
production page cleanup/reuse, or production userspace DMABuffer completion.
Implementation note, 2026-05-11 11:22 UTC: the provider-consumer smoke now
uses that same bounded bounce-buffer path to prove a
descriptor-ring-equivalent provider side effect. After
DMABuffer.submitDescriptor validates live owner/pool/slot authority,
descriptor queue/id/length bounds, no live user mapping, and no duplicate
in-flight descriptor, the manager scrubs the page and writes a provider-visible
shadow descriptor entry with magic, queue, descriptor id, submitted length, and
flags before writing the existing submit marker and committing the in-flight
record. DMABuffer.completeDescriptor still writes completion bytes only
inside the validated completion length, and the smoke proves the shadow entry
and marker remain visible outside that completion window. Provider-effect
submits shorter than the 24-byte shadow-descriptor-plus-marker footprint are
rejected as a typed no-side-effect boundary, even though the shared descriptor
request shape is otherwise valid. This is a bounded bounce-buffer side effect
only; no hardware descriptor ring, CQ publication, MMIO doorbell, direct DMA,
IOVA, host physical address, or remapping-domain claim is added.
Implementation note, 2026-05-11 12:01 UTC: the same submit path now replaces
the shadow-descriptor payload with a selected provider-owned queue entry plus
marker. The entry records queue magic, queue id, tail, descriptor id,
submitted length, and flags after descriptor authority validation and the
submit scrub; make run-ddf-provider-consumer maps the buffer after
completion and proves the queue entry and marker remain visible outside the
completion window. Provider-effect submits shorter than the current 72-byte
provider queue-entry-plus-marker footprint are rejected before in-flight
accounting or provider-visible mutation. This remains bounded bounce-buffer
evidence only:
no hardware descriptor ring, CQ publication, MMIO doorbell, direct DMA, IOVA,
host physical address, or remapping-domain claim is added.
Implementation note, 2026-05-12 20:30 UTC: the accepted
DMABuffer.submitDescriptor path now constructs the candidate in-flight
descriptor in manager-local state and validates the resulting
DMAPool budget/accounting before scrubbing or writing provider-visible
bounce-buffer bytes. The provider-consumer smoke snapshots the selected
provider queue-entry and marker bytes, drives a short provider-effect submit
rejection, and proves both those bytes and live_inflight=0 are preserved.
The selected virtio-net TX publication gate remains separately bounded after
provider-entry write: quiesced publication still fails with no extra pin and
no doorbell, not with rollback of the already-written shadow entry.
Implementation note, 2026-05-11 14:39 UTC, branch commit f04a14f4: the
selected provider-owned queue entry now carries a staged claimed virtio-net
notify-offset admission record instead of only a “requires claim” gate. The
selected queue 1 path records accepted notify-offset admission plus blocked
wrong-queue and wrong-offset admissions after descriptor authority validation
and submit scrub; a separate queue 0 submit proves non-selected queues remain
neutral and blocked from selected-backend doorbell metadata. This is not a
real doorbell path: no virtio-net notify BAR handle is granted, no notify
register is written, no real virtio-net descriptor ring is mutated, and
production userspace NIC readiness is not claimed.
Current QEMU evidence: the same make run-net path now
exports a bounded live-pool snapshot from the kernel-owned virtio-net
device_dma ledger and feeds it into a device-manager DMAPool record proof.
The live record carries buffer/page count, live bytes, current in-flight
submissions, committed/resident/unswappable flags, and scrub-before-release
policy. The device-manager proof rejects both DmaMappingsRemoved and teardown
detach while that authoritative ledger snapshot remains live. The proof now
calls the device_dma teardown-evidence API and records the expected
authoritative-ledger-live block with matching imported live accounting, then
reports completion as deferred because no authoritative zero-live/scrubbed
evidence is available for the live virtio-net ledger. It does not zero the
imported record to simulate teardown, does not claim DmaMappingsRemoved,
terminal Dead, or release for the live virtio-net pages, and does not scrub
or free live virtio-net DMA pages. This is still a prerequisite
record-accounting proof:
the current pages remain kernel-owned, only bounded info-skeleton hardware cap
grants are exposed to userspace, production page release hooks for live
virtio-net DMA are not wired, IOMMU remapping is not
programmed, and S.11.2 hostile smokes remain open.
The same smoke emits a separate device_dma scratch proof for the positive
zero-live teardown-evidence path: teardown_evidence() fails closed before
quiesce and scrub markers, rejects one-marker states, and reports
authoritative-ledger-zero-live only after both markers are set. The scratch
proof revalidates the live virtio-net ledger but does not mutate, zero, scrub,
free, or claim teardown completion for real virtio-net DMA pages.
Implementation note, 2026-05-03 03:21 UTC: the zero-live device-manager
DMAPool lifecycle proof now consumes that scratch zero-live evidence before
final pool detach, and binds that evidence to the attached record source. The
manager-owned zero-live record is labeled device-manager /
zero-live-dmapool-bounce-buffer; imported live records keep the source labels from
the authoritative device_dma snapshot. After the proof-scoped buffer record
is actively freed and detached, a zero-live pool detach with mismatched
virtio-net / kernel scratch evidence fails closed as
dmapool-zero-live-evidence-invalid, and detach without authoritative
evidence fails closed as dmapool-zero-live-evidence-absent. Only scratch
authoritative-ledger-zero-live evidence carrying the same record source plus
both quiesce and scrub markers allows the manager-owned record to detach and
the revocation path to advance to DmaMappingsRemoved, Dead, and release.
This remains scratch/no-real-DMA evidence: it does not tear down live
virtio-net pages, program or remove IOMMU mappings, expose userspace
DMAPool mapping/descriptor authority, or claim production page cleanup
beyond the bounded manager-attached bounce page.
The current scratch proof set also covers stale DMA page handles without
touching real virtio-net pages: reusing the same synthetic physical page bumps
the DMA page generation, the old handle fails closed as stale-dma-handle,
wrong-queue and wrong-label frees preserve the active page record, and duplicate
free remains rejected. Production userspace DMAPool stale-handle smokes,
descriptor-abuse coverage, revoke/reset races, and real quiesce/scrub/release
remain open.
Implementation note, 2026-05-02 23:23 UTC: the kernel-owned virtio-net
device_dma page release path now validates the DeviceDmaAllocation against
the live ledger before any scrub or frame-allocator call, scrubs the frame
through the HHDM mapping, removes the ledger entry only after scrub succeeds,
and then returns the frame to the allocator. The frame scrub helper checks
frame alignment, HHDM/allocator initialization, range, and allocated state
before zeroing the page. make run-net emits a bounded
device-dma: release scrub proof line using a proof-only kernel-owned page:
stale generation, wrong queue, and wrong label release attempts fail before
scrub, frame free, or ledger mutation; the active release path records
scrub_before_frame_free=true and ledger_removed_after_scrub=true. This is
still no-userspace-handle/no-real-teardown evidence for the current
kernel-owned virtio-net path only; production DMAPool handles, real
device-manager teardown hooks, IOMMU/bounce-buffer mapping removal, hostile
stale-DMA smokes, and full page lifecycle cleanup remain open.
Implementation note, 2026-05-02 12:28 UTC: the same scratch proof family now
covers stale DMA completion ordering without touching real virtio-net pages. A
synthetic reused DMA page slot bumps generation, stale completion validation
checks the page generation before queue-completion accounting, and the old
completion fails closed as stale-dma-handle before completion counters can
underflow or the reused page/submission state can mutate. This is still
prerequisite evidence only: production userspace DMAPool hostile smokes,
reset/revoke races with outstanding descriptors, CQ notification publication,
real quiesce/scrub/release, and IOMMU or bounce-buffer teardown remain open.
Implementation note, 2026-05-02 14:03 UTC: that scratch stale-DMA-completion
proof now adapts the synthetic DMA buffer slot into
capos-lib::device_authority before completion accounting can mutate. The
QEMU line records current-handle validation as ok, stale same-slot reuse as
stale-slot-generation, and side-effect-blocked, then preserves the
existing stale-dma-handle completion outcome. This remains a
scratch/no-real-DMA validator adapter, not production userspace DMAPool
authority or S.11.2 hostile-smoke completion.
Implementation note, 2026-05-02 15:15 UTC: the same scratch proof family now
adds a paired stale-completion publication check. A synthetic reset bumps the
device owner generation so an old completion fails as
stale-owner-generation with side-effect-blocked before any CQ publication.
The same-slot reuse path then fails the old completion as stale-dma-handle,
preserves the new submission accounting, and records both
cq_publication_blocked=true and new_owner_exposure_blocked=true. This is
still scratch/no-real-DMA evidence; production userspace DMAPool completion
notification, real hardware stale-completion injection, reset/revoke races,
and IOMMU or bounce-buffer teardown remain open.
Implementation note, 2026-05-02 12:46 UTC: the device-manager interrupt
handoff proof now includes a bounded stale IRQ after-detach check. After an
attached interrupt route is detached, the proof delivers the old LAPIC vector
through the dispatch path and requires stale_irq_delivery_after_detach to be
unregistered with stale_irq_wake_blocked=true; the old route handle also
continues to fail as interrupt-stale-route. This is prerequisite evidence
for route teardown ordering only. It does not cover a pending hardware IRQ
across reset, userspace Interrupt waiter wakeup semantics, or reassignment
reuse by a new owner.
Implementation note, 2026-05-02 14:24 UTC: the same device-manager interrupt
handoff proof now adapts the attached source into
capos-lib::device_authority before exercising the active wait path. After
revoke begins, the old handle fails the pure validator as
stale-owner-generation with side-effect-blocked, then the proof preserves
the existing interrupt-stale-route, detached-vector unregistered, and
stale_irq_wake_blocked=true checks. This remains a proof adapter, not
production userspace Interrupt handles, real waiters, reset/reassignment
reuse proof, or S.11.2 hostile-smoke completion.
Implementation note, 2026-05-02 14:51 UTC: the interrupt handoff smoke now
adds two bounded stale IRQ ordering points. After revoke begins and while the
route is still attached, delivery to the old vector remains masked, matches
the attached route generation, reports wake blocking, and leaves the old route
delivery count unchanged. During the reset phase, a synthetic route-registry
same-vector reuse proof re-registers and claims the route with bumped source
and route generations, then shows delivery to that vector is still masked,
matches the new route generation, and leaves the reused route delivery count
unchanged.
This is route-manager prerequisite evidence only: it is not a true pending
hardware MSI/reset hostile smoke, does not involve userspace Interrupt
waiters, and does not prove DMA buffer reuse race closure.
Implementation note, 2026-05-03 18:48 UTC: the interrupt handoff smoke now
snapshots a bounded pending IRQ token from the old vector, source id, source
generation, and route generation before revocation. Checking that token after
revoke blocks as stale-pending-irq-masked with reason route-masked; after
detach it blocks as stale-pending-irq-unregistered; and after reset/reuse it
blocks as stale-pending-irq-generation with reason
source-route-generation-mismatch. Each check records
side-effect-blocked, wake blocking, and unchanged delivery counts, and the
reset/reuse check records that the new route did not receive a delivery count.
The same proof rejects a malformed pending token with a zero generation as
stale-pending-irq-invalid-state before any delivery-count mutation.
This was bounded token-generation evidence only and did not inject a real
pending MSI across reset; the real-int $vector injection added at
2026-05-05 18:17 UTC (see below) closes the S.11.2.7 stale IRQ
hostile-smoke gate row by exercising the production CPU exception entry path
across the same revoke, detach, and reset/reuse boundaries.
Implementation note, 2026-05-09 18:12 UTC: the pending IRQ token decision path
now has a pure capos-lib::device_authority validator. Host tests cover
current-route acceptance and the same fail-closed label space used by the
kernel/QEMU proof: stale source generation, stale route generation, both
generations changed after reuse, source mismatch, route masked, route
unregistered, invalid route state, invalid owner, malformed zero-generation or
unassigned source identity, and unsupported vector. The kernel still snapshots
the live dispatch slot and delivery count, but delegates the pending-token
identity/state decision to that shared helper before returning
stale-pending-irq-* labels. This is validator/adapter coverage only; it does
not expose production userspace Interrupt waiters or wait/ack/mask/unmask
authority.
Implementation note, 2026-05-05 18:17 UTC: the interrupt handoff smoke now
fires a real INT $vector instruction at the device MSI vector at three
points across the revoke/reset/reuse boundary, exercising the production
IDT entry, extern "x86-interrupt" stub, record_lapic_delivery dispatch
slot read, and LAPIC EOI write rather than the helper-call path the prior
proofs used. The new proof scope strings drop “no-real-msi” and read
stale-vector-after-detach-real-int-vector-injected,
manager-attached-claimed-masked-after-revoke-real-int-vector-injected,
route-registry-vector-reuse-during-reset-real-int-vector-injected, and
bounded-pending-irq-token-generation-real-int-vector-cross-reset-injection.
Each injection point requires the slot’s delivery count to remain zero
before and after the real INT, and the post-INT outcome to match
masked (after revoke, route still attached but masked), unregistered
(after detach, slot cleared), and masked (after reset/reuse, slot now
belongs to a freshly registered+claimed route with bumped source and route
generations). The proof emits a closure summary
s11_2_7_proof_scope=s11-2-7-stale-irq-after-reset-real-int-vector-cross-reset-injection-no-userspace-waiter,
s11_2_7_real_irq_injected_across_reset=ok,
s11_2_7_old_waiter_cannot_wake_new_owner=true, and
s11_2_7_stale_ack_blocked=true. This closes the S.11.2.7 row of the
hostile-smoke acceptance matrix at make run-net (which invokes
tools/qemu-net-smoke.sh). It does not yet create a userspace Interrupt
waiter object; the in-flight delivery is observed via the kernel-owned
dispatch slot atomic state machine that the production path consumes.
S.11.2.9 hostile-smoke gate-wiring closed 2026-05-05 20:49 UTC (see
the implementation note below).
Implementation note, 2026-05-05 19:37 UTC: the device-manager hostile-smoke
suite now closes the S.11.2.8 stale-DMA-completion-after-reset row.
prove_qemu_stale_dma_completion_handoff claims a fresh probe-then-driver
record on the virtio-net PCI BDF (separate from the live virtio-net driver
state) and walks it through the same revoke, detach, and reset/reuse
boundaries S.11.2.7 uses. At each boundary the proof allocates a real
virtio-net DMA page through the production
device_dma::allocate_virtio_net_page helper, frees the page through
device_dma::free_virtio_net_page, reallocates so the live ledger’s page
generation advances, and synthesizes a stale DeviceDmaAllocation keyed to
the live phys with a decremented generation. The synthesized stale handle
is then fed to the production
device_dma::record_virtio_net_completion_for_allocation path – the same
function the live virtio-net Virtqueue::record_used_completion_for_allocation
invokes after descriptor tracking validates a hardware used-ring entry.
The production validator rejects each stale injection as stale-dma-handle
before any queue accounting decrement, completion side effect, CQ
publication, or new-owner memory exposure. The bounded run-net proof
records real_completion_inject_after_revoke_result=stale-dma-handle,
real_completion_inject_after_detach_result=stale-dma-handle,
real_completion_inject_after_reset_reuse_result=stale-dma-handle, all
three with side-effect-blocked, queue_account_preserved=true,
live_page_preserved=true, cq_publication_blocked=true,
new_owner_exposure_blocked=true, freed_buffer_unchanged=true, and
generation_bumped=true, plus a closure summary
s11_2_8_proof_scope=s11-2-8-stale-dma-completion-after-reset-real-free-realloc-cross-revoke-detach-reset-reuse-no-userspace-dmapool,
s11_2_8_real_completion_injected_across_reset=ok,
s11_2_8_old_completion_cannot_publish_to_new_owner=true,
s11_2_8_freed_buffer_reuse_blocked=true, and
s11_2_8_accounting_underflow_blocked=true. The new shape is enforced in
tools/qemu-net-smoke.sh and runs from make run-net. This is the
production paired stale-DMA-completion proof showing old completions cannot
publish stale CQ notifications or expose new-owner memory after real
revoke, detach, and reset/reuse boundaries with real free + realloc page
generation advances on the live kernel-owned ledger; S.11.2.9 hostile-smoke
gate-wiring closed 2026-05-05 20:49 UTC (see the implementation note
below). Userspace DMAPool handles and real device-manager page
quiesce/scrub/release hooks remain open as separate follow-ups.
Implementation note, 2026-05-05 20:49 UTC: the S.11.2.9 hostile-smoke
coverage row of the acceptance matrix is closed by aggregating every
matrix-row proof line into the make run-net -> tools/qemu-net-smoke.sh
gate. Every proof line referenced by the matrix has at least one
assertion in the harness today; the assertion shape varies by row.
The two driver-crash lines wired by that gate slice, the existing
S.11.2.8 device-manager: dma completion handoff proof closure-summary
line, and the S.11.2.7 device-manager: interrupt handoff proof
closure-summary line (whose trailing anchor was added by this slice
for harness-strictness consistency with S.11.2.8) all use anchored
extended-regex assertions (field-by-field match plus
proof_result=ok[[:cntrl:]]?$ trailing anchor); other matrix-row rows
reuse the harness’s pre-existing mix of unanchored extended-regex and
fixed-string grep -Fq assertions on the emitted proof lines. The two previously unasserted lines wired by this
slice are device-manager: devicemmio driver crash hook proof source=devicemmio-driver-crash-hook ... trigger_path=trigger-driver-crash-for-devicemmio and
device-manager: interrupt driver crash hook proof source=interrupt-driver-crash-hook ... trigger_path=trigger-driver-crash-for-interrupt. Both proofs were
already emitted by the kernel on every boot (via
prove_qemu_devicemmio_driver_crash_hook and
prove_qemu_interrupt_driver_crash_hook in kernel/src/device_manager/proofs.rs)
and exercise the explicit driver-crash teardown trigger path with a stale
rerun noop, validate-live revoked cap state, and cap_release_after_crash
as noop. The chosen gate strategy keeps S.11.2.9 inside make run-net
rather than splitting into a separate make run-hostile-smokes target,
because all six matrix rows depend on the same virtio-net device bring-up
state (probe-then-driver records on the virtio-net BDF, real IDT vector
injection, real DMA page free + reallocate). A separate target would
duplicate the bring-up cost without adding coverage. Tightening the
remaining unanchored assertions to the same anchored shape is a
follow-up harness-hardening task and is not part of S.11.2.9 closure because
each affected proof line is still uniquely identified by its
emitted prefix and the asserted field set. Production userspace
DMAPool/DeviceMmio/Interrupt handles, real device-manager page
quiesce/scrub/release hooks, hardware-backed provider-driver Interrupt
wait/ack dispatch beyond the current bounded route-dispatch waiter proof,
durable/signed production audit consumption beyond the first volatile
HardwareAuditLog.snapshot cap, and IOMMU domain programming all remain open
as separate follow-ups tracked in
docs/backlog/hardware-boot-storage.md and the docs/tasks/README.md
userspace-driver-transition bullet.
Implementation note, 2026-05-08 09:44 UTC: the same make run-net gate now also
asserts cap-specific DMA driver-crash proofs. DmaBufferCap routes the
explicit trigger through the bounded FreeBuffer cleanup path and proves page
scrub/ledger/frame-free labels before stale rerun and post-trigger cap release
both return noop; DmaPoolCap routes the explicit trigger through the
zero-live evidence-gated detach path and proves authoritative zero-live,
quiesced, and scrubbed evidence labels before stale rerun and post-trigger cap
release return noop.
Implementation note, 2026-05-03 01:05 UTC: the schema and kernel now include a
result-only Interrupt.info skeleton that can wrap a manager-issued device
handle plus the attached DeviceInterruptRoute. The object validates the
live manager record, owner, claimed route, and attached route record through
validate_interrupt_record() before returning status labels such as
userspaceInterrupt=manager-issued-skeleton,
managerRecord=validated-active, routeRecord=manager-attached-route,
realInterruptDelivery=not-delivered, wait=admission-check-only,
acknowledge=admission-check-only, mask=route-state-control,
unmask=route-state-control, and bootstrapGrant=blocked. The interrupt handoff
QEMU proof constructs that cap object while the route record is active,
records interrupt_cap_info_result=ok, exercises the serialized
CapObject::call(0, &[]) path and decodes the returned Interrupt.info
Cap’n Proto result as interrupt_cap_serialized_call_result=ok, then verifies
the same cap fails closed after revoke begins as
interrupt_cap_stale_after_revoke_result=interrupt-stale-handle; the same
stale object also fails the serialized method-0 path as
interrupt_cap_serialized_stale_after_revoke_result=invoke-failed. A later
manifest-grant smoke explicitly releases the granted Interrupt cap through
CAP_OP_RELEASE and proves a subsequent typed Interrupt.info call fails
closed from userspace. A focused grant-cycle smoke now repeats that grant,
release, and stale-info proof twice in sequence and asserts the second
manager-grant-source acquire preserves the source generation and receives a
fresh route generation after the first release; the same smoke also decodes
both acquire/release cycles through the typed volatile
HardwareAuditLog.snapshot surface. The focused
hardware-audit interrupt-waiter smoke also decodes recent boot-time
DmaBuffer, DmaPool, and Interrupt driver-crash / reset-disable
lifecycle records from the current volatile 16-record snapshot window. The same
smoke now uses the startSequence cursor to decode older retained
DeviceMmio lifecycle rows that the default latest 16-record tail has skipped.
A 2026-05-09 19:18 UTC follow-up adds a bounded Interrupt.wait admission
method to that skeleton. The method validates the same manager-attached route,
snapshots a pending-token candidate, delegates to the shared
capos-lib::device_authority pending-IRQ validator, and returns typed labels
through capos-rt; the focused grant smoke asserts the current masked-route
result stale-pending-irq-masked, reason route-masked,
side-effect-blocked, matching token/current source and route generations,
unchanged delivery counts, no waiter wake, and fail-closed behavior after cap
release. This is bounded
manager-issued skeleton evidence only: there is no blocking wait,
real hardware acknowledgement, real hardware mask/unmask side effect,
interrupt delivery authority, real waiter object, production Interrupt completion,
durable/signed audit persistence, or concurrent sharing claim. A
2026-05-09 23:21 UTC follow-up adds bounded Interrupt.acknowledge
admission to the same skeleton. It validates the manager-attached route through
the existing Acknowledge authority path, returns
admission-check-only, interrupt-ack-not-attempted, and
side-effect-blocked, and proves delivery counts remain unchanged with no
waiter wake or hardware acknowledgement.
A 2026-05-09 23:52 UTC follow-up adds bounded Interrupt.mask and
Interrupt.unmask admission to the same skeleton. They validate the
manager-attached route through the existing Mask and Unmask authority
paths, return admission-check-only,
interrupt-mask-not-attempted / interrupt-unmask-not-attempted, and
side-effect-blocked, and prove route state and delivery counts remain
unchanged with no hardware mask/unmask, waiter wake, or IRQ delivery.
A 2026-05-10 04:01 UTC follow-up promotes those methods to bounded
route-state control over the manager-attached dispatch slot. Interrupt.unmask
now changes claimed-masked to driver-unmasked, Interrupt.mask changes it
back to claimed-masked, and both preserve delivery counts while still
avoiding hardware MSI/MSI-X table programming, waiter wakeup, hardware
acknowledgement, or real IRQ delivery.
A 2026-05-10 22:54 UTC follow-up wires real waiter completion to the
existing route-dispatch delivery counter from scheduler/poll context. The
poller observes matching delivered routes by vector, source generation, and
route generation without taking the waiter-table lock in the IRQ dispatch path,
then revalidates the manager-attached route under the route-post exclusion
before posting the deferred cap completion. The focused grant smoke proves the
first unmasked manifest-granted wait completes as interrupt-delivered /
waiter-completed-irq with real_interrupt_delivery=delivered and an
advanced delivery count, while a second unmasked wait still remains pending
until Interrupt.mask completes it through the existing
interrupt-waiter-cancelled / waiter-completed-no-irq path. Stale, masked,
released, reset, or reused routes do not wake as delivered IRQs. This remains a
bounded route-dispatch proof; it does not program hardware MSI/MSI-X tables,
acknowledge hardware, add provider-driver interrupt consumption, or claim
hostile hardware isolation.
Implementation note, 2026-05-03 13:49 UTC: the result-only
DMAPool.info, DMABuffer.info, DeviceMmio.info, and Interrupt.info
surfaces now return numeric identity fields alongside the conservative status
labels. The fields mirror the documented handle identity shape: deviceId,
BDF bus/device/function, owner generation, and the relevant pool id/generation,
buffer slot/generation, BAR/mapping id/generation, or interrupt
source/generation/route generation. The QEMU proof logs and net smoke assert
the active serialized method-0 decode for those fields, while stale method-0
calls still fail closed as invoke-failed. This remains a result-only
manager-issued skeleton surface; it does not add production DMA allocation,
free, map, submit, or completion authority, production MMIO mapping or
doorbell authority, production interrupt wait/ack/mask/unmask authority, real
DMA page cleanup/reuse, hostile hardware isolation, or S.11.2 completion.
Implementation note, 2026-05-03 16:37 UTC: the bounded interrupt route
identity skeleton now carries separate source and route generations end to end.
DeviceInterruptRoute, LegacyIoApicInterruptRoute, route records,
diagnostics summaries, dispatch-slot metadata, and the device-manager attached
interrupt bridge store both generations. Registration allocates both fields,
PCI MSI-X route reassignment preserves source generation while bumping only the
route generation, release/re-register allocates both generations fresh, and
Interrupt.info returns the independent values. The QEMU net smoke asserts the
split in PCI and legacy route logs, metadata proof logs, handoff proof logs,
serialized Interrupt.info, and cap-release proof logs. This closes only the
bounded identity proof gap; it does not expose production userspace
Interrupt authority, create real waiters, or complete the S.11.2 hostile IRQ
smoke matrix.
DeviceMmio Invariants
DeviceMmio is register authority, not memory authority.
- Authority: A holder may map only BARs or subranges recorded in the claimed device object. It may not map PCI config space globally, another function’s BAR, RAM, ROM, or synthetic kernel pages.
- Handle identity: Each call checks the claimed device id, owner generation, BAR or subrange mapping record, and mapping generation before mapping, unmapping, reading, or writing registers.
- Physical range: Each mapping is bounded to the BAR’s decoded physical range, page-rounded by the kernel, and tagged as device memory with cache attributes appropriate for MMIO. Partial BAR grants must preserve page-level isolation; otherwise the grant must cover the whole page-aligned register window and be treated as that much authority.
- Ownership: At most one mutable driver owner controls a device function’s
MMIO at a time. Management capabilities may inspect topology, but register
writes require the claimed
DeviceMmioobject. - No DMA implication: Mapping registers does not grant any DMA buffer,
frame allocation, interrupt, or config-space authority. Doorbell writes are
accepted only as effects of register access; descriptor validity is enforced
by
DMAPoolbefore queues are made visible to the device. - Revocation: Revocation unmaps the driver’s register pages, marks the device object unavailable for new calls, and invalidates outstanding MMIO handles. Stale mappings or calls fail closed.
- Reset: Revoking the final mutable
DeviceMmioowner resets or disables the device unless a higher-level device manager explicitly transfers ownership without exposing it to an untrusted holder.
Interrupt Invariants
Interrupt is event authority for one routed source.
- Authority: A holder may wait for, mask/unmask where supported, and acknowledge only its assigned vector, line, or MSI/MSI-X table entry. It may not reprogram arbitrary interrupt controllers or claim another source.
- Handle identity: Each wait, mask, unmask, and acknowledge checks the claimed device id, owner generation, source id, source generation, route generation, and any live waiter generation before affecting delivery state.
- Ownership: Each interrupt source has one delivery owner at a time. Shared legacy lines must be represented as a kernel-demultiplexed object with explicit device membership, not as ambient access to the whole line.
- Range: The capability records the hardware source, vector, trigger mode, polarity, and target CPU/routing state. User-visible operations are checked against that record.
- Revocation: Revocation masks or detaches the source, drains pending notifications for the old holder, invalidates waiters, and prevents stale acknowledgements from affecting a new owner.
- Reset: If the source cannot be detached cleanly, the owning device is reset or disabled before the interrupt is reassigned.
- No MMIO or DMA implication: Interrupt delivery does not grant register access, DMA buffers, or packet memory.
Revocation Ordering
Device revocation must follow a fixed order:
- Stop new submissions by invalidating the driver’s user-visible handles.
- Revoke MMIO write authority by write-blocking or unmapping BAR pages, or by disabling the device before any DMA teardown starts.
- Mask or detach interrupts.
- Quiesce virtqueues or device command queues.
- Reset or disable the device if in-flight DMA cannot be accounted for.
- Remove IOMMU mappings or invalidate bounce-buffer handles.
- Scrub and free DMA pages.
This order prevents a stale driver from racing revocation with doorbell writes, interrupt acknowledgement, or descriptor reuse. Logical handle invalidation is not sufficient while a BAR remains mapped; register-write authority must be removed or the device must be disabled before descriptor or DMA-buffer ownership is reclaimed.
Implementation should represent the order as an explicit device-owner state machine rather than as ad hoc booleans:
#![allow(unused)]
fn main() {
enum DeviceOwnerState {
Active,
RevokingHandles,
MmioRevoked,
InterruptsDetached,
QueuesQuiesced,
Resetting,
DmaMappingsRemoved,
Dead,
}
}
No path may free or reassign DMA pages until the state has reached
QueuesQuiesced with all in-flight descriptors accounted for, or Resetting
has completed and the device can no longer write old buffers. Dead means all
user-visible handles are invalid, interrupts are detached or masked, DMA
mappings are removed, and pages have been scrubbed or transferred to a trusted
owner.
Hard invariants:
- DMA pages cannot be freed before
QueuesQuiescedor a completedResettingtransition proves old DMA writes are stopped. - MMIO write authority must be revoked before DMA ownership teardown.
- Interrupt reassignment cannot happen before old pending notifications are drained or generation-invalidated.
- Device reset is mandatory if in-flight DMA cannot be proven stopped.
Future Userspace-Driver Transition Criteria
Moving NIC or block drivers out of the kernel is gated by Security Verification Track S.11.2. The gate is only open when all rows below are implemented and demonstrated. The S.11.2.N labels are local checklist row IDs for this gate.
The completed Device Driver Foundation selected milestone used this track as
the prerequisite for the DMAPool, accounting, and hostile-smoke sub-gate.
Future DDF follow-ups still use these rows as the userspace-driver transition
gate: generic MSI/MSI-X dispatch and second-device reuse may land first, but
userspace DeviceMmio and Interrupt exposure stays blocked until these rows
pass.
Production DMAPool Ledger Prerequisite
Before userspace NIC or block drivers receive DeviceMmio, Interrupt, or
DMAPool handles, the device manager must own one ledger of record for each
claimed device. That ledger is the authoritative source for every
device-visible hold, not a diagnostic mirror of separate subsystems.
The ledger records:
- DMA pool bytes reserved and live;
- DMA buffer count, slot generation, and owner generation;
- mapped userspace DMA VMAs, quiesce state, scrub state, and release eligibility for each attached DMA pool;
- descriptor and ring depth limits, including live in-flight submissions and completions;
- page-rounded MMIO mappings and their owning
DeviceMmiogenerations; - interrupt holds, waiter generations, and routed-source generations;
- budget and OOM policy for allocation, queue growth, mapping, and interrupt attachment;
- teardown state in the device-owner state machine.
Every operation that creates, consumes, or releases device-visible authority must update this ledger as part of the same ownership transaction that changes device-manager state. That includes DMA buffer allocation/free, descriptor submission, completion accounting, BAR mapping/unmapping, interrupt attach/detach, reset, revoke, process exit, and capability release.
Implementation note, 2026-05-03 13:18 UTC: the QEMU virtio-rng metadata path
now runs a bounded teardown-trigger proof for cap-release, process-exit,
driver-crash, reset-disable, interrupt-waiter, future-devicemmio, and
future-dmapool. Each trigger row sequentially claims and transfers the same
PCI function, begins revocation, walks the existing device-owner state machine
to Dead, releases only after Dead, and proves generation bumps, stale
handle rejection, direct state-skip rejection, pre-Dead release rejection,
and per-trigger coverage without duplicates. The cap-release row attaches a
bounded manager-owned DeviceMmio record to the active driver handle, removes a
DeviceMmioCap from a cap table, runs the CapOpRelease hook, and records
cap-table removal plus detached/stale manager validation before normal
revocation. The process-exit row attaches the same bounded DeviceMmio
record shape to a real proof Process, runs
Process::release_caps_for_exit(), and records cap-table removal plus
detached/stale manager validation before normal revocation. The driver-crash,
reset-disable, and interrupt-waiter rows register and claim bounded PCI
MSI-X lifecycle-probe routes, attach them to the device manager, prove
InterruptsDetached is blocked as interrupts-attached, detach and release
the routes while still in MmioRevoked, and then advance normally.
The future-devicemmio row attaches a bounded manager-owned DeviceMmio
record from the first decoded PCI memory BAR, proves MmioRevoked is blocked
as devicemmio-attached, detaches while still in RevokingHandles, and then
advances normally. The future-dmapool row attaches a bounded zero-live
DMAPool record, proves DmaMappingsRemoved is blocked as
dmapool-attached, detaches while still in Resetting, and then advances
normally. The generic teardown-trigger summary reports no label-only rows
and seven object-backed rows, while the route-aware interrupt handoff smoke
also labels the claimed MSI-X route as bounded interrupt-waiter blocker
evidence: interrupt_waiter_object=interrupt-route-record,
interrupt_waiter_block_state=InterruptsDetached,
interrupt_waiter_block_result=interrupts-attached,
interrupt_waiter_detach_result=ok, and
interrupt_waiter_route_generation_preserved=true. This bounded
route-record evidence is contract proof for the shared ownership transaction
only: it does not expose production userspace authority handles, real MMIO,
real DMA, a userspace waiter, or production crash/reset observers. Separate
first DeviceMmioCap,
InterruptCap, DmaPoolCap, and DmaBufferCap release-hook proofs now
exercise both the production ring CAP_OP_RELEASE dispatch path and a real
Process::release_caps_for_exit() path for those cap objects, validating
cap-table removal plus exact manager-owned DeviceMmio detach,
manager-attached interrupt-route release, bounded zero-live DMAPool detach,
or proof-owned DMA-buffer record cleanup. The generic route-record trigger rows
and remaining DMA production work do not yet implement production observers,
production interrupt-waiter objects, userspace DeviceMmio, production
userspace DMAPool/DMABuffer authority, full device authority, or true
pending hardware MSI/reset-hostile route teardown.
Implementation note, 2026-05-08 10:08 UTC: the first cap-specific
reset/disable trigger entry points now exist for DeviceMmioCap and
InterruptCap. trigger_reset_disable_for_devicemmio and
trigger_reset_disable_for_interrupt route through the same idempotent
stale-safe detach helpers as cap release and driver-crash cleanup, emit one
cap-audit: ... event=reset-disable detach=ok line on the first successful
trigger, and keep stale reruns silent. This is still bounded trigger plumbing:
the reset/disable observer, non-proof DMA cleanup integration, userspace
MMIO/interrupt operations, and IOMMU-backed remapping work remain future
requirements.
Implementation note, 2026-05-08 10:39 UTC: the DMA caps now have the matching
cap-specific reset/disable trigger plumbing. DmaPoolCap::on_reset_disable
uses the same authoritative zero-live/quiesced/scrubbed evidence-gated detach
as cap release and driver-crash cleanup. DmaBufferCap::on_reset_disable
reuses the bounded FreeBuffer authority validation and page-scrub/frame-free
cleanup path, then leaves the parent pool attached until staged zero-live
cleanup. make run-net asserts the dmabuffer-reset-disable-hook and
dmapool-reset-disable-hook proof lines, stale rerun noop, revoked
cap validation, post-trigger release noop, and exact-one
cap-audit: cap={dmabuffer,dmapool} event=reset-disable lines. This is still
proof-owned no-real-DMA cleanup; production userspace DMAPool/DMABuffer
authority and non-proof page lifecycle integration remain future work.
Budget or OOM failure is closed before the driver can observe a new handle, program a descriptor, map MMIO, attach an interrupt, or ring a doorbell. A failed submission must leave no live descriptor hold behind, or must leave an explicit in-flight record that teardown can drain or reset. A completed teardown must reconcile the ledger to zero live DMA buffers, zero live MMIO mappings, zero interrupt holds, and no in-flight descriptor submissions for the released device generation.
Implementation note, 2026-05-02 06:59 UTC, updated 2026-05-11 06:10 UTC: the
current kernel-owned virtio-net ledger now proves the closed budget/OOM cases
above with a scratch ledger and the live ledger validation described earlier.
Imported live device-manager DMAPool records still preserve the
device_dma:virtio-net source policy and prove imported live accounting stays
within its aggregate in-flight budget while preserving that policy’s per-queue
queue/submission depth limits. The manifest-granted manager-owned
bounce-buffer DMAPool path now attaches its own device-manager budget policy
to userspace DMAPool.allocateBuffer handle creation and the current
fixed-slot DMAPool/DMABuffer transfer, release, pending-release, drop,
rollback, teardown-detach, page-release, and descriptor-completion cleanup
paths. The full eight-slot pool fails as dmapool-budget-exceeded /
over-buffer-budget before allocation, cap minting, or ledger mutation, and
the selected release paths revalidate current or next accounting before
advancing manager-owned state. Production userspace DMAPool records must
still attach budget checks to broader provider-driver transfer/revoke/reset
transactions, IOMMU or direct-DMA mapping state, and non-fixed-slot
allocation before this row can be treated as the complete userspace-driver
transition gate.
Implementation note, 2026-05-02 08:33 UTC: the QEMU virtio-rng metadata path
now runs a bounded DMAPool record lifecycle proof on the device-manager
teardown state. The first slice keeps the record zero-live: it records a pool
slot, pool generation, and owner generation, rejects stale and
owner-mismatched attach attempts, rejects duplicate attachment, and proves that
begin_revocation invalidates the user-visible pool handle by bumping the
device owner generation. The ordered teardown path now fails closed with
dmapool-attached if it tries to enter DmaMappingsRemoved while the pool
record remains attached. The current continuation also proves that the revoke
handle cannot detach the zero-live pool without scratch authoritative
zero-live, quiesced, and scrubbed evidence bound to that record’s source; a
mismatched scratch source is rejected before detach. With matching
proof-scoped evidence, the record detaches after queues are quiesced/reset and
before DmaMappingsRemoved. Later bounded manifest grants expose conservative
DMAPool, DeviceMmio, and Interrupt surfaces; the current DMAPool grant
can mint only eight fixed manager-attached proof DMABuffer result caps. The
remaining gap is production userspace authority, allocation beyond those eight
fixed slots, real device-visible page allocation through the device manager,
non-proof DMA page lifecycle integration, IOMMU remapping, and the S.11.2
hostile smoke matrix.
Current QEMU evidence: the QEMU virtio-net path now adds
the corresponding imported live-accounting prerequisite proof. A
device-manager DMAPool record is attached with accounting derived from the
live device_dma ledger: live buffer/page count, live bytes, current in-flight
submissions, committed/resident/unswappable residency flags, and
scrub-before-release policy. DmaMappingsRemoved fails closed with
dmapool-attached while the record remains attached, direct teardown detach
fails closed with dmapool-live while the authoritative ledger remains live,
and the live proof consumes the device_dma teardown-evidence API, observes
authoritative-ledger-live with matching imported live accounting, and
explicitly defers completion with no real DMA teardown attempted. The same
proof path now validates the imported DMAPool record through
capos-lib::device_authority for the active handle and stale-after-revoke
failure labels.
This does not create production userspace handles, real page-release hooks,
IOMMU mapping invalidation, scrubbed release, terminal Dead, or
hostile-smoke coverage for the live virtio-net record. The companion
scratch-ledger proof covers the positive zero-live teardown-evidence result
without claiming that the live virtio-net record has been torn down. The
manager-owned zero-live lifecycle proof consumes matching-source
device-manager teardown evidence for the positive detach/DmaMappingsRemoved
path and separately proves mismatched-source and missing-evidence detach
attempts fail closed. The manifest-granted bounded DMAPool path now keeps
mapped userspace VMA count, in-flight descriptor holds, residency,
quiesce/scrub state, and release eligibility in that manager record. Borrowed
or device-visible pages remain committed, resident, unswappable,
generation-bound, and unavailable for reuse until the manager record is
zero-live, unmapped, quiesced, and scrubbed. Descriptor submission is refused
while a buffer is borrowed to userspace, and release consumes manager-owned
teardown evidence instead of proof-only device_dma zero-live evidence.
The corresponding DMAPool.info ABI reports mapped VMA count, quiesce state,
scrub state, and release eligibility for QEMU proof assertions. This is still
bounded bounce-buffer lifecycle authority only: direct DMA, host physical or
IOVA exposure, IOMMU/remapping, production provider-driver consumption,
durable audit, and broader transfer/revoke policy remain future work.
| Gate item | Required state | Must-have proof |
|---|---|---|
| S.11.2.0 DMA-owned buffers | DMAPool owns every driver-visible DMA mapping. | A driver receives opaque buffer handles or IOVA-only values; no path hands out raw host physical addresses. |
| S.11.2.1 Bound checks | Allocation, descriptor chain length, alignment, segment length, and ring depth are bounded and constant-time validated before ring submission. | Ring submissions fail closed on overflow, wrap, stale-handle, and freed-handle reuse attempts. |
| S.11.2.2 Explicit remap/ownership | DeviceMmio can only grant claimed BAR pages; cache attributes and write policy are enforced. | Driver cannot access unclaimed BARs, ROM, RAM pages, config-space globals, or stale mappings after revoke. |
| S.11.2.3 Interrupt correctness | Interrupt owns exactly one logical source at a time and drains/waits only for that source. | Reassigning an owner invalidates old waiters and masks or detaches the source first. |
| S.11.2.4 Quiesce + reset contract | Device manager can force reset/disable on failed revoke or teardown. | No in-flight descriptor may continue touching freed buffers after driver removal. |
| S.11.2.5 Process lifecycle | Capability release, process exit, and process-spawn cleanup paths cannot leak DMA pages/MMIO/intr ownership. | Crash-path teardown removes holds and invalidates user-visible handles before page free. |
| S.11.2.6 Isolation and accounting | Security Verification Track S.9 quota and authority ledgers include DMA, MMIO, and interrupt hold edges. | A malicious or buggy driver cannot consume more than its allocated authority budget. |
| S.11.2.7 Stale IRQ ordering | Stale interrupt delivery after revoke cannot wake, acknowledge, or signal a new owner. | Interrupt generation mismatch is ignored, or the source is masked/detached/reset before reassignment. Hostile smoke revokes a driver while an interrupt is pending, reassigns the source, and proves the old waiter cannot wake against the new owner. Closed 2026-05-05 18:17 UTC by make run-net’s device-manager: interrupt handoff proof line: real INT $vector injection across revoke, detach, and reset/reuse exercises the production IDT entry/handler/EOI path, asserts s11_2_7_real_irq_injected_across_reset=ok, s11_2_7_old_waiter_cannot_wake_new_owner=true, and s11_2_7_stale_ack_blocked=true, and is enforced by tools/qemu-net-smoke.sh. Userspace Interrupt waiter objects remain a future requirement for a full production-driver path. |
| S.11.2.8 Stale DMA completion ordering | Stale DMA completion after revoke cannot cause freed buffer reuse, stale CQ notification, or new-owner memory exposure. | Closed 2026-05-05 19:37 UTC by make run-net’s device-manager: dma completion handoff proof line: real virtio-net DMA page free + reallocate cycle bumps the live page generation, then the production device_dma::record_virtio_net_completion_for_allocation path (the same function the live Virtqueue::record_used_completion_for_allocation invokes) is fed a stale DeviceDmaAllocation keyed to the live phys with a decremented generation, at three boundaries (after revoke, after detach, after reset/reuse). All three reject as stale-dma-handle with side-effect-blocked, queue accounting unchanged, live new-owner page preserved, no CQ publication, no new-owner exposure, and the freed-buffer slot remaining unchanged. The closure summary asserts s11_2_8_real_completion_injected_across_reset=ok, s11_2_8_old_completion_cannot_publish_to_new_owner=true, s11_2_8_freed_buffer_reuse_blocked=true, and s11_2_8_accounting_underflow_blocked=true, and is enforced by tools/qemu-net-smoke.sh. Prior acceptance text: in-flight DMA is accounted for, or device reset/disable completes before buffer reuse; hostile smoke covers revoke/reset with outstanding descriptors and proves no old completion can publish new-owner memory. S.11.2.9 hostile-smoke gate-wiring also closed 2026-05-05 20:49 UTC (see the row below). Userspace DMAPool handles and real device-manager page quiesce/scrub/release hooks remain open as separate follow-ups. |
| S.11.2.9 Hostile-smoke coverage | QEMU/CI smokes cover stale handles, descriptor abuse, revoke races, stale IRQ after reset, stale DMA completion after reset, and exit-under-dma. | Smoke output has explicit closed-case proof lines for each above failure mode. Closed 2026-05-05 20:49 UTC by aggregating the existing per-row proof lines into the make run-net -> tools/qemu-net-smoke.sh gate. Every matrix-row proof line has at least one assertion in the harness; the original two driver-crash assertions, the existing S.11.2.8 device-manager: dma completion handoff proof closure-summary assertion, and the S.11.2.7 device-manager: interrupt handoff proof closure-summary assertion (whose trailing anchor was added by this slice for harness-strictness consistency with S.11.2.8) all use the anchored extended-regex shape (field-by-field match plus proof_result=ok[[:cntrl:]]?$ trailing anchor), and the other matrix-row rows reuse the harness’s pre-existing mix of unanchored extended-regex and fixed-string grep -Fq assertions. A 2026-05-08 09:44 UTC follow-up adds anchored assertions for the cap-specific dmabuffer-driver-crash-hook and dmapool-driver-crash-hook proof lines; a 2026-05-08 10:08 UTC follow-up adds anchored assertions and exact-one audit counts for the first cap-specific devicemmio-reset-disable-hook and interrupt-reset-disable-hook proof lines; a 2026-05-08 10:39 UTC follow-up does the same for dmabuffer-reset-disable-hook and dmapool-reset-disable-hook; a 2026-05-08 13:42 UTC follow-up (aeef8b41) adds the cap-specific device-manager: interrupt waiter hook proof source=interrupt-waiter-hook ... trigger_path=trigger-interrupt-waiter-for-interrupt assertion plus an exact-one cap-audit: cap=interrupt event=interrupt-waiter count. Per-row coverage: stale DMA handle (device-dma: stale dma handle proof, device-dma: live stale dma completion accounting proof); descriptor abuse (virtio-net: software descriptor generation model proof, virtio-net: invalid used descriptor id software-token proof, virtio-net: descriptor generation guard proof ok, virtio-net: invalid used descriptor id live software-token proof ok, plus device-dma: budget oom proof); revoke/reset race (device-manager: ownership proof, the seven device-manager: teardown trigger proof trigger=... variants plus the final aggregate, device-manager: dma completion handoff proof for S.11.2.8, device-manager: interrupt handoff proof for S.11.2.7, the device-manager: devicemmio driver crash hook proof source=devicemmio-driver-crash-hook ... trigger_path=trigger-driver-crash-for-devicemmio, device-manager: interrupt driver crash hook proof source=interrupt-driver-crash-hook ... trigger_path=trigger-driver-crash-for-interrupt, device-manager: dmabuffer driver crash hook proof source=dmabuffer-driver-crash-hook ... trigger_path=trigger-driver-crash-for-dmabuffer, device-manager: dmapool driver crash hook proof source=dmapool-driver-crash-hook ... trigger_path=trigger-driver-crash-for-dmapool, device-manager: devicemmio reset disable hook proof source=devicemmio-reset-disable-hook ... trigger_path=trigger-reset-disable-for-devicemmio, device-manager: interrupt reset disable hook proof source=interrupt-reset-disable-hook ... trigger_path=trigger-reset-disable-for-interrupt, device-manager: dmabuffer reset disable hook proof source=dmabuffer-reset-disable-hook ... trigger_path=trigger-reset-disable-for-dmabuffer, device-manager: dmapool reset disable hook proof source=dmapool-reset-disable-hook ... trigger_path=trigger-reset-disable-for-dmapool, and device-manager: interrupt waiter hook proof source=interrupt-waiter-hook ... trigger_path=trigger-interrupt-waiter-for-interrupt lines, all requiring first-trigger ok, stale rerun noop, cap validate_live=revoked, post-trigger release noop, and proof_result=ok with cap-specific cleanup/evidence labels); stale IRQ after reset (S.11.2.7 closure summary, see row above); stale DMA completion after reset (S.11.2.8 closure summary, see row above); exit-under-DMA (device-manager: teardown trigger proof trigger=process-exit owner=virtio-rng, the teardown-trigger aggregate triggers=cap-release,process-exit,driver-crash,reset-disable,interrupt-waiter,future-devicemmio,future-dmapool line, the four cap-release-hook proofs each containing process_exit_path=process-release-caps-for-exit, plus hardware-cap-release: ... reason=process-exit count assertions). A 2026-05-23 21:34 UTC follow-up adds the IOMMU production DMAPool hostile proof over the active mapped ledger, covering stale IOVA after revoke/reset, descriptor abuse, revoke/reset race ordering, stale completion after reset, teardown-under-DMA ordering, cross-domain stale-handle attempts, and the fail-closed teardown branch proof; process-exit/exit-under-DMA remains the existing run-net bounce-buffer evidence. Production userspace DeviceMmio/Interrupt handles, broader non-proof device-manager page quiesce/scrub/release hooks outside the selected IOMMU smoke, hardware-backed provider-driver Interrupt wait/ack dispatch beyond the bounded route-dispatch waiter proof, and durable/signed production audit consumption beyond the first volatile HardwareAuditLog.snapshot cap remain open as separate follow-ups. |
For each row, the transition requires an owner, implementation notes, and a CI-backed
verification path. Until all rows pass, Phase 4.2 NIC/block drivers remain in-kernel for
functionality, and only kernel-mapped bounce-buffer mode is allowed for prototype DMA.
Hostile-Smoke Acceptance Matrix
These smokes are the acceptance requirements for the userspace driver
transition. The S.11.2.7, S.11.2.8, and S.11.2.9 rows are now backed by
current make run-net QEMU evidence enforced by tools/qemu-net-smoke.sh
(see the per-row “Closed” notes for closure timestamps and the proof-line
shapes). The other matrix rows remain acceptance requirements for future
implementation work; their proof lines are emitted by the kernel today
and asserted by the same harness, but the production userspace handles,
real device-manager page quiesce/scrub/release hooks, real userspace
Interrupt waiter objects, IOMMU domain programming, and durable/signed
production audit consumption beyond the volatile HardwareAuditLog.snapshot
cap that complete each row’s full closure remain open as separate
follow-ups.
| Hostile case | Required setup | Closed-case proof expectation |
|---|---|---|
| Stale DMA handle | Allocate a DMA buffer, revoke or free it, advance the slot or pool generation, then attempt descriptor submission or buffer reuse through the old handle. | The operation fails closed on generation mismatch; no descriptor is made visible to the device, no DMA byte or buffer hold is restored, and any reused slot remains owned only by the new generation. |
| Descriptor abuse | Submit chains with out-of-pool addresses, stale or freed buffer slots, arithmetic wrap, misalignment, overlong segments, excessive chain length, or ring-depth overflow. | Validation rejects the chain before any doorbell write; the ledger shows no leaked descriptor hold, no in-flight increment without an owning buffer, and no access outside the pool range. |
| Revoke/reset race | Race revoke, reset, or process teardown against a driver that is submitting descriptors or ringing the device doorbell. | Revocation first invalidates handles and MMIO write authority; later submissions fail closed, existing in-flight records are either completed under the old generation or reset/disabled before page reuse, and teardown cannot skip to DmaMappingsRemoved while the ledger has live submissions. |
| Stale IRQ after reset | Leave an interrupt pending or a waiter blocked, reset or reassign the device/source, then deliver or acknowledge using the old generation. | The old waiter cannot wake against the new owner, stale acknowledgements do not affect the reassigned source, and the source is masked, detached, or generation-invalidated before reassignment. Closed 2026-05-05 18:17 UTC: make run-net injects a real INT $vector through the IDT/handler/EOI path at three points across revoke, detach, and reset/reuse and records s11_2_7_real_irq_injected_across_reset=ok, s11_2_7_old_waiter_cannot_wake_new_owner=true, s11_2_7_stale_ack_blocked=true, plus matching real_irq_inject_after_revoke_result=masked, real_irq_inject_after_detach_result=unregistered, real_irq_inject_after_reset_reuse_result=masked on the kernel proof line. |
| Stale DMA completion after reset | Reset with outstanding descriptors, reuse or prepare to reuse pool slots, then inject or observe a completion from the old device generation. | The stale completion cannot publish a CQE to a new owner, cannot expose new-owner memory, cannot underflow accounting, and cannot make a freed buffer eligible for reuse unless reset/disable has proven old DMA stopped. Closed 2026-05-05 19:37 UTC: make run-net walks a fresh device-manager record on the virtio-net BDF through the Active>RevokingHandles>MmioRevoked>InterruptsDetached>QueuesQuiesced>Resetting>DmaMappingsRemoved>Dead revocation path, exercises a real virtio-net DMA page free + reallocate cycle at three boundaries (after revoke, after detach, after reset/reuse), and feeds a synthesized stale DeviceDmaAllocation (live phys, decremented generation) to the production device_dma::record_virtio_net_completion_for_allocation path. Each boundary records real_completion_inject_after_*_result=stale-dma-handle, _side_effect=side-effect-blocked, _queue_account_preserved=true, _live_page_preserved=true, _cq_publication_blocked=true, _new_owner_exposure_blocked=true, _freed_buffer_unchanged=true, and _generation_bumped=true, plus a closure summary s11_2_8_real_completion_injected_across_reset=ok, s11_2_8_old_completion_cannot_publish_to_new_owner=true, s11_2_8_freed_buffer_reuse_blocked=true, s11_2_8_accounting_underflow_blocked=true. |
| Exit-under-DMA | Terminate or crash a driver process while it holds DMA buffers, MMIO mappings, interrupt waiters, and in-flight descriptors. | Process exit enters the device-manager teardown path, invalidates all user-visible handles, revokes MMIO, detaches interrupts, quiesces or resets queues, scrubs DMA pages before release, and reports a terminal ledger with no live holds for the old owner generation. |
Security Verification Track S.11.2 Decision Record
Security Verification Track S.11.2 is backend-scoped. The current brokered-bounce userspace-provider path has enough reviewed evidence to close the retained DDF production-authority finding, but that closeout is not a general direct-DMA, hostile-hardware, or device-autonomous interrupt claim.
Current status: the brokered-bounce transition path is represented by done task
evidence for DMAPool, DeviceMmio, and Interrupt lifecycle ownership,
provider virtio-net/NVMe chains, and hardware-audit consumption of abort-held
DMA mappings. The broader S.11.2 matrix remains the canonical gate for future
direct-remapping/vIOMMU, trusted-sharing-group, hostile-hardware-isolation, or
provider-written-address work. This document fixes the production handle epoch
invariants, DMAPool ledger of record, and hostile-smoke acceptance criteria
used by the completed Device Driver Foundation documentation gate. The
current QEMU virtio-net path has a kernel-owned DMA pool ledger for page,
descriptor, MMIO mapping, and interrupt-hold accounting proof coverage plus
static IOMMU attachment-policy reporting for retained DMA-capable PCI functions
and the bounded teardown trigger contract proof, bounded kernel-owned
budget/OOM proof, manager-bound DMAPool budget-profile proof plus bounded
budget-policy tamper and accounting-over-budget fail-closed proofs,
bounded manager-owned DeviceMmio proof adapter bound to decoded PCI
memory-BAR metadata plus future cache/write-policy metadata, bounded zero-live
device-manager DMAPool record lifecycle proof, and imported live-accounting
block/defer proof plus zero-live teardown-evidence scratch proof, stale DMA
handle scratch proof, stale DMA completion scratch proof, paired scratch
CQ-publication/new-owner-exposure proof, live software descriptor-generation
guard proof, bounded invalid used-descriptor-id proof, and bounded stale IRQ
after-detach, counter-backed after-revoke, counter-backed route-registry
reset-reuse, and pending IRQ token checks described above. The
bounded pure capos-lib::device_authority
validator and host tests cover the documented identity, state,
side-effect-blocking, non-wrapping epoch cases, and every current operation
variant’s exact blocked side-effect label for stale owner/subrecord, freed,
revoked, and retired failures. The zero-live
device-manager DMAPool lifecycle proof now validates a proof-scoped
tampered budget-policy record through the manager policy helper and records
fail-closed, no fake allocation, no ledger mutation, no teardown advancement,
and side-effect blocking while preserving the positive
budget_policy_result=ok path. The positive zero-live and imported-live
budget-accounting labels now go through the manager-owned active-record helper,
and synthetic over-budget attached-accounting candidates fail closed with exact
reasons while preserving the active manager record and blocking allocation,
ledger, teardown, and side effects; an over-budget attach candidate fails
before pool generation allocation. It also records a bounded
manager-attached DMA buffer handle under the attached pool, validates active
SubmitDescriptor and manager-record CompleteDescriptor through the pure
DMA-buffer validator, and records stale-after-revoke, freed-buffer, and
reused-slot rejection with exact reasons and side-effect-blocked; it now
also blocks pool teardown as
dmapool-buffer-attached, rejects a stale same-slot proof-scoped FreeBuffer
as dmabuffer-stale-handle with stale-slot-generation and
side-effect-blocked, rejects wrong-owner-generation, wrong-pool, wrong-pool
generation, and wrong-buffer-slot FreeBuffer attempts with exact pure
validator reasons and side-effect-blocked, preserves that manager-owned
buffer record after each failed free, and clears the record only after a
proof-scoped active FreeBuffer validation, proof-page scrub/free, and
manager-owned buffer-record detach. The completion proof does not publish a CQ
entry, complete a real descriptor, grant userspace authority, or clean up or
reuse production userspace DMA pages. The live virtio-net queue-completion
path now gates completion accounting on the completed descriptor’s
DeviceDmaAllocation rather than the queue id alone: callers must validate
the used descriptor id, recover the matching DmaPage, and pass its physical
address, queue, label, and generation to the kernel-owned ledger before
in-flight accounting is decremented. The paired run-net proof records that a
stale generation for a live kernel-owned page fails as stale-dma-handle,
leaves queue accounting and the live page unchanged, and blocks CQ publication
plus new-owner exposure. This closes a live accounting prerequisite only; it
does not inject a real post-reset device completion or expose userspace DMA
authority. The live virtio-net used-ring path also carries bounded software
descriptor generations: submissions reject invalid or already-active descriptor
ids before accounting, completions must consume the active software token
exactly once, and the run-net proof records side-effect blocking for active
reuse, double completion, and an old software token after descriptor-id reuse.
That guard does not make a stale hardware used-ring id distinguishable after
deliberate id reuse because virtio used entries carry no device generation. The
same gate now also covers invalid used-descriptor ids without corrupting the
hardware ring: an out-of-range id fails as descriptor-id-out-of-range before
completion observation, completion accounting, used_seen_idx, CQ publication,
or new-owner exposure can change. This is still a software-token and
constructed-token prerequisite, not a real malformed-device or post-reset
completion injection. The
same zero-live proof now also constructs the result-only
DMAPool.info cap skeleton from the manager-issued DmaPoolHandle, validates
the active manager record before returning conservative status labels plus
numeric device/BDF/owner/pool identity fields, proves the serialized cap call
path decodes to those labels and identity fields with host physical exposure
off and direct DMA blocked, and proves the cap’s info path fails closed as
dmapool-stale-handle after revoke begins. It also exercises
DMAPool.allocateBuffer through call_with_table() on a real cap-table entry,
returns zero-indexed DMABuffer result caps for eight fixed manager-owned
bounce-buffer slots, validates those result caps’ DMABuffer.info, and proves
a ninth allocation fails through the manager-owned budget policy as
dmapool-budget-exceeded / over-buffer-budget before publishing another
result cap or corrupting live slot state; full-pool allocation also preserves
manager generation counters. Stale-after-revoke allocations still fail closed
without publishing another result cap. The same zero-live proof
constructs the result-only
DMABuffer.info cap skeleton from the manager-attached DmaBufferHandle,
validates the active manager-owned buffer record through the pure DMA-buffer
validator before returning conservative no-authority labels plus numeric
device/BDF, owner/pool/slot identity fields, proves the serialized cap call
path decodes to those labels and identity fields with host physical exposure
off and direct DMA blocked, and proves the cap’s info path fails closed as
dmabuffer-stale-handle after revoke begins; the same stale cap’s serialized
method-0 path fails as invoke-failed. The first DmaBufferCap release hook
now reuses the bounded FreeBuffer validation shape to clear only the
manager-attached proof_buffer record during cap-table removal, production
ring CAP_OP_RELEASE, and real Process::release_caps_for_exit() paths. It
proves stale same-slot release is side-effect-blocked, proves the parent
DMAPool remains attached after buffer release, proves the bounded manifest
grant can allocate the slot again after explicit freeBuffer with a fresh slot
generation, and still requires staged zero-live evidence before the parent pool
can detach. The selected provider-TX path now adds a bounded exception to the
default manager-accounting descriptor contract: queue 1 submits may publish
the selected eight-entry TX queue depth, descriptors 0..7, into the existing
kernel-owned virtio-net TX ring before the first completion, ring one selected
notify doorbell per accepted provider descriptor, and then complete each
descriptor only after DMABuffer.completeDescriptor observes the matching
used-ring entry for the stored software descriptor generation. Those handoffs
clear the matching manager in-flight records, record bounded provider CQ
completion and acknowledgement counts, and can deliver ordered bounded
completion events to live tx_interrupt.wait calls for the same selected
route. The selected provider-TX path also proves a teardown-only drain when one
descriptor has completed and seven provider-published descriptors remain
incomplete: direct DMABuffer.freeBuffer remains blocked while in flight,
release explicitly drains only the incomplete matching used-ring entries and
retires those allocation-backed TX DMA queue ledgers without
DMABuffer.completeDescriptor results, no provider CQ/IRQ event is published
for the quiesced descriptors, release retires seven delivered-but-unacked
completion events, and later slot reuse requires a fresh generation plus normal
completion. Wrong-queue, stale-buffer, stale-notify, inflight-publication,
wrong-descriptor, duplicate-completion, and stale-tx_interrupt issue paths
remain side-effect-blocked before their guarded effects. This does not grant
direct DMA, arbitrary doorbells, arbitrary CQ ownership outside the selected TX
route, full virtio-net ownership, production NIC/storage migration, IOMMU
programming, hardware IRQ ownership, hardware acknowledgement, or broad
interrupt ownership beyond the bounded selected TX MSI-X mask/unmask proof.
The bounded DeviceMmio proof also records the manager-attached policy
metadata listed above, fails closed on a tampered cache/write-policy record
before creating any mapping, and validates active hostile handle identities for
wrong owner generation, wrong mapping generation, wrong mapping id, wrong BAR,
and wrong BDF/device with exact pure-validator reasons while preserving the
attached record and blocking mapping/doorbell side effects. Its serialized cap
call path also decodes to the direct DeviceMmio.info no-authority labels plus
numeric device/BDF, owner, BAR, mapping id, and mapping generation identity
fields with host physical exposure off and direct MMIO blocked, and its stale
serialized method-0 path fails as invoke-failed. The DMAPool.info skeleton
has the same kernel-side serialized stale failure evidence. The
interrupt handoff proof now also constructs a result-only Interrupt.info
cap skeleton from the manager-issued device handle and attached route record,
records active info success, proves the serialized cap call path decodes to the
direct no-authority labels plus numeric device/BDF, owner, source, source
generation, and route generation identity fields, proves those source and
route generations are distinct in the bounded route record, and proves
stale-after-revoke info fails closed as
interrupt-stale-handle plus stale serialized method-0 failure as
invoke-failed before any acknowledgement, mask, unmask, blocking wait, or
delivery authority exists. The manifest-granted skeleton now also exposes an
admission-only Interrupt.wait method that returns the pending-token
validator’s fail-closed labels without waking a waiter or changing delivery
counts, and an admission-only Interrupt.acknowledge method that validates the
active route while blocking hardware acknowledgement and preserving delivery
counts. It also exposes route-state-control Interrupt.mask and
Interrupt.unmask methods that validate the active route before changing the
manager-attached dispatch slot between claimed-masked and
driver-unmasked, while preserving delivery counts. A bounded
Interrupt.wait call observed after unmask installs a fixed-table userspace
waiter object for the current manager-granted route; the existing
route-dispatch delivery counter can now complete that waiter as
interrupt-delivered / waiter-completed-irq with
real_interrupt_delivery=delivered and an advanced delivery count. The same
focused smoke then submits a second unmasked wait, observes it remains pending,
calls Interrupt.mask, and finishes that wait as
interrupt-waiter-cancelled / route-masked /
waiter-completed-no-irq with wake_blocked=false, preserved source/route
generations, and unchanged delivery counts. The selected provider TX
tx_interrupt cap can now observe the bounded used-ring completion event
described above and account the already observed selected TX dispatch token
paired with that delivered provider CQ event, but hardware MSI/MSI-X programming
beyond the selected vector-control proof, full hardware IRQ ownership, deferred
EOI, LAPIC/MSI-X acknowledgement, and broader production interrupt dispatch
remain blocked. Provider TX MSI-X mask/unmask is limited to the selected-route
vector-control proof described earlier. Provider RX MSI-X mask/unmask remains
bounded to the selected RX route as well; release while masked restores that
selected vector-control bit and route state before clearing the live issue gate.
RX unmask admits the route transition before exposing the MSI-X vector-control
bit, and the focused QEMU proof shows a failed route unmask leaves the selected
vector masked with the route ledger preserved. Cleanup failure still leaves the
issue uncleared so future RX cap issuance stays blocked on uncertain route
state. RX wait/ack is now bounded to one selected-route zero-CQ dispatch token;
RX descriptors and CQ ownership remain blocked.
This is manager-record skeleton/no-production-DMA, no-real-MMIO-mapping, and
bounded route-dispatch interrupt-waiter prerequisite evidence only. Production
DMAPool, DeviceMmio, and Interrupt capability handles, production
userspace DMAPool buffer handles, real DeviceMmio BAR mapping objects,
real cache attributes/write policy enforcement, production kernel device-path
wiring beyond the current proof adapters, real device-manager page
quiesce/scrub/release hooks and real page cleanup/reuse beyond the bounded
kernel-owned proof pages, production
handle-attached budget/OOM enforcement beyond the current manager-owned
DMAPool.allocateBuffer budget slice, IOMMU remapping domains, production
handle-attached host tests, QEMU stale-handle smokes, broader userspace
exposure, production NIC/storage migration, cloud readiness, and S.11.2
hostile smokes remain open.
Do not weaken the short-term virtio-net bounce-buffer path until DMAPool,
DeviceMmio, Interrupt, device-manager ownership transactions, lifecycle
teardown, accounting, and hostile smokes all exist.