# DMA User-Space Driver Isolation

This note records the DMA-addressing and isolation constraints that capOS must
respect when planning user-space storage and NIC drivers. It is deliberately
about authority boundaries, not about a particular NVMe or virtio implementation.

## Address Spaces And Trust Boundaries

A DMA-capable device does not use a process virtual address. It consumes a
device-visible address carried in descriptors, queue-base registers, PRP/SGL
entries, or an equivalent protocol field.

On a bare-metal host with an IOMMU:

```text
user VA --CPU MMU--> host physical address
device IOVA --IOMMU--> host physical address
```

On a guest VM:

```text
guest user VA --guest MMU--> guest physical address --EPT/NPT--> host physical address
```

With a virtual or assigned IOMMU, a guest can additionally reason about:

```text
guest device IOVA --vIOMMU or paravirt grant layer--> guest physical address
```

The host still owns the real host IOMMU or equivalent hypervisor translation.
A guest-programmable vIOMMU is useful because it gives the guest kernel a
guest-internal DMA authority boundary; it is not direct control of the host
IOMMU.
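
The translation chains above can be modeled as a minimal sketch in which each
address space has its own type, so a guest IOVA cannot be confused with a guest
physical or host physical address. All names here (`GuestIova`, `GuestPhys`,
`HostPhys`, `PageTable`, `device_access`) are hypothetical illustrations and
not capOS APIs, and a `BTreeMap` stands in for real page-table hardware.

```rust
use std::collections::BTreeMap;

pub const PAGE_SIZE: u64 = 4096;

// Hypothetical newtypes: one per address space, so they cannot be mixed up.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct GuestIova(pub u64);
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct GuestPhys(pub u64);
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct HostPhys(pub u64);

/// Toy page-granular translation table (page base -> page base).
#[derive(Default)]
pub struct PageTable {
    pages: BTreeMap<u64, u64>,
}

impl PageTable {
    /// Install a single page mapping; both addresses must be page aligned.
    pub fn map(&mut self, from_page: u64, to_page: u64) {
        assert!(from_page % PAGE_SIZE == 0 && to_page % PAGE_SIZE == 0);
        self.pages.insert(from_page, to_page);
    }

    /// Translate an address, preserving the offset within the page.
    fn lookup(&self, addr: u64) -> Option<u64> {
        let offset = addr % PAGE_SIZE;
        self.pages.get(&(addr - offset)).map(|base| base + offset)
    }
}

/// Guest-controlled stage: guest device IOVA -> guest physical address.
pub fn viommu_translate(viommu: &PageTable, iova: GuestIova) -> Option<GuestPhys> {
    viommu.lookup(iova.0).map(GuestPhys)
}

/// Host-controlled stage: guest physical address -> host physical address.
pub fn ept_translate(ept: &PageTable, gpa: GuestPhys) -> Option<HostPhys> {
    ept.lookup(gpa.0).map(HostPhys)
}

/// A device access succeeds only if both stages have a mapping.
pub fn device_access(viommu: &PageTable, ept: &PageTable, iova: GuestIova) -> Option<HostPhys> {
    viommu_translate(viommu, iova).and_then(|gpa| ept_translate(ept, gpa))
}
```

In this model the guest kernel only ever edits the first table. The second
table belongs to the host, which matches the point that a vIOMMU is a
guest-internal boundary rather than control of the host IOMMU.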

## Host User-Space Driver Pattern

A safe host user-space driver resembles the VFIO/IOMMUFD split:

- The kernel owns PCI discovery, BAR assignment, PCI configuration mediation,
  IOMMU domain creation, DMA map/unmap, page pinning, interrupt or MSI-X
  routing, reset, hotplug, and revocation.
- The user-space driver owns protocol logic: queue formats, descriptor
  contents, device-specific register sequencing, doorbells, polling, completion
  handling, and command construction.
- The driver may receive a domain-scoped IOVA for a live buffer only when the
  kernel has installed and can revoke the IOMMU mapping for that device.
- The driver must not receive unrestricted host physical addresses.
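
The split above can be sketched as a kernel-side mapping ledger that hands the
driver a domain-scoped IOVA and keeps the host physical address to itself. This
is an illustration under stated assumptions, not the VFIO or IOMMUFD API:
`DmaDomain`, `DomainIova`, `DomainId`, and `DmaError` are hypothetical names,
the IOVA allocator is a simple bump pointer, and a `HashMap` stands in for the
real IOMMU page tables.

```rust
use std::collections::HashMap;

const PAGE_SIZE: u64 = 4096;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct DomainId(pub u32);

/// The only address the driver ever sees; meaningful only inside `domain`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DomainIova {
    pub domain: DomainId,
    pub iova: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DmaError {
    Revoked,
    NotMapped,
}

/// Hypothetical kernel-owned domain: it installs and revokes every mapping.
pub struct DmaDomain {
    id: DomainId,
    next_iova: u64,
    // iova -> (host physical base, length). Host physical stays kernel-side.
    mappings: HashMap<u64, (u64, u64)>,
    revoked: bool,
}

impl DmaDomain {
    pub fn new(id: DomainId) -> Self {
        DmaDomain { id, next_iova: 0x1000_0000, mappings: HashMap::new(), revoked: false }
    }

    /// Kernel path: map a pinned buffer and return a domain-scoped handle.
    pub fn map(&mut self, host_phys: u64, len: u64) -> Result<DomainIova, DmaError> {
        if self.revoked {
            return Err(DmaError::Revoked);
        }
        let iova = self.next_iova;
        // Reserve whole pages so consecutive mappings never overlap.
        let pages = (len.max(1) + PAGE_SIZE - 1) / PAGE_SIZE;
        self.next_iova += pages * PAGE_SIZE;
        self.mappings.insert(iova, (host_phys, len));
        Ok(DomainIova { domain: self.id, iova })
    }

    /// Remove one mapping; a handle from another domain is rejected.
    pub fn unmap(&mut self, handle: DomainIova) -> Result<(), DmaError> {
        if handle.domain != self.id {
            return Err(DmaError::NotMapped);
        }
        self.mappings.remove(&handle.iova).map(|_| ()).ok_or(DmaError::NotMapped)
    }

    /// Revocation: drop every mapping and refuse new ones.
    pub fn revoke(&mut self) {
        self.mappings.clear();
        self.revoked = true;
    }

    pub fn is_live(&self, handle: DomainIova) -> bool {
        !self.revoked && handle.domain == self.id && self.mappings.contains_key(&handle.iova)
    }
}
```

The driver-facing type carries no host physical address at all, so the
"domain-scoped IOVA only" rule is enforced by what the handle can express
rather than by driver discipline.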

UIO-style "map a BAR and deliver interrupts" is not a complete security model
for a DMA-capable PCI device. If a user-space process can program a DMA engine
through MMIO, then DMA isolation requires either an IOMMU domain or a stricter
broker that prevents raw device-address publication.

## Guest Microkernel Pattern

Host isolation and guest isolation are different claims.

For an assigned PCI device or SR-IOV VF without a guest-visible IOMMU, the host
can still protect itself by mapping the device only to the VM's memory. That
does not protect the guest kernel from an untrusted guest user-space driver:
from the guest's perspective the device can still DMA to arbitrary guest
physical pages.

Virtual devices have the same guest-internal issue in a different form. If an
untrusted driver can put arbitrary guest physical addresses into virtqueue
descriptors, the host backend can write into guest kernel memory while still
staying inside the VM boundary. The host remains protected; the guest kernel is
not.

A guest microkernel that wants untrusted user-space drivers therefore needs one
of these guest-visible authorization layers:

- a vIOMMU or virtio-iommu path where the guest kernel controls guest IOVA to
  guest physical mappings;
- a paravirtual grant-table model where descriptors carry grant identifiers
  instead of raw guest physical addresses;
- a trusted mediation service that owns descriptor/device-address fields and
  lets the untrusted driver submit only typed commands, buffer capabilities, or
  opaque handles.

The invariant is:

```text
Never let an untrusted guest driver provide a raw guest physical address to a
device or backend unless a guest-visible DMA authorization layer validates it.
```
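
One way to picture this invariant is the grant-table option: the untrusted
driver names a grant and an offset, and a trusted layer resolves that into the
address the device or backend will see. The sketch below is a simplified model
under assumptions, not a real virtio or Xen interface; `GrantTable`,
`DriverDescriptor`, `DeviceDescriptor`, and `GrantError` are hypothetical names.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct GrantId(pub u32);

/// Kernel-side record of what one grant authorizes.
struct Grant {
    guest_phys: u64,
    len: u32,
    device_may_write: bool,
}

/// What the untrusted driver submits: no raw guest physical address.
pub struct DriverDescriptor {
    pub grant: GrantId,
    pub offset: u32,
    pub len: u32,
    pub device_writes: bool,
}

/// What the trusted layer publishes to the device or backend.
#[derive(Debug, PartialEq, Eq)]
pub struct DeviceDescriptor {
    pub addr: u64,
    pub len: u32,
    pub device_writes: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum GrantError {
    UnknownGrant,
    OutOfBounds,
    WriteNotPermitted,
}

#[derive(Default)]
pub struct GrantTable {
    grants: HashMap<GrantId, Grant>,
}

impl GrantTable {
    /// Kernel path: record which guest physical range a grant covers.
    pub fn insert(&mut self, id: GrantId, guest_phys: u64, len: u32, device_may_write: bool) {
        self.grants.insert(id, Grant { guest_phys, len, device_may_write });
    }

    /// Validate a driver descriptor and translate it to a device descriptor.
    pub fn resolve(&self, d: &DriverDescriptor) -> Result<DeviceDescriptor, GrantError> {
        let grant = self.grants.get(&d.grant).ok_or(GrantError::UnknownGrant)?;
        // checked_add rejects offset + len wrapping past u32::MAX.
        let end = d.offset.checked_add(d.len).ok_or(GrantError::OutOfBounds)?;
        if end > grant.len {
            return Err(GrantError::OutOfBounds);
        }
        if d.device_writes && !grant.device_may_write {
            return Err(GrantError::WriteNotPermitted);
        }
        Ok(DeviceDescriptor {
            addr: grant.guest_phys + u64::from(d.offset),
            len: d.len,
            device_writes: d.device_writes,
        })
    }
}
```

The driver can still choose which granted buffer a command touches, but it
cannot name memory outside a grant or upgrade a read-only grant into a
device-writable one.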

## BAR, MSI-X, And DMA Are Separate Authority Surfaces

BAR/MMIO controls CPU-to-device register access. DMA controls
device-to-memory access. MSI/MSI-X controls device-to-interrupt-controller
messages. A safe user-space driver interface needs all three mediated.

- Mapping a BAR is not enough; a BAR write can enable bus mastering or ring a
  doorbell that makes descriptors visible to the device.
- MSI-X tables often live inside a BAR. A driver must not get arbitrary write
  access to MSI-X message address/data entries unless the kernel or hypervisor
  can mediate interrupt remapping.
- IOMMU memory remapping does not by itself protect BAR register semantics or
  interrupt routing.

For capOS, `DeviceMmio`, `DMAPool`/`DMABuffer`, and `Interrupt` must remain
separate capabilities with a single device-manager ledger tying them to the
same owner generation and teardown state.
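
A minimal sketch of such a ledger follows, assuming a per-device generation
counter and a teardown flag. The types `DeviceCap`, `CapKind`, `DeviceLedger`,
and `CapError` are hypothetical stand-ins for the capOS capabilities named
above, not their actual definitions.

```rust
use std::collections::HashMap;

/// Three separate authority surfaces, issued as three separate capabilities.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CapKind {
    Mmio,
    Dma,
    Interrupt,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DeviceCap {
    pub device: u32,
    pub kind: CapKind,
    pub generation: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CapError {
    UnknownDevice,
    Stale,
    TearingDown,
}

struct Entry {
    generation: u64,
    tearing_down: bool,
}

/// Hypothetical device-manager ledger: one entry per device.
#[derive(Default)]
pub struct DeviceLedger {
    entries: HashMap<u32, Entry>,
}

impl DeviceLedger {
    /// Bind a device to a new owner: bump the generation and issue one
    /// capability per surface, all stamped with that generation.
    pub fn bind(&mut self, device: u32) -> [DeviceCap; 3] {
        let entry = self
            .entries
            .entry(device)
            .or_insert(Entry { generation: 0, tearing_down: false });
        entry.generation += 1;
        entry.tearing_down = false;
        let generation = entry.generation;
        [CapKind::Mmio, CapKind::Dma, CapKind::Interrupt]
            .map(|kind| DeviceCap { device, kind, generation })
    }

    /// Teardown disables all three surfaces at once.
    pub fn begin_teardown(&mut self, device: u32) {
        if let Some(entry) = self.entries.get_mut(&device) {
            entry.tearing_down = true;
        }
    }

    /// Every use of any capability is checked against the same ledger entry.
    pub fn check(&self, cap: &DeviceCap) -> Result<(), CapError> {
        let entry = self.entries.get(&cap.device).ok_or(CapError::UnknownDevice)?;
        if cap.generation != entry.generation {
            return Err(CapError::Stale);
        }
        if entry.tearing_down {
            return Err(CapError::TearingDown);
        }
        Ok(())
    }
}
```

Because all three capabilities resolve to one entry, a teardown or rebind
cannot leave, for example, a live interrupt capability behind after the DMA
capability has been revoked.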

## No-IOMMU Bounce-Buffer Consequences

On a machine shape without guest-programmable remapping, a real PCI device's
device-visible address is the host physical or bus address the controller uses
for DMA. A bounce buffer can keep the *data* path manager-owned, but it does
not by itself create an IOVA namespace that is safe to hand to an untrusted
driver.

The no-IOMMU fallback can preserve no-host-physical-exposure only if userspace
does not author raw device-address fields. The kernel or a trusted device
manager must instead:

- allocate and pin the device-visible bounce pages;
- program queue-base registers and PRP/SGL or virtqueue address fields, or
  translate typed driver requests into those fields;
- copy between device-visible bounce pages and non-device memory when the
  selected backend requires it;
- quiesce outstanding DMA before revoke or page reuse;
- scrub bounce pages before reuse;
- keep `hostile_hardware_isolation=not-claimed`.
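
The copy, quiesce, and scrub obligations in this list can be sketched as a
manager-owned slot pool. This is a simplified model under assumptions: plain
`Vec<u8>` buffers stand in for pinned device-visible pages, the in-flight flag
stands in for real completion tracking, and `BouncePool`, `SlotId`, and
`BounceError` are hypothetical names.

```rust
/// Opaque handle given to the driver; it carries no device-visible address.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SlotId(usize);

#[derive(Debug, PartialEq, Eq)]
pub enum BounceError {
    Exhausted,
    TooLarge,
    DmaInFlight,
    NotInUse,
}

struct Slot {
    data: Vec<u8>, // stands in for a pinned, device-visible bounce page
    in_use: bool,
    dma_in_flight: bool,
}

pub struct BouncePool {
    slots: Vec<Slot>,
}

impl BouncePool {
    pub fn new(count: usize, slot_size: usize) -> Self {
        let slots = (0..count)
            .map(|_| Slot { data: vec![0; slot_size], in_use: false, dma_in_flight: false })
            .collect();
        BouncePool { slots }
    }

    /// The pool is bounded, so allocation can fail.
    pub fn acquire(&mut self) -> Result<SlotId, BounceError> {
        let idx = self.slots.iter().position(|s| !s.in_use).ok_or(BounceError::Exhausted)?;
        self.slots[idx].in_use = true;
        Ok(SlotId(idx))
    }

    fn slot_mut(&mut self, id: SlotId) -> Result<&mut Slot, BounceError> {
        self.slots.get_mut(id.0).filter(|s| s.in_use).ok_or(BounceError::NotInUse)
    }

    /// Copy caller data into the bounce slot before submission.
    pub fn copy_in(&mut self, id: SlotId, src: &[u8]) -> Result<(), BounceError> {
        let slot = self.slot_mut(id)?;
        if slot.dma_in_flight { return Err(BounceError::DmaInFlight); }
        if src.len() > slot.data.len() { return Err(BounceError::TooLarge); }
        slot.data[..src.len()].copy_from_slice(src);
        Ok(())
    }

    /// Copy completed data out of the bounce slot.
    pub fn copy_out(&mut self, id: SlotId, dst: &mut [u8]) -> Result<(), BounceError> {
        let slot = self.slot_mut(id)?;
        if slot.dma_in_flight { return Err(BounceError::DmaInFlight); }
        if dst.len() > slot.data.len() { return Err(BounceError::TooLarge); }
        dst.copy_from_slice(&slot.data[..dst.len()]);
        Ok(())
    }

    /// Manager marks submission (true) and completion or quiesce (false).
    pub fn set_in_flight(&mut self, id: SlotId, in_flight: bool) -> Result<(), BounceError> {
        self.slot_mut(id)?.dma_in_flight = in_flight;
        Ok(())
    }

    /// Release refuses while DMA is outstanding, then scrubs before reuse.
    pub fn release(&mut self, id: SlotId) -> Result<(), BounceError> {
        let slot = self.slot_mut(id)?;
        if slot.dma_in_flight { return Err(BounceError::DmaInFlight); }
        slot.data.fill(0);
        slot.in_use = false;
        Ok(())
    }
}
```

The sketch deliberately omits queue programming: in the brokered design the
manager, not the holder of a `SlotId`, would write the slot's device-visible
address into descriptors.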

The costs are direct: extra copies, higher latency, CPU and cache pressure, the
risk of exhausting a bounded bounce pool, more teardown bookkeeping, and no
hostile-hardware memory isolation claim. They are the price of not exposing
host physical addresses when no guest-programmable remapping exists.

## GCP And QEMU Implications

The GCE probes in
[`cloud-dma-provider-evidence.md`](cloud-dma-provider-evidence.md) show no
guest-programmable IOMMU on the sampled GCP shapes: no usable DMAR/IVRS/IORT
tables or IOMMU groups, and SWIOTLB software bounce buffering in the Linux
guest. Host-side or provider-side isolation may still exist, but capOS cannot
program or validate it from inside the guest.

The practical split is:

- QEMU `run-iommu-remapping` remains the right local proof lane for
  direct-remapping behavior: domain-scoped IOVA export, per-device domains,
  invalidation, faults, and stale-DMA behavior.
- GCP storage and NIC driver planning must treat the probed shapes as
  no-IOMMU/bounce-buffer targets until a future runtime probe observes a
  guest-programmable remapping unit.
- A design that requires the provider to write device-visible queue-base or
  PRP/SGL addresses is valid only on a verified direct-remapping/vIOMMU path, or
  after capOS implements a separate synthetic address namespace that the kernel
  translates before hardware sees it.
- On the current GCP/no-IOMMU path, the recommended storage design is
  brokered: userspace owns protocol decisions and buffer capabilities, while
  the kernel or device manager materializes the device-visible DMA addresses.

## Recommended capOS Backend Modes

Use three explicit modes in planning and task acceptance:

| Mode | When it applies | User-space device-address exposure |
| --- | --- | --- |
| `direct-remapping` | capOS discovers, programs, and validates a guest-visible IOMMU/vIOMMU domain. | Domain-scoped IOVA only, labeled as meaningless outside that domain. |
| `brokered-bounce` | No usable guest IOMMU, but a manager-owned bounce path can safely support the device. | None: provider passes buffer caps, grant IDs, or typed commands; kernel writes device-visible addresses. |
| `unsupported` | Observations are contradictory, unsafe, or no safe brokered path exists. | None: device stays unbound or disabled. |

For GCP today, `brokered-bounce` is the only credible storage/NIC driver target
on the probed shapes. `direct-remapping` remains a QEMU proof lane and a future
cloud/hardware lane only after runtime evidence shows guest-programmable
remapping.
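
The mode table can be read as a small decision function. The sketch below is
one possible encoding, assuming a hypothetical `ProbeResult` that summarizes
runtime observations; the field names and the `select_mode` function are
illustrations and not an existing capOS interface.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BackendMode {
    DirectRemapping,
    BrokeredBounce,
    Unsupported,
}

/// Hypothetical summary of what a runtime probe observed.
pub struct ProbeResult {
    /// Observations agree with each other (no contradictory tables or groups).
    pub observations_consistent: bool,
    /// A guest-visible IOMMU/vIOMMU was discovered.
    pub remapping_unit_found: bool,
    /// capOS programmed a domain and validated that it actually remaps.
    pub remapping_validated: bool,
    /// A manager-owned bounce path can safely support the device.
    pub bounce_path_safe: bool,
}

/// Pick a backend mode; anything doubtful falls through to `Unsupported`.
pub fn select_mode(probe: &ProbeResult) -> BackendMode {
    if !probe.observations_consistent {
        return BackendMode::Unsupported;
    }
    if probe.remapping_unit_found && probe.remapping_validated {
        return BackendMode::DirectRemapping;
    }
    if probe.bounce_path_safe {
        return BackendMode::BrokeredBounce;
    }
    BackendMode::Unsupported
}

/// Only direct remapping exposes a (domain-scoped) IOVA to user space.
pub fn driver_may_see_iova(mode: BackendMode) -> bool {
    matches!(mode, BackendMode::DirectRemapping)
}
```

Under this encoding, a discovered but unvalidated remapping unit does not
qualify for `direct-remapping`; it falls back to `brokered-bounce` if a safe
bounce path exists and to `unsupported` otherwise.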