# NVMe Userspace Provider: Conditional Model B Doorbell/Notify DMA Validator

## Operator Decision (2026-05-27)

For the userspace NVMe-class storage provider
(`docs/proposals/cloud-driver-foundation-gap-analysis.md`, NVMe child chain),
the operator selected **Model B: provider-writes-everything,
kernel-validates-on-notify** for the direct-remapping userspace-driver lane.
The decision was intended to override the kernel-mints-the-address model
(Model A) that the gap analysis originally recommended for the storage chain
and that the landed virtio-net TX provider uses.

The operator's stated reason: capOS wants the *genuine* userspace-driver model,
where the driver process — not the kernel — owns and writes the device-visible
addresses it programs into the controller. Model A keeps device-address minting
inside the kernel, which is safe but is not a real userspace driver: the
provider only *places* a value the kernel already chose. Model B makes the
provider a first-class driver and moves the kernel from address-author to
address-validator.

Correction recorded later on 2026-05-27: Model B cannot be used on the current
no-IOMMU `run-pci-nvme` or probed GCP bounce-buffer path without exporting host
physical addresses to userspace. It remains valid for a verified
direct-remapping/vIOMMU lane, or for a future synthetic device-address namespace
that the manager translates before hardware sees it. The GCP/no-IOMMU path must
use brokered bounce address publication instead.

This is a design-and-task slice only. The landed `nvme-doorbell-dma-validator`
mechanism remains the direct-remapping/synthetic-address validator component;
the no-IOMMU controller-enable work is re-planned as a brokered-bounce slice.

## Model A vs Model B

| Dimension | Model A / brokered address publication (kernel/device-manager materializes) | Model B (provider-writes, kernel-validates) |
| --- | --- | --- |
| Who writes the device-visible address | Kernel or device manager writes queue-base/PRP/SGL values from live buffer authority. Provider submits typed requests or places opaque kernel-authored values only when that is safe. | Provider writes the device-visible address itself into ASQ/ACQ/SQ/CQ bases and PRP/SGL entries. |
| Kernel role | Author of every device address; trivially correct by construction; no scan needed. | Validator: on each doorbell/notify, scan the submitted descriptors/queue-base registers and reject any address outside the owner's granted DMA window. |
| New kernel component | None. | A ring/queue-scan **on-notify DMA validator** (this proposal). |
| Driver authenticity | Provider owns protocol choices but not raw device-address authorship. This is required when device-visible equals host physical. | Provider is a real driver that owns its addresses. |
| Where it applies | No-IOMMU brokered-bounce paths, including probed GCP shapes and the current no-IOMMU `run-pci-nvme` gate. | Verified direct-remapping/vIOMMU paths, or a future synthetic address namespace. |

The two models coexist. The existing virtio-net TX path keeps
brokered/kernel-authored device addresses. The NVMe validator is retained for
lanes where provider-written addresses are not host physical addresses. A
`DeviceMmio` doorbell claim must declare which model is active; no-IOMMU claims
must not accept provider-authored raw device addresses.
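
The declaration rule above can be pictured as a small admission check at claim time. The following is a minimal sketch, not the landed kernel code: `AddressModel`, `DmaLane`, `ClaimError`, and `check_claim_model` are hypothetical names introduced only for illustration.

```rust
/// Which party authors device-visible addresses for a doorbell claim.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AddressModel {
    /// Kernel or device manager materializes every address (brokered).
    Brokered,
    /// Provider writes addresses; the kernel validates them on notify (Model B).
    ProviderWritten,
}

/// How the device reaches memory on this lane. (Hypothetical type.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DmaLane {
    /// Device-visible address equals host physical (bounce-buffer path).
    NoIommu,
    /// Verified direct-remapping/vIOMMU: addresses are domain-scoped IOVAs.
    DirectRemapping,
    /// Synthetic namespace translated by trusted code before hardware sees it.
    SyntheticNamespace,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ClaimError {
    /// Provider-authored raw addresses would be host physical on this lane.
    ProviderAddressesOnNoIommu,
}

/// Reject the one combination the text forbids; accept every other pairing.
pub fn check_claim_model(lane: DmaLane, model: AddressModel) -> Result<(), ClaimError> {
    match (lane, model) {
        (DmaLane::NoIommu, AddressModel::ProviderWritten) => {
            Err(ClaimError::ProviderAddressesOnNoIommu)
        }
        _ => Ok(()),
    }
}
```

Under these assumptions, a no-IOMMU claim that declares `ProviderWritten` is refused at claim time, while the brokered model is accepted on every lane.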

## What Model B Requires: the On-Notify DMA Validator

The validator is a kernel component invoked on the **doorbell/notify path** of
the NVMe provider's `DeviceMmio` selected-write claim. Before the doorbell write
reaches the device (i.e. before the controller can fetch the just-submitted
descriptors or act on a just-programmed queue base), the kernel scans the
device-visible addresses the provider wrote and **fails closed** if any address
is not inside that owner's granted DMA window(s).

### Scan targets (what the validator reads)

1. **Queue-base registers**, scanned when the doorbell/notify that arms a queue
   is rung (or on the controller-enable / `CC.EN` write that activates the admin
   queue): `ASQ`, `ACQ`, and the I/O `SQ`/`CQ` base addresses the provider
   programmed through its selected-write `DeviceMmio` claim.
2. **Submission-queue entries** newly made visible by an SQ tail doorbell: the
   PRP1/PRP2 entries (and, where used, the PRP list pages and SGL descriptors)
   of each NVMe command between the last validated tail and the new tail. The
   validator follows one level of PRP-list indirection; deeper SGL/PRP-chain
   shapes are out of scope for the bounded proof and are rejected, not silently
   accepted.

The validator scans **only on notify** — not on every provider memory write.
The provider may freely write into its own mapped DMA pages between doorbells;
nothing device-reachable happens until a doorbell rings, and that is the single
choke point the kernel guards. This bounds the validation cost to the
descriptors a single doorbell newly publishes (one queue entry for a depth-1
admin proof, a small bounded batch otherwise), not to the whole address space.
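
To make the "between the last validated tail and the new tail" range concrete, here is a minimal sketch of how the newly published slots could be enumerated on a wrapping ring. The function name `newly_published_slots` and its parameters are assumptions for illustration; the landed validator may track tails differently.

```rust
/// Yields the submission-queue slot indices made visible by moving the tail
/// from `last_validated_tail` to `new_tail` in a ring of `depth` entries.
///
/// Returns `None` (treated as fail-closed by the caller) when the depth or
/// either index is out of range. (Hypothetical helper, illustration only.)
pub fn newly_published_slots(
    depth: u32,
    last_validated_tail: u32,
    new_tail: u32,
) -> Option<impl Iterator<Item = u32>> {
    // NVMe queues hold at most 65,536 entries; tails index into [0, depth).
    if depth == 0 || depth > 65_536 || last_validated_tail >= depth || new_tail >= depth {
        return None;
    }
    // Number of entries published, accounting for wrap-around.
    let count = (new_tail + depth - last_validated_tail) % depth;
    // Walk forward from the last validated tail, wrapping at `depth`.
    Some((0..count).map(move |i| (last_validated_tail + i) % depth))
}
```

The iterator length is the scan cost for that doorbell: moving the tail from 6 to 2 in a depth-8 ring yields slots 6, 7, 0, 1, and an unchanged tail yields nothing to scan.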

### Invariants (fail-closed on any violation)

- **Bounds.** Every scanned device-visible address, and the full extent of the
  region it names (queue size × entry size for a queue base; transfer length for
  a PRP/SGL data pointer), must lie wholly within a DMA window granted to the
  owning provider. An address at the window edge whose region runs past the
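  window end fails closed. Unaligned queue-base or PRP addresses fail closed:
  the validator requires a page-aligned PRP1 (stricter than the NVMe
  specification, which permits a dword-aligned offset in the first PRP entry)
  and aligned queue bases (the specification requires `ASQ`/`ACQ` to be
  memory-page-aligned).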
- **Owner-scoping.** The window set checked is exactly the set granted to the
  provider that owns the `DeviceMmio` doorbell claim being rung. An address that
  is valid for *another* owner's window is rejected for this owner: no aliasing
  into a different owner's DMA region, no host-physical address, no
  out-of-any-window address. The validator resolves "owner" from the doorbell
  claim's grant identity, not from the address value.
- **No host-physical / no out-of-window.** The provider-written value must be a
  domain-scoped IOVA or synthetic device address, never a host physical address.
  On the current no-IOMMU bounce path this invariant cannot be satisfied by
  provider-authored queue-base/PRP values, because device-visible equals host
  physical and userspace export is disabled.
- **Stale-completion / generation.** The validator binds its accept decision to
  the live grant generation of the owner's DMA window and doorbell claim. A
  doorbell rung after revoke/reset/regrant against a stale generation fails
  closed even if the byte value would have been in-window for the prior grant.
  Completions are accepted only against the issue/generation that was live at
  submission scan time, matching the existing stale-completion gate on the
  virtio-net path; a completion whose submission was never validated (or was
  validated under a now-retired generation) does not wake a waiter.
- **On-notify timing.** The scan completes and either accepts or rejects
  **before** the doorbell write is allowed to take effect on the device. A
  rejected scan does not write the doorbell, returns a fail-closed error to the
  provider's `DeviceMmio` write, and records the rejection; the device never
  sees the descriptor batch. There is no window in which the controller can
  fetch an unvalidated descriptor.
- **Quiesce/teardown.** On release/reset/driver-death, in-flight doorbell scans
  are quiesced, the owner's windows are removed from the validator's accepted
  set, backing pages are scrubbed before frame reuse, and any subsequently rung
  doorbell against the retired grant fails closed.
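
The bounds, owner-scoping, and generation invariants can be combined into one containment check. The sketch below is illustrative only: `OwnerId`, `DmaWindow`, `Reject`, and `validate_region` are hypothetical names, and the window slice stands in for the grant ledger. The host-physical invariant is not expressible as a value test, so it is not modeled here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct OwnerId(pub u32);

/// One granted DMA window. (Hypothetical record shape.)
#[derive(Debug, Clone, Copy)]
pub struct DmaWindow {
    pub owner: OwnerId,
    pub base: u64,
    pub len: u64,
    pub generation: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Reject {
    Unaligned,
    Empty,
    Overflow,
    OutOfWindow,
    StaleGeneration,
}

/// Accepts only if `[addr, addr + len)` lies wholly inside a window granted to
/// `owner` at the live generation. Every other outcome is a rejection.
pub fn validate_region(
    windows: &[DmaWindow],
    owner: OwnerId,
    live_generation: u64,
    addr: u64,
    len: u64,
    align: u64,
) -> Result<(), Reject> {
    // `align` must be a power of two; zero is rejected before the mask is used.
    if !align.is_power_of_two() || addr & (align - 1) != 0 {
        return Err(Reject::Unaligned);
    }
    if len == 0 {
        return Err(Reject::Empty);
    }
    // Overflow-safe end computation: a wrapping end must never pass the check.
    let end = addr.checked_add(len).ok_or(Reject::Overflow)?;
    let mut saw_stale = false;
    // Owner-scoping: only this owner's windows are ever consulted.
    for w in windows.iter().filter(|w| w.owner == owner) {
        let w_end = match w.base.checked_add(w.len) {
            Some(e) => e,
            None => continue, // malformed window record: never accept against it
        };
        if addr >= w.base && end <= w_end {
            if w.generation == live_generation {
                return Ok(());
            }
            saw_stale = true;
        }
    }
    Err(if saw_stale { Reject::StaleGeneration } else { Reject::OutOfWindow })
}
```

In this sketch an address that falls inside another owner's window is indistinguishable from an address outside every window, which matches the owner-scoping rule: both report `OutOfWindow`.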

### Where it hooks

The validator hooks the NVMe provider's **selected-write `DeviceMmio` doorbell
claim** in the kernel capability layer — the same selected-write claim the
bring-up slice scopes to the NVMe enable/admin-queue-base/doorbell registers
(mirroring the virtio-net notify-write claim). Concretely:

- The doorbell/queue-base `DeviceMmio.write*` path
  (`kernel/src/cap/device_mmio.rs`) gains a pre-write validation step for the
  NVMe doorbell/queue-arm register subset.
- The scan reads the provider's mapped SQ pages and queue-base register shadow
  through the manager-owned DMA window records
  (`kernel/src/device_dma.rs`), checking containment against the owner's granted
  window descriptors. It does not gain a generic memory-read authority over the
  provider; it reads only the descriptor/queue-base bytes the doorbell newly
  publishes, via the manager's record of the owner's DMA pages.
- Generation/owner identity comes from the grant ledger
  (`kernel/src/device_dma.rs` / the `*_grant_source` records), not from
  provider-supplied metadata.

This is a kernel-side, capability-scoped, on-notify check — not a new ambient
syscall and not a per-write trap on all provider memory.
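
The ordering the hook must preserve (scan first, device write only on accept) can be sketched as a small gate. This is an assumption-laden illustration rather than the code in `kernel/src/cap/device_mmio.rs`: `DoorbellGate`, `DoorbellError`, and `ring` are hypothetical names, and the closures stand in for the real scan and MMIO write.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DoorbellError {
    /// The on-notify scan rejected the batch; the device was not touched.
    ScanRejected,
}

/// Tracks rejections for one doorbell claim. (Hypothetical type.)
pub struct DoorbellGate {
    pub rejections: u32,
}

impl DoorbellGate {
    /// Runs `scan` to completion before the doorbell write can happen.
    /// `write_doorbell` is invoked only when the scan accepts.
    pub fn ring<S, W>(&mut self, value: u32, scan: S, write_doorbell: W) -> Result<(), DoorbellError>
    where
        S: FnOnce() -> bool,
        W: FnOnce(u32),
    {
        if !scan() {
            // Fail closed: record the rejection and return without writing.
            self.rejections += 1;
            return Err(DoorbellError::ScanRejected);
        }
        write_doorbell(value);
        Ok(())
    }
}
```

Because the write closure is consumed only on the accept branch, a rejected scan cannot reach the device in this sketch; the provider sees the error on its own `DeviceMmio` write.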

### Performance note

The validator runs **only on the notify/doorbell path**, not on the data path
and not on every provider write. Its cost is O(descriptors newly published by
this doorbell) — one entry for the depth-1 admin/IDENTIFY proof, a small bounded
batch for the I/O queue. Steady-state provider memory writes between doorbells
are uninstrumented. This keeps the genuine-driver model without a per-access
trap and without copying the data path through the kernel.

## No-IOMMU Correction And Brokered Bounce Path

On GCE shapes without a usable guest IOMMU, and on the current no-IOMMU
`make run-pci-nvme` gate, the labeled bounce-buffer backend does **not** provide
a provider-visible IOVA namespace. The device-visible value a real NVMe
controller consumes is the host physical or bus address of a manager-owned page.
Publishing that value to userspace would violate the reviewed
no-host-physical-exposure invariant.

Therefore the no-IOMMU storage path must be brokered:

- The provider receives buffer capabilities, queue ownership handles, and typed
  NVMe command intent, not raw queue-base or PRP addresses.
- The kernel or device manager allocates/pins the bounce pages and writes
  `AQA`/`ASQ`/`ACQ`, I/O queue-base, and PRP/SGL fields from the live ledger.
- The selected `DeviceMmio` claim gates `CC.EN`, queue-arm, and doorbell writes
  on the brokered ledger state, not on provider-supplied numeric addresses.
- Teardown still quiesces outstanding DMA, blocks stale completions, scrubs
  pages before reuse, and keeps `hostile_hardware_isolation=not-claimed`.
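
As a rough illustration of the brokered split, the sketch below has the provider name a buffer by handle while trusted code fills in the PRP value from its own ledger. All names (`BufferHandle`, `ReadIntent`, `BounceLedger`, `MaterializedRead`) are hypothetical, and `std::collections::HashMap` is used only to keep the example self-contained; kernel code would use its own ledger structure.

```rust
use std::collections::HashMap;

/// Opaque buffer capability handle held by the provider. (Hypothetical.)
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct BufferHandle(pub u32);

/// Typed command intent: the provider names a buffer, never an address.
pub struct ReadIntent {
    pub buffer: BufferHandle,
    pub lba: u64,
    pub blocks: u16,
}

/// Command fields as trusted code would write them into the queue entry.
/// This value stays on the kernel/manager side of the boundary.
pub struct MaterializedRead {
    pub prp1: u64,
    pub lba: u64,
    pub blocks: u16,
}

/// Manager-private ledger mapping live handles to pinned bounce-page addresses.
pub struct BounceLedger {
    pages: HashMap<BufferHandle, u64>,
}

impl BounceLedger {
    pub fn new() -> Self {
        BounceLedger { pages: HashMap::new() }
    }

    /// Record a pinned bounce page for a handle.
    pub fn pin(&mut self, handle: BufferHandle, device_addr: u64) {
        self.pages.insert(handle, device_addr);
    }

    /// Drop a handle from the live ledger (teardown or revoke).
    pub fn release(&mut self, handle: BufferHandle) {
        self.pages.remove(&handle);
    }

    /// Fill in the device address from the live ledger; `None` if the handle
    /// is not live, in which case nothing is submitted.
    pub fn materialize(&self, intent: &ReadIntent) -> Option<MaterializedRead> {
        let prp1 = *self.pages.get(&intent.buffer)?;
        Some(MaterializedRead { prp1, lba: intent.lba, blocks: intent.blocks })
    }
}
```

The design point is that the numeric address appears only in `MaterializedRead`, which the provider never receives; a released handle simply fails to materialize.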

Model B can be reintroduced for NVMe when the proof gate is a verified
direct-remapping/vIOMMU shape where the provider-visible value is a
domain-scoped IOVA, or after capOS implements a synthetic address namespace that
is translated by trusted code before the controller observes it.

## Brokered Alternative For No-IOMMU

The brokered model is no longer a rejected storage alternative for no-IOMMU
targets. It is the required GCP/no-IOMMU design until a safe non-host-physical
device-address namespace exists. Its tradeoff is narrower driver authenticity:
userspace owns NVMe protocol state and command construction, but trusted kernel
or manager code remains the author of raw device addresses.

## Implementing Slices

- `nvme-doorbell-dma-validator` (**landed 2026-05-27 08:56 UTC**): the kernel on-notify DMA
  validator mechanism (`kernel/src/cap/nvme_doorbell_validator.rs`,
  `validate_doorbell_scan` / `completion_wakes_waiter`) and its invariants, proven
  by the bounded `cfg(qemu)` hostile-scan self-test
  (`prove_qemu_on_notify_scan_contract`) that `make run-pci-nvme` asserts:
  out-of-window, host-physical, cross-owner-alias, region-overrun, unaligned,
  deeper-PRP-chain, and stale-generation all fail closed with no doorbell write
  and no waiter wake. Synthetic owner windows stand in for the live grant ledger;
  the live `DeviceMmio` doorbell-path wiring is the bring-up slice below. This is
  the kernel component Model B requires; the controller bring-up slice depends on
  it. Provenance map: [`docs/devices/nvme.md`](../devices/nvme.md).
- `nvme-no-iommu-brokered-controller-enable` (**landed 2026-05-27 21:38 UTC**,
  commit `11b86568`): no-IOMMU replacement for the blocked provider-written
  enable task; brokered admin queue-base materialization with no host-physical
  export.
- `nvme-userspace-bind-and-controller-bringup`: remains blocked unless
  re-scoped to an IOMMU/vIOMMU proof lane or replaced by the brokered no-IOMMU
  slice above.
- `nvme-admin-queue-identify` (**landed 2026-05-27 22:34 UTC**, commit
  `cede5257`) closes the no-IOMMU admin command.
- `nvme-admin-interrupt-delivery` (**landed 2026-05-27 23:07 UTC**, commit
  `18fd25c7`) closes the admin completion wake.
- `nvme-io-queue-and-read` is the ready brokered I/O/read continuation. It
  inherits the same split: provider-written PRPs require direct remapping or a
  synthetic namespace; no-IOMMU GCP planning requires brokered PRP
  materialization.

## Design Grounding

- `docs/proposals/cloud-driver-foundation-gap-analysis.md` (the foundation map
  and the original Model A recommendation this overrides for storage)
- `docs/dma-isolation-design.md` (Cloud DMA Backend; bounce-buffer fallback;
  IOVA/window discipline; teardown/scrub ordering)
- `docs/proposals/dma-assurance-model-proposal.md`
- `docs/tasks/done/2026-05-23/ddf-provider-virtio-net-driver-closeout.md`
  (the Model A virtio-net TX provider that this leaves unchanged)
- `kernel/src/cap/device_mmio.rs` (the selected-write claim the validator hooks),
  `kernel/src/device_dma.rs` (owner DMA window records / grant generation),
  `kernel/src/cap/{dma_pool,dma_buffer,interrupt}_grant_source.rs`,
  `kernel/src/pci.rs` (NVMe enumeration today)
