NVMe Userspace Provider: Conditional Model B Doorbell/Notify DMA Validator

Operator Decision (2026-05-27)

For the userspace NVMe-class storage provider (docs/proposals/cloud-driver-foundation-gap-analysis.md, NVMe child chain), the operator selected Model B (provider-writes-everything, kernel-validates-on-notify) for the direct-remapping userspace-driver lane. The decision was intended to override the kernel-mints-the-address model (Model A) that the gap analysis originally recommended for the storage chain and that the landed virtio-net TX provider uses.

The operator’s stated reason: capOS wants the genuine userspace-driver model, where the driver process — not the kernel — owns and writes the device-visible addresses it programs into the controller. Model A keeps device-address minting inside the kernel, which is safe but is not a real userspace driver: the provider only places a value the kernel already chose. Model B makes the provider a first-class driver and moves the kernel from address-author to address-validator.

Correction recorded later on 2026-05-27: Model B cannot be used on the current no-IOMMU run-pci-nvme or probed GCP bounce-buffer path without exporting host physical addresses to userspace. It remains valid for a verified direct-remapping/vIOMMU lane, or for a future synthetic device-address namespace that the manager translates before hardware sees it. The GCP/no-IOMMU path must use brokered bounce address publication instead.

This is a design-and-task slice only. The landed nvme-doorbell-dma-validator mechanism remains the direct-remapping/synthetic-address validator component; the no-IOMMU controller-enable work is re-planned as a brokered-bounce slice.

Model A vs Model B

| Dimension | Brokered address publication (kernel/device-manager materializes) | Model B (provider-writes, kernel-validates) |
| --- | --- | --- |
| Who writes the device-visible address | Kernel or device manager writes queue-base/PRP/SGL values from live buffer authority. Provider submits typed requests or places opaque kernel-authored values only when that is safe. | Provider writes the device-visible address itself into ASQ/ACQ/SQ/CQ bases and PRP/SGL entries. |
| Kernel role | Author of every device address; trivially correct by construction; no scan needed. | Validator: on each doorbell/notify, scan the submitted descriptors/queue-base registers and reject any address outside the owner’s granted DMA window. |
| New kernel component | None. | A ring/queue-scan on-notify DMA validator (this proposal). |
| Driver authenticity | Provider owns protocol choices but not raw device-address authorship. This is required when device-visible equals host physical. | Provider is a real driver that owns its addresses. |
| Where it applies | No-IOMMU brokered-bounce paths, including probed GCP shapes and the current no-IOMMU run-pci-nvme gate. | Verified direct-remapping/vIOMMU paths, or a future synthetic address namespace. |

The two models coexist. The existing virtio-net TX path keeps brokered/kernel-authored device addresses. The NVMe validator is retained for lanes where provider-written addresses are not host physical addresses. A DeviceMmio doorbell claim must declare which model is active; no-IOMMU claims must not accept provider-authored raw device addresses.
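The rule that a doorbell claim declares its model, and that no-IOMMU claims refuse provider-authored addresses, can be sketched as a small admission check. This is an illustration only: `DmaBackend`, `AddressModel`, `ClaimError`, and `admit_doorbell_claim` are hypothetical names, not identifiers from the capOS tree.

```rust
/// Hypothetical description of how device-visible addresses relate to host memory.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DmaBackend {
    /// No IOMMU: the device-visible address equals the host physical address.
    NoIommuBounce,
    /// Verified direct-remapping/vIOMMU: the device sees a domain-scoped IOVA.
    DirectRemapping,
    /// Future synthetic namespace translated by trusted code before hardware sees it.
    SyntheticNamespace,
}

/// Hypothetical model tag a doorbell claim would declare.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AddressModel {
    /// Kernel or device manager authors every device address.
    Brokered,
    /// Model B: provider writes addresses, kernel validates on notify.
    ProviderWritten,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ClaimError {
    /// Provider-written values would have to be host physical addresses.
    ProviderAddressesWouldBeHostPhysical,
}

/// Admit a claim only if the declared model is safe for the backend.
pub fn admit_doorbell_claim(
    backend: DmaBackend,
    model: AddressModel,
) -> Result<AddressModel, ClaimError> {
    match (backend, model) {
        (DmaBackend::NoIommuBounce, AddressModel::ProviderWritten) => {
            Err(ClaimError::ProviderAddressesWouldBeHostPhysical)
        }
        // Brokered is acceptable everywhere; Model B only off the no-IOMMU path.
        _ => Ok(model),
    }
}
```

Encoding the model in the claim, rather than inferring it from the written values, keeps the decision tied to grant state instead of provider-supplied data.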

What Model B Requires: the On-Notify DMA Validator

The validator is a kernel component invoked on the doorbell/notify path of the NVMe provider’s DeviceMmio selected-write claim. Before the doorbell write reaches the device (i.e. before the controller can fetch the just-submitted descriptors or act on a just-programmed queue base), the kernel scans the device-visible addresses the provider wrote and fails closed if any address is not inside that owner’s granted DMA window(s).

Scan targets (what the validator reads)

  1. Queue-base registers, scanned when the doorbell/notify that arms a queue is rung (or on the controller-enable / CC.EN write that activates the admin queue): ASQ, ACQ, and the I/O SQ/CQ base addresses the provider programmed through its selected-write DeviceMmio claim.
  2. Submission-queue entries newly made visible by an SQ tail doorbell: the PRP1/PRP2 entries (and, where used, the PRP list pages and SGL descriptors) of each NVMe command between the last validated tail and the new tail. The validator follows one level of PRP-list indirection; deeper SGL/PRP-chain shapes are out of scope for the bounded proof and are rejected, not silently accepted.
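As a rough sketch of the "one level of PRP-list indirection" rule in scan target 2, the function below classifies a data pointer by transfer length and rejects any shape that would need a chained list. It assumes the PRP list starts page-aligned, so one list page holds `page_size / 8` eight-byte entries. `PrpShape`, `PrpReject`, and `classify_prp` are hypothetical names, and alignment policy is assumed to be checked separately.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum PrpShape {
    /// The whole transfer fits in the page named by PRP1.
    Prp1Only,
    /// PRP2 is a second data page pointer.
    Prp1AndPrp2,
    /// PRP2 points at a single list page holding `entries` page pointers.
    OneLevelList { entries: u64 },
}

#[derive(Debug, PartialEq, Eq)]
pub enum PrpReject {
    ZeroLength,
    BadPageSize,
    /// The transfer would need a chained PRP list; out of scope, so rejected.
    ChainedList,
}

/// Classify the PRP shape for a transfer of `len` bytes starting at `prp1`.
pub fn classify_prp(prp1: u64, len: u64, page_size: u64) -> Result<PrpShape, PrpReject> {
    if !page_size.is_power_of_two() || page_size < 8 {
        return Err(PrpReject::BadPageSize);
    }
    if len == 0 {
        return Err(PrpReject::ZeroLength);
    }
    // Bytes covered by PRP1: from its in-page offset to the end of that page.
    let first = page_size - (prp1 & (page_size - 1));
    if len <= first {
        return Ok(PrpShape::Prp1Only);
    }
    let remaining = len - first;
    // Ceiling division without risking overflow on large lengths.
    let pages = remaining / page_size + u64::from(remaining % page_size != 0);
    if pages == 1 {
        return Ok(PrpShape::Prp1AndPrp2);
    }
    // Assumption: a page-aligned list page holds page_size / 8 entries.
    if pages <= page_size / 8 {
        Ok(PrpShape::OneLevelList { entries: pages })
    } else {
        Err(PrpReject::ChainedList)
    }
}
```

A validator built this way would then bounds-check PRP1, PRP2 or the list page itself, and every entry read from that list page.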

The validator scans only on notify — not on every provider memory write. The provider may freely write into its own mapped DMA pages between doorbells; nothing device-reachable happens until a doorbell rings, and that is the single choke point the kernel guards. This bounds the validation cost to the descriptors a single doorbell newly publishes (one queue entry for a depth-1 admin proof, a small bounded batch otherwise), not to the whole address space.
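To make the bounded cost concrete, the sketch below computes which submission-queue slots a tail doorbell newly publishes, including wraparound. `newly_published` is a hypothetical helper; it uses `Vec` from the standard library for readability, which a kernel implementation would likely replace with an iterator or fixed buffer.

```rust
/// Return the queue slot indices between the last validated tail and the
/// new tail (exclusive), in submission order. Returns `None` when either
/// tail is not a valid index for a queue of `depth` entries.
pub fn newly_published(last_validated_tail: u32, new_tail: u32, depth: u32) -> Option<Vec<u32>> {
    if depth == 0 || last_validated_tail >= depth || new_tail >= depth {
        return None;
    }
    // Widen to u64 so the modular arithmetic cannot overflow.
    let last = u64::from(last_validated_tail);
    let new = u64::from(new_tail);
    let d = u64::from(depth);
    // Number of entries published by this doorbell, accounting for wrap.
    let count = (new + d - last) % d;
    Some((0..count).map(|i| ((last + i) % d) as u32).collect())
}
```

For example, with a depth of 8, moving the tail from 6 to 2 publishes slots 6, 7, 0, and 1, and only those four entries need scanning.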

Invariants (fail-closed on any violation)

  • Bounds. Every scanned device-visible address, and the full extent of the region it names (queue size × entry size for a queue base; transfer length for a PRP/SGL data pointer), must lie wholly within a DMA window granted to the owning provider. An address at the window edge whose region runs past the window end fails closed. Unaligned queue-base or PRP addresses (NVMe requires page-aligned PRP1 for the first entry, dword-aligned queue bases) fail closed.
  • Owner-scoping. The window set checked is exactly the set granted to the provider that owns the DeviceMmio doorbell claim being rung. An address that is valid for another owner’s window is rejected for this owner: no aliasing into a different owner’s DMA region, no host-physical address, no out-of-any-window address. The validator resolves “owner” from the doorbell claim’s grant identity, not from the address value.
  • No host-physical / no out-of-window. The provider-written value must be a domain-scoped IOVA or synthetic device address, never a host physical address. On the current no-IOMMU bounce path this invariant cannot be satisfied by provider-authored queue-base/PRP values, because device-visible equals host physical and userspace export is disabled.
  • Stale-completion / generation. The validator binds its accept decision to the live grant generation of the owner’s DMA window and doorbell claim. A doorbell rung after revoke/reset/regrant against a stale generation fails closed even if the byte value would have been in-window for the prior grant. Completions are accepted only against the issue/generation that was live at submission scan time, matching the existing stale-completion gate on the virtio-net path; a completion whose submission was never validated (or was validated under a now-retired generation) does not wake a waiter.
  • On-notify timing. The scan completes and either accepts or rejects before the doorbell write is allowed to take effect on the device. A rejected scan does not write the doorbell, returns a fail-closed error to the provider’s DeviceMmio write, and records the rejection; the device never sees the descriptor batch. There is no window in which the controller can fetch an unvalidated descriptor.
  • Quiesce/teardown. On release/reset/driver-death, in-flight doorbell scans are quiesced, the owner’s windows are removed from the validator’s accepted set, backing pages are scrubbed before frame reuse, and any subsequently rung doorbell against the retired grant fails closed.
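The bounds, owner-scoping, and generation invariants above can be combined into one containment check. The following is a minimal sketch under stated assumptions, not the landed `validate_doorbell_scan`: `OwnerId`, `DmaWindow`, `ScanReject`, and `check_region` are hypothetical, and the required alignment is passed in as a parameter rather than fixed here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct OwnerId(pub u32);

/// Hypothetical record of one granted DMA window.
#[derive(Debug, Clone, Copy)]
pub struct DmaWindow {
    pub owner: OwnerId,
    pub base: u64,
    pub len: u64,
    pub generation: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ScanReject {
    EmptyRegion,
    Unaligned,
    Overflow,
    OutOfWindow,
    StaleGeneration,
}

/// Accept only if [addr, addr + len) lies wholly inside a window granted to
/// `owner` whose generation matches the doorbell claim's live generation.
pub fn check_region(
    windows: &[DmaWindow],
    owner: OwnerId,
    claim_generation: u64,
    addr: u64,
    len: u64,
    align: u64,
) -> Result<(), ScanReject> {
    if len == 0 {
        return Err(ScanReject::EmptyRegion);
    }
    if align == 0 || addr % align != 0 {
        return Err(ScanReject::Unaligned);
    }
    let end = addr.checked_add(len).ok_or(ScanReject::Overflow)?;
    let mut stale = false;
    // Owner comes from the claim, so other owners' windows are never consulted.
    for w in windows.iter().filter(|w| w.owner == owner) {
        let Some(w_end) = w.base.checked_add(w.len) else { continue };
        if addr >= w.base && end <= w_end {
            if w.generation == claim_generation {
                return Ok(());
            }
            stale = true; // in-window by value, but for a retired grant
        }
    }
    Err(if stale { ScanReject::StaleGeneration } else { ScanReject::OutOfWindow })
}
```

An address that would be valid for another owner is reported as plain `OutOfWindow` here, so the rejection does not reveal anything about other owners' windows.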

Where it hooks

The validator hooks the NVMe provider’s selected-write DeviceMmio doorbell claim in the kernel capability layer — the same selected-write claim the bring-up slice scopes to the NVMe enable/admin-queue-base/doorbell registers (mirroring the virtio-net notify-write claim). Concretely:

  • The doorbell/queue-base DeviceMmio.write* path (kernel/src/cap/device_mmio.rs) gains a pre-write validation step for the NVMe doorbell/queue-arm register subset.
  • The scan reads the provider’s mapped SQ pages and queue-base register shadow through the manager-owned DMA window records (kernel/src/device_dma.rs), checking containment against the owner’s granted window descriptors. It does not gain a generic memory-read authority over the provider; it reads only the descriptor/queue-base bytes the doorbell newly publishes, via the manager’s record of the owner’s DMA pages.
  • Generation/owner identity comes from the grant ledger (kernel/src/device_dma.rs / the *_grant_source records), not from provider-supplied metadata.

This is a kernel-side, capability-scoped, on-notify check — not a new ambient syscall and not a per-write trap on all provider memory.
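The pre-write ordering described above (scan first, write the doorbell only on accept, record rejections) might look like the following sketch. `GatedDoorbell` and `Rejected` are hypothetical; a `Vec` stands in for the MMIO register so the ordering can be observed, and the scan is passed as a closure rather than wired to real descriptor memory.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Rejected(pub &'static str);

/// Hypothetical stand-in for the doorbell write path of a selected-write claim.
#[derive(Debug, Default)]
pub struct GatedDoorbell {
    /// Values that actually reached the (simulated) device register.
    pub device_writes: Vec<u32>,
    /// Rejections recorded for later inspection.
    pub rejections: Vec<Rejected>,
}

impl GatedDoorbell {
    /// Run `scan` before the write. On rejection the device register is
    /// never touched and the error is returned to the caller.
    pub fn ring<F>(&mut self, new_tail: u32, scan: F) -> Result<(), Rejected>
    where
        F: FnOnce(u32) -> Result<(), Rejected>,
    {
        match scan(new_tail) {
            Ok(()) => {
                self.device_writes.push(new_tail);
                Ok(())
            }
            Err(r) => {
                self.rejections.push(r);
                Err(r)
            }
        }
    }
}
```

Because the write is reachable only through the `Ok` arm, the simulated controller cannot observe a tail value whose descriptors were not scanned.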

Performance note

The validator runs only on the notify/doorbell path, not on the data path and not on every provider write. Its cost is O(descriptors newly published by this doorbell) — one entry for the depth-1 admin/IDENTIFY proof, a small bounded batch for the I/O queue. Steady-state provider memory writes between doorbells are uninstrumented. This keeps the genuine-driver model without a per-access trap and without copying the data path through the kernel.

No-IOMMU Correction And Brokered Bounce Path

On GCE shapes without a usable guest IOMMU, and on the current no-IOMMU make run-pci-nvme gate, the labeled bounce-buffer backend does not provide a provider-visible IOVA namespace. The device-visible value a real NVMe controller consumes is the host physical or bus address of a manager-owned page. Publishing that value to userspace would violate the reviewed no-host-physical-exposure invariant.

Therefore the no-IOMMU storage path must be brokered:

  • The provider receives buffer capabilities, queue ownership handles, and typed NVMe command intent, not raw queue-base or PRP addresses.
  • The kernel or device manager allocates/pins the bounce pages and writes AQA/ASQ/ACQ, I/O queue-base, and PRP/SGL fields from the live ledger.
  • The selected DeviceMmio claim gates CC.EN, queue-arm, and doorbell writes on the brokered ledger state, not on provider-supplied numeric addresses.
  • Teardown still quiesces outstanding DMA, blocks stale completions, scrubs pages before reuse, and keeps hostile_hardware_isolation=not-claimed.
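A minimal sketch of the brokered split follows: the provider submits typed intent naming a buffer handle, and trusted code resolves the device address from its own ledger. Every name here (`BufferHandle`, `ReadIntent`, `BounceRecord`, `MaterializedRead`, `materialize_read`) is hypothetical, and for brevity the sketch only fills a single PRP1 value without handling transfers that would need PRP2 or a list.

```rust
use std::collections::HashMap;

/// Hypothetical opaque handle the provider holds instead of an address.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct BufferHandle(pub u32);

/// Typed intent from the provider: no queue-base or PRP value appears.
#[derive(Debug, Clone, Copy)]
pub struct ReadIntent {
    pub lba: u64,
    pub blocks: u32,
    pub buffer: BufferHandle,
}

/// Trusted-side record of a pinned bounce region (hypothetical layout).
#[derive(Debug, Clone, Copy)]
pub struct BounceRecord {
    /// Host physical on the no-IOMMU path; never returned to userspace.
    pub device_addr: u64,
    pub len: u64,
    pub live: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BrokerError {
    UnknownBuffer,
    Retired,
    TooLarge,
}

/// What trusted code would write into the submission queue entry.
#[derive(Debug, PartialEq, Eq)]
pub struct MaterializedRead {
    pub lba: u64,
    pub blocks: u32,
    pub prp1: u64,
}

/// Resolve the device address from the ledger, never from provider input.
pub fn materialize_read(
    ledger: &HashMap<BufferHandle, BounceRecord>,
    intent: ReadIntent,
    block_size: u64,
) -> Result<MaterializedRead, BrokerError> {
    let rec = ledger.get(&intent.buffer).ok_or(BrokerError::UnknownBuffer)?;
    if !rec.live {
        return Err(BrokerError::Retired);
    }
    let bytes = u64::from(intent.blocks)
        .checked_mul(block_size)
        .ok_or(BrokerError::TooLarge)?;
    if bytes == 0 || bytes > rec.len {
        return Err(BrokerError::TooLarge);
    }
    Ok(MaterializedRead { lba: intent.lba, blocks: intent.blocks, prp1: rec.device_addr })
}
```

The provider-facing types carry no numeric address at all, which is the property the no-host-physical-exposure invariant needs on this path.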

Model B can be reintroduced for NVMe when the proof gate is a verified direct-remapping/vIOMMU shape where the provider-visible value is a domain-scoped IOVA, or after capOS implements a synthetic address namespace that is translated by trusted code before the controller observes it.

Brokered Alternative For No-IOMMU

The brokered model is no longer a rejected storage alternative for no-IOMMU targets. It is the required GCP/no-IOMMU design until a safe non-host-physical device-address namespace exists. Its tradeoff is narrower driver authenticity: userspace owns NVMe protocol state and command construction, but trusted kernel or manager code remains the author of raw device addresses.

Implementing Slices

  • nvme-doorbell-dma-validator (landed 2026-05-27 08:56 UTC): the kernel on-notify DMA validator mechanism (kernel/src/cap/nvme_doorbell_validator.rs, validate_doorbell_scan / completion_wakes_waiter) and its invariants, proven by the bounded cfg(qemu) hostile-scan self-test (prove_qemu_on_notify_scan_contract) that make run-pci-nvme asserts: out-of-window, host-physical, cross-owner-alias, region-overrun, unaligned, deeper-PRP-chain, and stale-generation all fail closed with no doorbell write and no waiter wake. Synthetic owner windows stand in for the live grant ledger; the live DeviceMmio doorbell-path wiring is the bring-up slice below. This is the kernel component Model B requires; the controller bring-up slice depends on it. Provenance map: NVMe.
  • nvme-no-iommu-brokered-controller-enable (landed 2026-05-27 21:38 UTC, commit 11b86568): no-IOMMU replacement for the blocked provider-written enable task; brokered admin queue-base materialization with no host-physical export.
  • nvme-userspace-bind-and-controller-bringup: remains blocked unless re-scoped to an IOMMU/vIOMMU proof lane or replaced by the brokered no-IOMMU slice above.
  • nvme-admin-queue-identify (landed 2026-05-27 22:34 UTC, commit cede5257) closes the no-IOMMU admin command.
  • nvme-admin-interrupt-delivery (landed 2026-05-27 23:07 UTC, commit 18fd25c7) closes the admin completion wake.
  • nvme-io-queue-and-read is the ready brokered I/O/read continuation. It inherits the same split: provider-written PRPs require direct remapping or a synthetic namespace; no-IOMMU GCP planning requires brokered PRP materialization.

Design Grounding

  • docs/proposals/cloud-driver-foundation-gap-analysis.md (the foundation map and the original Model A recommendation this overrides for storage)
  • docs/dma-isolation-design.md (Cloud DMA Backend; bounce-buffer fallback; IOVA/window discipline; teardown/scrub ordering)
  • docs/proposals/dma-assurance-model-proposal.md
  • docs/tasks/done/2026-05-23/ddf-provider-virtio-net-driver-closeout.md (the Model A virtio-net TX provider that this leaves unchanged)
  • kernel/src/cap/device_mmio.rs (the selected-write claim the validator hooks), kernel/src/device_dma.rs (owner DMA window records / grant generation), kernel/src/cap/{dma_pool,dma_buffer,interrupt}_grant_source.rs, kernel/src/pci.rs (NVMe enumeration today)