# NVMe (NVM Express controller)

This is a provenance map for the NVMe controller wire subset the kernel
**Model B on-notify DMA validator** scans on the doorbell/queue-arm path. It
cites the spec basis, summarizes only the register and descriptor fields the
validator actually reads, and points into the implementation by symbol name. It
is not a re-spec.

**Maturity caveat.** This page documents the *DMA validator mechanism*, a
*brokered no-IOMMU bring-up* through one bounded I/O read on the local QEMU
`make run-pci-nvme` gate, and one bounded live-GCE Persistent Disk proof for the
production `provider-nvme-io-read` path. It is still **not** a general
production NVMe driver, not broad GCP/AWS/Azure storage readiness, and not a
provider-visible address or direct-DMA claim. It also records the 2026-05-27
correction: on the current no-IOMMU gate, provider-written queue-base or PRP
addresses would be host physical addresses, so the live no-IOMMU path must be
brokered by the kernel/device manager unless a verified IOMMU/vIOMMU or
synthetic address namespace is added. The capabilities implemented against
`make run-pci-nvme` and the later production cloudboot gates:

- `nvme-doorbell-dma-validator` is the kernel on-notify scan
  (`kernel/src/cap/nvme_doorbell_validator.rs`); it proves its invariants with a
  `cfg(qemu)` self-test (`prove_qemu_on_notify_scan_contract`) using synthetic
  owner windows in place of a live grant ledger.
- `nvme-bind-claimed-mmio-read` adds the read-only userspace bind (§4): the
  kernel claims the enumerated controller, preseeds its BAR0
  controller-register page, and stages the `DMAPool`/`DeviceMmio`/`Interrupt`
  bootstrap grant sources against it, and the userspace `nvme-bringup-smoke`
  provider reads `CAP`/`VS`/`CC`/`CSTS` through the brokered claim, proving the
  claim reaches a coherent NVMe BAR0 (live `CAP`, valid `VS` version). The
  controller is **firmware-initialized** under SeaBIOS NVMe boot-probe
  (`CC.EN=1`, `CSTS.RDY=1`), so the provider reports the observed enable/ready
  state rather than asserting reset.
- `nvme-controller-reset-selected-write` adds the userspace controller reset
  (§5): the `DeviceMmio` grant now carries a reset-only NVMe controller-register
  selected-write claim scoped to `CC`, and the provider drives the
  firmware-enabled controller to a known reset state (`CC.EN=0` → `CSTS.RDY=0`).
  This is the first genuine userspace NVMe controller-register *write*.
- `nvme-no-iommu-brokered-controller-enable` adds the brokered no-IOMMU enable
  (§6): the kernel authors `AQA`/`ASQ`/`ACQ` from the live DMA ledger and
  performs the `CC.EN` write, reaching `CSTS.RDY=1` without exposing a
  queue-base address.
- `nvme-admin-queue-identify` + `nvme-admin-interrupt-delivery` add the brokered
  admin SQ/CQ doorbell and one interrupt-driven `IDENTIFY` (§7).
- `nvme-io-queue-and-read` adds the brokered I/O queue pair and one bounded
  `READ` (§8) -- the last piece of the userspace storage-provider
  foundation.
- `cloud-prod-nvme-userspace-provider-readonly-bind-local-proof` ports the
  same-BDF read-only bind shape onto the **non-`qemu` cloudboot kernel** under
  the `cloud_nvme_readonly_bind_proof` Cargo feature. The feature constrains
  the three production grant sources (`devicemmio_grant_source_prod`,
  `dmapool_grant_source_prod`, `interrupt_grant_source_prod`) to the NVMe
  function (class `0x01` subclass `0x08`); the userspace
  `cloud-nvme-readonly-bind-smoke` provider receives a same-BDF
  `DeviceMmio`/`DMAPool`/`Interrupt` bundle, reads `CAP_LO`/`CAP_HI`/`VS`/`CC`/
  `CSTS` via brokered `DeviceMmio.read32`, releases the three caps, and
  asserts stale-handle rejection on each. Proof:
  `make run-cloud-provider-nvme-readonly-bind`. No `CC.EN` write, no admin
  or I/O queue, no `IDENTIFY`, no `Interrupt.wait`, no DMA, no live cloud.
- `cloud-prod-nvme-controller-reset-selected-write-local-proof` layers the
  reset-only `CC` selected-write authority on the same-BDF bundle under the
  `cloud_nvme_controller_reset_proof` Cargo feature (which implies
  `cloud_nvme_readonly_bind_proof`, so the picker constraints are
  inherited). The kernel admits exactly one brokered `DeviceMmio.write32`
  shape through `kernel::device_manager::stub::write_devicemmio_u32`: a
  `CC` write (offset `0x14`) whose `CC.EN` (bit 0) is cleared. A `CC`
  write that sets `CC.EN` fails closed with
  `devicemmio-nvme-cc-enable-deferred`, a write to any non-`CC` offset
  fails closed with `devicemmio-write32-register-unclaimed`, and an
  out-of-range or unaligned offset fails closed at the range validator,
  all before any volatile MMIO write touches the BAR. The userspace
  `cloud-nvme-controller-reset-smoke` provider receives the same-BDF
  bundle, reads `CAP_LO`/`CAP_HI`/`VS`/`CC`/`CSTS`, exercises the two
  fail-closed write probes (`CC.EN=1`, non-`CC` offset `0x18`), performs
  the admitted `CC.EN=0` reset write, polls `CSTS` until `CSTS.RDY=0`,
  re-reads `CC` to assert `CC.EN=0`, releases the three caps, and
  confirms stale-handle rejection. Proof:
  `make run-cloud-provider-nvme-controller-reset`. No `CC.EN=1` write, no
  admin or I/O queue, no `IDENTIFY`, no `Interrupt.wait`, no DMA, no live
  cloud.
- `cloud-prod-nvme-admin-queue-materialization-local-proof` materializes the
  admin SQ and admin CQ backing buffers on the same-BDF bundle under the
  `cloud_nvme_admin_queue_materialization_proof` Cargo feature (which
  implies `cloud_nvme_controller_reset_proof`, which implies
  `cloud_nvme_readonly_bind_proof`, so the picker constraints and the
  reset-only `CC.EN=0` claim are inherited). No new kernel admission
  surface is added: the production `device_manager::stub` already supports
  manager-owned bounce-buffer allocation through
  `stage_bounce_buffer_dmapool_record` +
  `issue_manager_attached_dmabuffer_handle_with_request` (a fresh
  zeroed-on-alloc kernel frame per buffer),
  scrub-before-frame-free on `detach_dmabuffer_record_for_cap_release`, and
  stale-handle rejection on the parked-slot ledger. The userspace
  `cloud-nvme-admin-queue-materialization-smoke` provider receives the
  same-BDF bundle, sequentially materializes the admin SQ backing buffer
  and the admin CQ backing buffer through the brokered
  `DMAPool.allocateBuffer` + `DMABuffer.{info,map,unmap,freeBuffer}` path
  (asserting `userspace_dma_buffer=manager-issued-bounce-buffer`,
  `iova_export=disabled-future-only`,
  `host_physical_user_visible=false`, and `device_iova=0` on each), writes
  and reads back a deterministic 256-byte template through the userspace
  VMA, asserts the freshly-allocated admin CQ frame reads back as zero
  before the write (scrub-before-reuse, paired with the admin SQ's
  scrub-before-frame-free), confirms post-free `DMABuffer.map` fail-closed
  on the stale handle, emits one
  `cloudboot-evidence: provider-nvme-admin-queue-materialization <token>`
  marker recording both manager-owned pool/buffer slot/generation
  identities and the discipline labels, releases the three bundle caps,
  and confirms stale-handle rejection on each. Proof:
  `make run-cloud-provider-nvme-admin-queue-materialization`. No NVMe
  controller register WRITE on this path (the kernel still admits the
  reset-only `CC.EN=0` claim from the controller-reset sibling, but this
  smoke never calls `DeviceMmio.write32`), no `AQA`/`ASQ`/`ACQ`
  publication, no `CC.EN=1`, no I/O queue allocation, no `IDENTIFY`, no
  PRP/SGL publication, no doorbell write, no `Interrupt.wait`/
  `Interrupt.acknowledge`, no host-physical or IOVA export, no live
  cloud.
- `cloud-prod-nvme-brokered-controller-enable-local-proof` enables the
  controller through manager-authored `AQA`/`ASQ`/`ACQ` plus a
  provider-supplied `CC.EN=1` write under the
  `cloud_nvme_controller_enable_proof` Cargo feature (which implies the
  three earlier features). The production `device_manager::stub` parked-pool
  slot holds two simultaneously-live bounce-buffer
  `DMABuffer`s (`PARKED_DMAPOOL_LIVE_BUFFER_CAPACITY = 2`,
  `PARKED_DMABUFFER_SLOTS = [1, 2]`) so the admin SQ and admin CQ can stay
  parked together; the bounce-buffer grant proof and the virtio-net
  live-publish proof keep their existing single-buffer behavior (slot 0).
  The provider-supplied `CC.EN=1` write of this path is **superseded by
  `cloud-prod-nvme-controller-enable-manager-op-remediation`** below, which
  makes raw `DeviceMmio.write32(CC, value with CC.EN=1)` fail closed before
  any MMIO side effect and exposes controller enable only through the
  no-parameter `DeviceMmio.brokeredNvmeControllerEnable` verb (schema `@6`).
  The parked-pool slot capacity, the `[1, 2]` slot ids, the `AQA` depth
  policy, and the four MMIO writes the manager authors all carry over
  unchanged.
- `cloud-prod-nvme-controller-enable-manager-op-remediation` corrects the
  brokered enable contract. Raw `DeviceMmio.write32(CC, value with CC.EN=1)`
  now fails closed with
  `authority_result=devicemmio-nvme-cc-enable-raw-blocked /
  authority_reason=cc-enable-requires-broker-nvme-controller-enable-op`
  before any volatile MMIO side effect. Controller enable is reachable only
  through the new no-parameter `DeviceMmio.brokeredNvmeControllerEnable`
  verb (schema `@6`), which carries no `offset`, `value`, queue address,
  queue id, PRP/SGL, or provider-selected controller-bit parameter. The
  verb routes to the renamed manager-authored
  `nvme_brokered_controller_enable_op_for_cap` in
  `kernel/src/device_manager/stub.rs`. The manager: (1) validates the
  cap's BAR matches the parked region and covers the
  `CC`/`AQA`/`ASQ`/`ACQ` register span; (2) resolves the two parked admin
  queue DMABuffers (slot order: SQ then CQ) and requires both to be live,
  unmapped, and frame-aligned; (3) selects every controller bit internally
  -- `CC.EN | IOSQES=6 | IOCQES=4` (NVMe Base Spec §3.1.5); (4) authors
  `AQA = ((depth-1)<<16) | (depth-1)` with depth `8`, `ASQ` low/high from
  the admin SQ buffer's phys address, `ACQ` low/high from the admin CQ
  buffer's phys address through the boot-preseeded BAR0 kernel mapping;
  and (5) performs the manager-selected `CC.EN=1` write. The provider
  supplies no parameters and never observes a host-physical /
  device-visible queue-base address. The cap dispatch admission carries
  `authority_result=ok`, `register_write=performed`,
  `side_effect=mmio-write-performed`, `cc_en_write_performed=true`,
  `aqa_authored=true`, `asq_authored=true`, `acq_authored=true`, and
  `queue_base_source=manager-ledger`. The kernel diagnostic line is now
  `nvme: brokered-enable owner=cloud-nvme model=cloud-bounce validator=none
  trigger=manager-op admin_sq_slot=1 admin_cq_slot=2 aqa=0x00070007
  cc=0x... asq_authored=true acq_authored=true cc_en_write=performed
  cc_bits_selected_by=manager queue_base_source=manager-ledger
  host_physical_user_visible=false proof_result=ok`; `trigger=manager-op`
  proves the admission entered through `@6`, not through a raw `CC`
  `write32`. The `cloud-nvme-controller-enable-smoke` provider proves both
  the new fail-closed raw `CC.EN=1` probe and the manager-op enable, and
  the headline `cloudboot-evidence: provider-nvme-controller-enable
  <token>` marker pins `brokered_enable_trigger=manager-op` and
  `cc_raw_enable_write=refused`. Proof:
  `make run-cloud-provider-nvme-controller-enable`. No `IDENTIFY`, admin
  or I/O queue command, PRP/SGL publication, doorbell write,
  `Interrupt.wait`/`Interrupt.acknowledge`, host-physical or IOVA export,
  or live cloud is claimed.
- `cloud-prod-nvme-admin-identify-manager-op-local-proof` extends the
  corrected controller-enable surface with one explicit manager-owned
  admin-command operation: `DeviceMmio.brokeredNvmeAdminIdentify` (schema
  `@7`). The verb carries no parameters; the cap holder may not supply
  queue addresses, opcode, command id, NSID, PRP/SGL entries,
  data-buffer address, doorbell offset, or doorbell value. The production
  `device_manager::stub` parked-pool slot capacity was extended from two
  to three simultaneously-live bounce-buffer DMABuffers
  (`PARKED_DMAPOOL_LIVE_BUFFER_CAPACITY = 3`,
  `PARKED_DMABUFFER_SLOTS = [1, 2, 3]`) so the admin SQ (slot 1), admin
  CQ (slot 2), and IDENTIFY data page (slot 3) can stay parked together;
  the controller-enable sibling, the bounce-buffer grant proof, and the
  virtio-net live-publish proof keep their existing single- or
  dual-buffer behavior unchanged. The production grant source's
  kernel-mapped BAR window was correspondingly widened from one to two
  pages (`MAPPED_WINDOW_BYTES = 0x2000` under
  `cloud_nvme_admin_identify_proof`) so the admin SQ tail (`0x1000`) and
  admin CQ head (`0x1004`) doorbells fall inside the boot-preseeded
  mapping the manager already uses for `CC`/`AQA`/`ASQ`/`ACQ` -- raw
  `write32` to either doorbell offset still fails closed at the
  device-manager boundary as `devicemmio-write32-register-unclaimed`
  (the offset is outside the reset-only `CC` selected-write claim), and
  the brokered admin IDENTIFY verb is the only path that may ring them.
  The handler `nvme_brokered_admin_identify_op_for_cap` in
  `kernel/src/device_manager/stub.rs`: (1) validates the cap's BAR
  matches the parked region and covers both doorbell offsets and the
  `CSTS` register; (2) resolves the three parked admin DMABuffers and
  requires all three to be live, unmapped, and frame-aligned; (3)
  re-reads `CSTS` through the boot-preseeded BAR mapping and refuses if
  `CSTS.RDY=0`; (4) authors the full submission queue entry at admin SQ
  index 0 through the HHDM kernel mapping of the SQ page -- opcode
  `IDENTIFY` (`0x06`, NVMe Base Spec §5.17), command id 1, NSID 0, MPTR
  0, PRP1 = data-page physical address (sourced from the manager's
  parked-pool ledger), PRP2 0, CDW10 CNS `0x01` (Controller); (5) issues
  a SeqCst fence and rings the admin SQ tail doorbell at BAR0 offset
  `0x1000`; (6) polls the admin CQ entry at index 0 through the HHDM
  kernel mapping of the CQ page for the phase-bit flip (NVMe Base Spec
  §4.6 CQE DW3 bit 16); (7) inspects the CQE status field (bits 30:17 of
  DW3) and command-id echo, refusing on either mismatch; (8) parses
  IDENTIFY Controller VID (offset 0, 2 bytes) and SSVID (offset 2, 2
  bytes) through the HHDM kernel mapping of the data page; (9) advances
  the admin CQ head doorbell at BAR0 offset `0x1004`. The provider sees
  only bounded-status labels, the manager-selected CNS/opcode/command-id
  echoes, the three parked-slot identities, the parsed VID/SSVID, and
  the doorbell side-effect labels. The cap-side dispatch admission
  carries `authority_result=ok`, `result=ok`,
  `register_write=performed`, `side_effect=mmio-write-performed`,
  `sq_doorbell_written=true`, `cq_doorbell_written=true`,
  `completion_consumed=true`, `cq_status=0x0000`,
  `prp_source=manager-ledger`, and `host_physical_user_visible=false`.
  The kernel diagnostic is `nvme: brokered-admin-identify
  owner=cloud-nvme model=cloud-bounce trigger=manager-op admin_sq_slot=1
  admin_cq_slot=2 admin_data_slot=3 cns=0x01 opcode=0x06
  command_id=0x0001 ... cqe_status=0x0000 cqe_command_id=0x0001
  sq_tail=1 cq_head=1 cq_phase=1 identify_vid=0x1b36
  identify_ssvid=0x1af4 sq_doorbell_written=performed
  cq_doorbell_written=performed completion_consumed=true
  prp_source=manager-ledger host_physical_user_visible=false
  proof_result=ok` (QEMU's `nvme` device reports PCI VID `0x1b36` and
  SSVID `0x1af4`, which the harness pins). The
  `cloud-nvme-admin-identify-smoke` provider exercises the inherited
  fail-closed raw-write claims (six in total: AQA/ASQ/ACQ + raw
  `CC.EN=1` + raw admin SQ tail/CQ head doorbells), invokes the
  controller-enable verb at `@6`, invokes the admin IDENTIFY verb at
  `@7`, and emits one headline `cloudboot-evidence:
  provider-nvme-admin-identify <token>` marker plus three supplementary
  `[cloud-nvme-admin-identify-smoke] discipline-*` lines that
  re-anchor the contract within the per-call `Console.writeLine` bound.
  Proof: `make run-cloud-provider-nvme-admin-identify`. Future work (not yet implemented): I/O queue creation, `READ`/`WRITE`,
  `Interrupt.wait`/`Interrupt.acknowledge` admin-completion handoff,
  device-autonomous MSI-X delivery, host-physical/IOVA export,
  provider-authored SQE/PRP/SGL bytes, provider-authored doorbell
  offsets/values, and live cloud traffic.
- `cloud-prod-nvme-admin-completion-wait-ack-local-proof` moves the admin
  IDENTIFY completion handoff off manager-internal CQ polling onto the
  production `Interrupt.wait` / `Interrupt.acknowledge` path. The
  admission-check-only production `Interrupt` grant
  (`interrupt_grant_source_prod`, `wait/acknowledge=admission-check-only`) is
  replaced by a fully-programmed cap-waiter MSI-X route on the **same** NVMe BDF
  (`kernel/src/cap/nvme_admin_completion_wait_ack_proof.rs`, table entry 0): its
  `init` registers + claims the route under `ManagerGrantSource`, programs the
  MSI-X table entry mask-first with the kernel-authored
  `(message_address, message_data)`, attaches it to the device manager, arms the
  deferred-LAPIC-EOI gate, and unmasks. The admin IDENTIFY is split into two
  manager-owned verbs: `DeviceMmio.brokeredNvmeAdminSubmit` (schema `@8`) authors
  the SQE and rings the admin SQ tail doorbell (no CQ consumed,
  `completion_consumed=false`), and `DeviceMmio.brokeredNvmeAdminComplete`
  (schema `@9`) polls/consumes the admin CQ (the manager-owned CQ status/CID
  check is preserved), parses VID/SSVID, and advances the admin CQ head doorbell.
  Both reuse the shared `NvmeBrokeredAdminOpResult` schema struct. The handoff
  state machine is ordered and one-shot: `brokeredNvmeAdminSubmit` (@8) records
  the exact live admin SQ/CQ/data slots and generations; `Interrupt.wait` is
  admitted once for that submitted state, revalidates those live DMA records, and
  consumes the wait phase; `brokeredNvmeAdminComplete` (@9) is admitted only
  after the wait phase; and `Interrupt.acknowledge` is admitted once only after
  the completion phase has been recorded. Hostile complete-before-wait,
  ack-before-complete, repeat wait, repeat complete, repeat ack, and
  submit-then-`DMABuffer.freeBuffer` attempts fail closed before injecting extra
  dispatch, retiring an extra EOI, or freeing/scrubbing the manager-owned admin
  pages. Between the two verbs the provider calls `Interrupt.wait` -- which
  injects exactly one bounded, non-autonomous
  `device_interrupt::handle_lapic_delivery` dispatch on the bound route
  (`result=nvme-admin-completion-wait-ack-dispatch-consumed`,
  `real_interrupt_delivery=kernel-injected-dispatch`, delivery count `+1`, one
  deferred LAPIC EOI armed) -- then after the completion verb calls
  `Interrupt.acknowledge` to retire exactly that deferred EOI
  (`hardware_dispatch_ack_delta=1`). The chain is: read-only bind -> reset-only
  `CC.EN=0` -> manager-owned admin buffer materialization ->
  `brokeredNvmeControllerEnable` -> `brokeredNvmeAdminSubmit` (@8) -> admin SQ
  tail doorbell -> `Interrupt.wait` wake -> `brokeredNvmeAdminComplete` (@9) ->
  admin CQ completion consumed -> admin CQ head doorbell advanced ->
  `Interrupt.acknowledge` deferred EOI retired. On `Interrupt` cap release the
  kernel requires exactly one observed dispatch, exactly one observed ack, and
  the terminal acked handoff state, then runs the masked-no-wake + reassign +
  stale-handle/stale-token assertion chain and emits exactly one headline
  `cloudboot-evidence:
  provider-nvme-admin-completion-wait-ack <token>` marker labeled
  `admin_completion_wake=provider-cap-side-injected
  device_autonomous_raise=not-claimed`. The wake is the same bounded
  kernel-injected cap-waiter model as `make run-cloud-provider-cap-waiter`; this
  proof does **not** claim a device-autonomously-raised NVMe MSI-X completion
  interrupt. Proof: `make run-cloud-provider-nvme-admin-completion-wait-ack`.
  Future work (not yet implemented): I/O queue creation, `READ`/`WRITE`,
  `BlockDevice`/filesystem integration, device-autonomous MSI-X delivery,
  host-physical/IOVA export, and live cloud traffic.
- `cloud-prod-nvme-io-queue-create-local-proof` adds the single I/O queue pair
  (queue id 1) on top of the admin chain. After the combined poll-based admin
  IDENTIFY (`DeviceMmio.brokeredNvmeAdminIdentify @7`, VID `0x1b36`), the manager
  authors the two queue-establishing admin commands behind parameterless
  per-command verbs: `DeviceMmio.brokeredNvmeCreateIoCqSubmit` (schema `@10`,
  opcode `0x05`, CDW10 = queue id 1 | (queue-size-1)<<16, CDW11 PC=1 IEN=0, PRP1 =
  manager-owned I/O CQ base page) and `DeviceMmio.brokeredNvmeCreateIoSqSubmit`
  (schema `@11`, opcode `0x01`, CDW10 = queue id 1 | (queue-size-1)<<16, CDW11 =
  CQ id 1 | PC<<16, PRP1 = manager-owned I/O SQ base page). The opcode/CDWs are
  manager-selected; the provider supplies nothing (widening `@8` with a
  command-selector parameter was rejected because it would let a provider author
  arbitrary admin opcodes). Each SUBMIT verb authors the SQE at the next admin SQ
  index, rings the admin SQ tail doorbell, and records the in-flight create; the
  completion of each is consumed through the shared
  `DeviceMmio.brokeredNvmeAdminComplete` (`@9`, now command-aware: it reads the
  admin CQ entry at the recorded index, checks status/CID, and advances the CQ
  head doorbell) after one provider-cap-side `Interrupt.wait`, and the deferred
  LAPIC EOI is retired by one `Interrupt.acknowledge`. The cap-waiter route
  (`kernel/src/cap/nvme_io_queue_create_proof.rs`) drives **two** bounded
  kernel-injected dispatch + deferred-EOI cycles -- one per create -- and its
  ordered handoff enforces CREATE I/O CQ before CREATE I/O SQ, one create at a
  time, with submit-then-`DMABuffer.freeBuffer`, repeat-wait, and
  ack-before-complete attempts failing closed. The I/O CQ/SQ base pages are
  manager-owned brokered bounce buffers (parked-pool slots 3/4, userspace slots
  4/5); their PRP1 is never exported. On `Interrupt` cap release the kernel
  requires both creates completed (CQE status 0), exactly two observed
  dispatches, two observed acks, the idle terminal handoff, and the
  masked-no-wake + reassign + stale-handle chain, then emits one
  `cloudboot-evidence: provider-nvme-io-queue-create <token>` marker labeled
  `io_queue_create_wake=provider-cap-side-injected
  device_autonomous_raise=not-claimed io_command=create-only
  io_read=not-attempted io_sq_doorbell=not-attempted`. Proof: `make
  run-cloud-provider-nvme-io-queue-create`. Future work (not yet implemented): the I/O SQ tail doorbell (`0x1008`), `READ`/`WRITE`, the I/O data
  page, the I/O-completion route, `BlockDevice`/filesystem integration,
  device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud
  traffic.
- `cloud-prod-nvme-io-read-local-proof` adds one bounded I/O `READ` (LBA 0, 1
  block, NSID 1) on top of the live I/O queue pair. After the two CREATE I/O
  queue commands, the manager authors the entire READ SQE behind two
  parameterless per-command verbs: `DeviceMmio.brokeredNvmeIoReadSubmit` (schema
  `@12`) writes CDW0 (opcode `0x02` | command-id<<16), NSID 1, MPTR 0, PRP1 =
  manager-owned I/O read-data page (parked-pool slot 5), PRP2 0, SLBA 0
  (CDW10/CDW11), NLB 0 = "1 block" (CDW12 bits 15:0) at I/O SQ index 0 and rings
  the I/O SQ tail doorbell (`0x1008`); `DeviceMmio.brokeredNvmeIoReadComplete`
  (schema `@13`) polls the I/O CQ entry at index 0 for the phase flip, checks
  status/CID, advances the I/O CQ head doorbell (`0x100c`), reads the first bytes
  of the read-data page through the kernel mapping, and surfaces a bounded
  read-data digest (`readDataDigestLo`/`readDataDigestHi` = first 8 bytes,
  `readDataLen` = transferred length). The provider supplies no opcode/LBA/PRP/
  doorbell (a provider `write32(0x1008, ...)` path was rejected because it would
  break the no-provider-authored-command discipline; reusing the create/admin
  verbs was rejected because they are hardwired to the admin SQ/CQ doorbells and
  ledger). The cap-waiter route (`kernel/src/cap/nvme_io_read_proof.rs`) drives
  **three** bounded kernel-injected dispatch + deferred-EOI cycles -- two creates
  plus one read -- with submit-then-`DMABuffer.freeBuffer`, repeat-wait, and
  ack-before-complete attempts failing closed; the read-data page is a
  manager-owned brokered bounce buffer (parked-pool slot 5, userspace slot 6)
  whose PRP1 is never exported, and the manager reads the completed block bytes
  only through the kernel mapping. On `Interrupt` cap release the kernel requires
  both creates and the read completed (CQE status 0), the verified block bytes
  (`readDataLen > 0` and a non-zero digest), exactly three observed dispatches,
  three observed acks, the idle terminal handoffs, and the masked-no-wake +
  reassign + stale-handle chain, then emits one
  `cloudboot-evidence: provider-nvme-io-read <token>` marker labeled
  `io_read_wake=provider-cap-side-injected device_autonomous_raise=not-claimed
  io_command=read io_read=completed io_sq_doorbell=performed
  io_cq_completion=polled-io-cq` plus `io_read_block_bytes=<digest>
  read_data_len=512`. The local QEMU smoke seeds the NVMe backing file's first
  sector with a known 16-byte pattern so the digest proves an actual byte
  transfer, not merely that a CQE arrived. Proof: `make
  run-cloud-provider-nvme-io-read`. The same marker shape also passed live on
  GCE run `1780806087-bf69` (`make
  cloudboot-gcp-storage-nvme-io-read-test`) against a Persistent Disk NVMe
  controller with `vendor.1ae0`, `device.001f`, `live_cloud=gce-persistent-disk`,
  and a 512-byte READ digest prefix
  `eb3c904c494d494e4520200002000000`. Future work (not yet implemented): a
  dedicated I/O-completion `Interrupt` route on live cloud, `WRITE`,
  multi-block/second-LBA reads, `BlockDevice`/filesystem integration,
  device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud
  coverage beyond this one GCE PD read.
- `cloud-prod-nvme-io-write-local-proof` adds one bounded I/O `WRITE` (LBA 0, 1
  block, NSID 1) of a fixed manager pattern on top of the live I/O queue pair,
  proven durable by reading it back. After the two CREATE I/O queue commands, the
  manager authors the entire WRITE SQE behind two parameterless per-command verbs:
  `DeviceMmio.brokeredNvmeIoWriteSubmit` (schema `@14`) pre-fills the manager-owned
  I/O write-data page (parked-pool slot 6, userspace slot 7) with the fixed 16-byte
  signature `facefeedcafebabe1122334455667788` repeated across the block, then
  writes CDW0 (opcode `0x01` | command-id<<16), NSID 1, MPTR 0, PRP1 = that page,
  PRP2 0, SLBA 0, NLB 0 = "1 block" at I/O SQ index 0 and rings the I/O SQ tail
  doorbell (`0x1008`); `DeviceMmio.brokeredNvmeIoWriteComplete` (schema `@15`) polls
  the I/O CQ entry at index 0 for the phase flip, checks status/CID, advances the
  I/O CQ head doorbell (`0x100c`), reads the first bytes of the write-data page
  through the kernel mapping, and surfaces a bounded written-pattern digest (carried
  in the shared `readDataDigestLo`/`readDataDigestHi`/`readDataLen` fields). The
  landed I/O `READ` (`@12`/`@13`) is then reused unchanged to read LBA 0 back into
  the read-data page (slot 5) at the next I/O SQ/CQ index (1, since the WRITE
  consumed index 0); the provider supplies no opcode/LBA/PRP/pattern/doorbell, and
  a new schema field was deliberately avoided -- the durability match is computed
  kernel-side by comparing the written-pattern digest with the read-back digest.
  The cap-waiter route (`kernel/src/cap/nvme_io_write_proof.rs`) drives **four**
  bounded kernel-injected dispatch + deferred-EOI cycles -- two creates, one write,
  one read-back -- with submit-then-`DMABuffer.freeBuffer`, repeat-wait, and
  ack-before-complete attempts failing closed; the write-data and read-data pages
  are manager-owned brokered bounce buffers whose PRP1 is never exported. On
  `Interrupt` cap release the kernel requires both creates, the write, and the
  read-back completed (CQE status 0), non-zero digests, exactly four observed
  dispatches and acks, the idle terminal handoffs, the masked-no-wake + reassign +
  stale-handle chain, and the read-back digest matching the written pattern, then
  emits one `cloudboot-evidence: provider-nvme-io-write <token>` marker labeled
  `io_command=write io_write=completed io_sq_doorbell=performed
  io_cq_completion=polled-io-cq write_pattern=<digest> write_readback_match=true`.
  The local QEMU smoke seeds the backing file's first sector with a distinct
  sentinel so the read-back of the manager pattern proves the WRITE transferred the
  bytes. Proof: `make run-cloud-provider-nvme-io-write`. Future work (not yet implemented): a dedicated I/O-completion `Interrupt` route distinct from the
  admin/create/read route, multi-block/second-LBA/second-NSID I/O, flush/FUA/DSM,
  `BlockDevice`/filesystem integration, device-autonomous MSI-X delivery,
  host-physical/IOVA export, and live cloud traffic.
- `cloud-prod-nvme-io-second-lba-local-proof` generalizes the manager-authored
  data path beyond the hardwired `SLBA 0`: it proves LBA addressing actually
  selects the block by driving three sequential I/O commands on the live queue
  pair through two new parameterless verbs
  (`DeviceMmio.brokeredNvmeIoSecondLbaSubmit @16` /
  `brokeredNvmeIoSecondLbaComplete @17`), selected by a kernel-owned phase
  counter: phase 0 reads LBA 0 (the distinctness baseline, read-data slot 5),
  phase 1 pre-fills the write-data page (slot 6) with a fixed LBA-1-distinct
  16-byte pattern `0123456789abcdeffedcba9876543210` and writes it to **LBA 1**
  (opcode `0x01`, CDW10 = SLBA low = 1, CDW11 = SLBA high = 0), and phase 2 reads
  **LBA 1** back. Because the landed `@12`-`@15` verbs hardwire SLBA 0 (and their
  I/O index depends on the io-write feature), this proof implies only
  `cloud_nvme_io_queue_create_proof` and authors its own SQEs at I/O SQ indices
  0/1/2; the provider supplies no opcode/LBA/PRP/pattern/doorbell. No schema
  field was added -- the LBA-1 read-back match (LBA-1 read digest == LBA-1 write
  pattern) and the LBA distinctness (LBA-1 read digest != LBA-0 read digest) are
  computed kernel-side across the three recorded phase digests. The cap-waiter
  route (`kernel/src/cap/nvme_io_second_lba_proof.rs`) drives **five** bounded
  kernel-injected dispatch + deferred-EOI cycles (two creates + three I/O phases),
  with submit-then-`DMABuffer.freeBuffer`, repeat-wait, and ack-before-complete
  attempts failing closed. On `Interrupt` cap release the kernel requires both
  creates and all three phases completed (CQE status 0), non-zero digests, five
  observed dispatches and acks, the masked-no-wake + reassign + stale-handle
  chain, `second_lba_readback_match=true`, and `lba_distinct_from_zero=true`, then
  emits one `cloudboot-evidence: provider-nvme-io-second-lba <token>` marker
  labeled `io_command=second-lba io_second_lba=1 second_lba_readback_match=true
  lba_distinct_from_zero=true io_sq_doorbell=performed
  io_cq_completion=polled-io-cq`. The local QEMU smoke seeds the backing file's
  first sector with the distinct sentinel `deadbeefcafebabe0102030405060708` so the
  LBA-0 read returns content distinct from the LBA-1 pattern. Proof:
  `make run-cloud-provider-nvme-io-second-lba`. Future work (not yet implemented): a dedicated I/O-completion `Interrupt` route, multi-block (NLB > 0) I/O,
  a third LBA or second namespace, flush/FUA/DSM, `BlockDevice`/filesystem
  integration (now unblocked by this LBA parameterization), device-autonomous
  MSI-X delivery, host-physical/IOVA export, and live cloud traffic.
- `cloud-prod-nvme-io-multiblock-local-proof` generalizes the manager-authored
  data path beyond one logical block: it proves the authored SQE drives a
  transfer larger than a single block by driving two sequential I/O commands on
  the live queue pair through two new parameterless verbs
  (`DeviceMmio.brokeredNvmeIoMultiblockSubmit @18` /
  `brokeredNvmeIoMultiblockComplete @19`), selected by a kernel-owned phase
  counter: phase 0 pre-fills the write-data page (slot 6) with two distinct
  16-byte patterns -- block 0 = `112233445566778899aabbccddeeff00`, block 1 =
  `f0e1d2c3b4a5968778695a4b3c2d1e0f` -- over 1024 B and writes both blocks to
  **LBA 2** (opcode `0x01`, NLB = block_count - 1 = 1, CDW10 = 2; PRP1 = slot 6,
  PRP2 = 0, since the 1024 B transfer fits one 4 KiB page), and phase 1 reads
  **LBA 2** back into the read-data page (slot 5). Because the landed `@12`-`@17`
  verbs hardwire a single block (`block_count = 1`), this proof implies only
  `cloud_nvme_io_queue_create_proof` and authors its own SQEs at I/O SQ indices
  0/1; the provider supplies no opcode/LBA/count/PRP/pattern/doorbell. No schema
  field was added for the second block's digest -- the existing
  `readDataDigestLo`/`Hi` carry block 0's first 8 bytes for userspace, while the
  per-block match (read block 0 == written pattern-0, read block 1 == written
  pattern-1) and the block-distinctness (block 0 digest != block 1 digest,
  proving the second 512 B block actually transferred) are computed kernel-side
  across the two recorded phase digest pairs and attested in the headline marker.
  The cap-waiter route (`kernel/src/cap/nvme_io_multiblock_proof.rs`) drives
  **four** bounded kernel-injected dispatch + deferred-EOI cycles (two creates +
  two I/O phases), with submit-then-`DMABuffer.freeBuffer`, repeat-wait, and
  ack-before-complete attempts failing closed. On `Interrupt` cap release the
  kernel requires both creates and both phases completed (CQE status 0), non-zero
  digests, four observed dispatches and acks, the masked-no-wake + reassign +
  stale-handle chain, `multiblock_block0_match=true`,
  `multiblock_block1_match=true`, and `multiblock_blocks_distinct=true`, then
  emits one `cloudboot-evidence: provider-nvme-io-multiblock <token>` marker
  labeled `io_command=multiblock io_slba=2 io_nlb=1 io_block_count=2
  prp2_zeroed=true multiblock_block0_match=true multiblock_block1_match=true
  io_sq_doorbell=performed io_cq_completion=polled-io-cq`. Proof:
  `make run-cloud-provider-nvme-io-multiblock`. Future work (not yet implemented): a dedicated I/O-completion `Interrupt` route, NLB > 1 requiring a PRP
  list / second mapped page, a third LBA or second namespace, flush/FUA/DSM,
  wrapping the brokered READ/WRITE behind a userspace-served `BlockDevice` cap
  (now has both LBA selection and >1-block transfer as prerequisites),
  device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud
  traffic.
- `cloud-prod-nvme-io-synchronous-poll-read-local-proof` collapses the four-call
  submit/wait/complete/ack NVMe I/O lifecycle into ONE synchronous
  `CapObject::call` -- the shape `BlockDevice.readBlocks @0` requires. It adds two
  parameterless single-call verbs (`DeviceMmio.brokeredNvmeIoSyncWrite @20` /
  `brokeredNvmeIoSyncRead @21`), each mirroring the combined
  `brokeredNvmeAdminIdentify @7`: the manager pre-fills the write-data page (slot
  6) with `112233445566778899aabbccddeeff00` (block 0) and
  `f0e1d2c3b4a5968778695a4b3c2d1e0f` (block 1), authors the SQE (WRITE opcode
  `0x01` at I/O SQ index 0 / READ opcode `0x02` at index 1; NSID 1, SLBA 2, NLB =
  1 / two 512 B blocks, PRP1 = data page, PRP2 = 0), rings the I/O SQ tail
  doorbell (`0x1008`), polls the I/O CQ entry phase bit to completion within a
  bounded budget, advances the I/O CQ head doorbell (`0x100c`), and reads block
  0/block 1 back -- all inside one cap call, with **no `Interrupt.wait`** on the
  I/O data path. The two CREATE I/O queue commands still complete through the
  cap-waiter `Interrupt.wait`/`acknowledge` path, so the route
  (`kernel/src/cap/nvme_io_sync_read_proof.rs`) drives only **two** bounded
  kernel-injected dispatch + deferred-EOI cycles. The single-call verbs report
  `sqDoorbellWritten`, `cqDoorbellWritten`, and `completionConsumed` all true in
  one result; the block-1 match and block-distinctness are computed kernel-side
  across the recorded WRITE/READ digest pairs. On `Interrupt` cap release the
  kernel requires both creates and both single-call I/O commands completed (CQE
  status 0), non-zero digests, two observed dispatches and acks, the
  masked-no-wake + reassign + stale-handle chain, `sync_block0_match=true`,
  `sync_block1_match=true`, and `sync_blocks_distinct=true`, then emits one
  `cloudboot-evidence: provider-nvme-io-sync-read <token>` marker labeled
  `io_command=sync-read io_slba=2 io_nlb=1 io_block_count=2 prp2_zeroed=true
  sync_block0_match=true sync_block1_match=true sync_blocks_distinct=true
  io_sq_doorbell=performed io_cq_completion=polled-io-cq-single-call
  interrupt_wait=not-used`. Proof: `make run-cloud-provider-nvme-io-sync-read`.
  This closes concern (c) of the `BlockDevice`-shaped read gap (lifecycle
  collapse) without touching `BlockDeviceCap`/`crate::virtio` (concern a), the
  manager-op routing into a generic cap (concern b), a dedicated I/O-completion
  `Interrupt` route, or a PRP list (NLB > 1). Future work (not yet implemented): introducing an NVMe-backed `BlockDeviceCap` whose `readBlocks @0` arm
  calls this single-call op, a `readonly_fs`-style consumer over it,
  device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud
  traffic.
- `cloud-prod-nvme-io-sync-read-block-bytes-local-proof` surfaces the full
  read-back **bytes**. It adds one verb,
  `DeviceMmio.brokeredNvmeIoSyncReadBytes @22 () ->
  NvmeBrokeredAdminOpReadBytesResult`, that reuses the landed single-call
  poll-read body (`nvme_brokered_io_sync_command`) unchanged but returns the
  entire 1024 B read-back (block 0 ‖ block 1), read through the kernel mapping,
  as the inline `data :Data` field of a new narrow result struct -- the
  full-bytes shape `BlockDevice.readBlocks @0 -> (data :Data)` requires --
  instead of folding it to an 8-byte digest. The provider issues `@20` (WRITE)
  then `@22` (READ-bytes) as two synchronous cap calls and compares the returned
  `data` byte-for-byte to the reconstructed manager-authored page, asserting the
  two 512 B halves differ; the kernel still attests per-block match and
  distinctness in the release marker. No host-physical/IOVA address crosses the
  boundary -- only the content bytes the caller already authored. The cap-waiter
  route (`kernel/src/cap/nvme_io_sync_read_bytes_proof.rs`) is a clone of the
  sync-read proof that emits one
  `cloudboot-evidence: provider-nvme-io-sync-read-bytes <token>` marker labeled
  `io_command=sync-read-bytes io_slba=2 io_nlb=1 io_block_count=2
  read_data_len=1024 data_return=inline-bytes data_block0_match=true
  data_block1_match=true data_blocks_distinct=true
  io_cq_completion=polled-io-cq-single-call interrupt_wait=not-used`. Proof:
  `make run-cloud-provider-nvme-io-sync-read-bytes`. This closes concern (b) of
  the `BlockDevice`-shaped read gap (full-bytes return) without touching
  `BlockDeviceCap`/`crate::virtio` (concern a), arbitrary `(startLba, count)`
  parameterization, a dedicated I/O-completion `Interrupt` route, or a PRP list
  (NLB > 1). Future work (not yet implemented): an NVMe-backed
  `BlockDeviceCap` backend enum whose `readBlocks @0` arm calls this op,
  arbitrary-LBA routing, device-autonomous MSI-X delivery, host-physical/IOVA
  export, and live cloud traffic.
- `cloud-prod-nvme-blockdevice-fixed-lba-read-arm-local-proof` makes the NVMe
  namespace consumable through the SAME `BlockDevice.readBlocks @0` interface a
  filesystem consumer calls (proof-fixed LBA arm). It replaces `BlockDeviceCap`'s
  bare `device_index: usize` with a `BlockDeviceBackend` enum
  (`kernel/src/cap/block_device.rs`): the always-built `Virtio { device_index }`
  arm (behavior-identical to today, verified by `make run-storage-fs`) and, under
  `cloud_nvme_blockdevice_read_proof`, an `NvmeBrokered { handle, owner }` arm.
  The `NvmeBrokered` arm's `readBlocks @0` accepts ONLY `startLba == 2 && count ==
  2` (fails closed with `BlockDevice.readBlocks NVMe arm fixed to SLBA 2 NLB 1` on
  any other window) and drives the landed `nvme_brokered_io_sync_read_bytes_op_for_cap`
  (`@22`) body into a local 1024 B buffer surfaced as inline `Data` through the
  same `read_blocks_results` builder the virtio arm uses; `writeBlocks`/`flush`
  fail closed (read-only namespace) and `info` returns the fixed geometry. A new
  bootstrap grant arm mints the `NvmeBrokered` cap bound to the SAME live
  `device_mmio` handle/owner the production grant source staged
  (`devicemmio_grant_source_prod::live_handle_for_nvme_blockdevice`), so the
  `device_mmio` grant must precede the `block_device` grant in the manifest cap
  list. The provider drives the full bring-up (reset → enable → IDENTIFY @7 →
  CREATE I/O CQ @10 → CREATE I/O SQ @11 → `@20` WRITE) and issues the TERMINAL read
  through `BlockDevice.readBlocks(2, 2)` instead of a raw `DeviceMmio @22`. The
  cap-waiter route (`kernel/src/cap/nvme_blockdevice_read_proof.rs`, a clone of the
  sync-read-bytes proof) emits one
  `cloudboot-evidence: provider-nvme-blockdevice-read <token>` marker labeled
  `read_path=blockdevice-readblocks read_iface=BlockDevice read_method=0 io_slba=2
  io_nlb=1 io_block_count=2 read_data_len=1024 data_return=inline-bytes
  nvme_arm_fixed_lba=true arbitrary_lba=not-supported`. Proof:
  `make run-cloud-provider-nvme-blockdevice-read`. This closes concern (a) of the
  `BlockDevice`-shaped read gap (the same schema method, not the bespoke `@22`
  verb), restricted to the proof-fixed window. Future work (not yet implemented): arbitrary `(startLba, count)` parameterization, NVMe write/flush
  durability through `BlockDevice`, a `readonly_fs`-style filesystem mounted over
  the NVMe `BlockDevice` cap, a dedicated I/O-completion `Interrupt` route, NLB > 1
  with a PRP list, and graduating the NVMe data plane out of the per-proof feature
  into always-built production.
- `cloud-prod-nvme-blockdevice-arbitrary-lba-read-local-proof` widens the
  `NvmeBrokered` arm off the hardwired SLBA 2 / NLB 1: `readBlocks @0` now honors an
  ARBITRARY `(startLba, count)` window (read-only, bounded to one PRP1 page). The
  shared single-call body `nvme_brokered_io_sync_command`
  (`kernel/src/device_manager/stub.rs`) gains explicit `slba`/`block_count` fields on
  `SyncIoParams` and authors CDW10/CDW11 (SLBA) and CDW12 (NLB) from them instead of
  the module constants; the transfer length is `block_count * 512`, bounded fail-closed
  to one 4 KiB PRP1 page (`block_count <= 8`) so PRP2 stays 0. The existing
  `@20`/`@21`/`@22` callers keep passing the proof-fixed SLBA 2 / count 2, so their
  behavior is byte-identical (regression: `make run-cloud-provider-nvme-io-sync-read-bytes`).
  A new parameterized op `nvme_brokered_io_sync_read_window_op_for_cap(handle, owner,
  slba, count, out_data)` rotates the I/O SQ/CQ index off the kernel-side read sequence
  (window 0 at index 1 after the WRITE at index 0, window 1 at index 2) so each
  completion is polled at the CQ slot the controller actually writes. Under
  `cloud_nvme_blockdevice_arbitrary_lba_proof` (implies and supersedes
  `cloud_nvme_blockdevice_read_proof`), `BlockDeviceCap::nvme_read_blocks` admits any
  `1 <= count <= 8` window with `startLba + count <= namespace blocks` (the
  IDENTIFY-derived NSZE reported through `info @2` -- 16 MiB / 512 = 32768 on the
  QEMU fixture image; see the READ-arm graduation entry), and fails closed with
  distinct errors for `count == 0` / `count > 8` (`... count out of range (1..=8)`) and a
  window past the namespace end (`... window past namespace end`). The proof
  (`kernel/src/cap/nvme_blockdevice_arbitrary_lba_proof.rs`, a clone of the fixed-LBA
  module) drives the full bring-up plus the `@20` WRITE (seeding LBA 2 = pattern-0,
  LBA 3 = pattern-1), then issues TWO distinct `readBlocks` windows -- `readBlocks(0, 1)`
  (zero-filled LBA 0) and `readBlocks(3, 2)` (LBA 3 = pattern-1, LBA 4 = zero-filled) --
  comparing each returned `data` byte-for-byte to the manager-authored content and
  asserting the two windows return distinct content. It emits one
  `cloudboot-evidence: provider-nvme-blockdevice-arbitrary-lba-read <token>` marker
  labeled `arbitrary_lba=supported window0_slba=0 window0_count=1 window1_slba=3
  window1_count=2 windows_distinct=true prp_pages=single nvme_arm_fixed_lba=false`.
  Proof: `make run-cloud-provider-nvme-blockdevice-arbitrary-lba-read`. With this,
  the NVMe namespace is readable through
  `BlockDevice.readBlocks @0` at the LBA the consumer names. Future work (not yet
  implemented): NLB spanning more than one PRP1 page (`count > 8`) with a PRP
  list, NVMe write/flush durability through `BlockDevice`, a `readonly_fs`-style
  filesystem mounted over the NVMe `BlockDevice` cap, a dedicated I/O-completion
  `Interrupt` route, and graduating the NVMe data plane out of the per-proof feature
  into always-built production.
- `cloud-prod-nvme-blockdevice-writeblocks-durability-arm-local-proof` arms the
  `NvmeBrokered` arm's `writeBlocks @1` (read-only until now): it drives the brokered
  NVMe sync WRITE with the caller-supplied `(startLba, count, data)` and proves
  write-then-read-back durability. A new parameterized op
  `nvme_brokered_io_sync_write_window_op_for_cap(handle, owner, slba, count, in_data)`
  (`kernel/src/device_manager/stub.rs`) mirrors the arbitrary-window READ entry but
  rotates the I/O index off a kernel-side write sequence (`next_write_io_index()` =>
  index 0, before the read-back at index 1). The shared single-call body
  `nvme_brokered_io_sync_command` gains a third fill mode -- a `write_payload:
  Option<&[u8]>` that copies the caller's `count * 512` bytes into block 0..count of
  the manager-owned write-data page (slot 6) through the HHDM mapping before the WRITE
  SQE is authored -- beside the fixed `prefill_pattern` and the readonly_fs
  `seed_image` modes (both unchanged: regressions
  `make run-cloud-provider-nvme-blockdevice-arbitrary-lba-read` and
  `make run-storage-fs` stay green). Under `cloud_nvme_blockdevice_writeblocks_proof`
  (implies and supersedes `cloud_nvme_blockdevice_arbitrary_lba_proof`),
  `BlockDeviceCap::nvme_write_blocks` admits any `1 <= count <= 8` window with
  `startLba + count <= namespace blocks` and `data.len() == count * 512`, failing
  closed with distinct errors for zero count, over-capacity, past-namespace-end, and
  length mismatch; `info @2` reports `readOnly = false`; `flush @3` stays fail-closed
  (a real NVMe FLUSH, opcode `0x00`, is a distinct verb -- see the `flush @3`
  capability below). The
  proof (`kernel/src/cap/nvme_blockdevice_writeblocks_proof.rs`, a clone of the
  arbitrary-LBA module) drives the full bring-up then `writeBlocks(5, 2, data)` with a
  caller-authored, non-zero, two-distinct-block 1024 B payload, followed by
  `readBlocks(5, 2)`, comparing the read-back byte-for-byte to the bytes written. It
  emits one `cloudboot-evidence: provider-nvme-blockdevice-writeblocks-durability
  <token>` marker labeled `write_path=blockdevice-writeblocks write_method=1
  write_slba=5 write_count=2 write_data_len=1024 readback_data_len=1024
  write_readback_match=true nvme_arm_read_only=false flush=fail-closed prp_pages=single`.
  Proof: `make run-cloud-provider-nvme-blockdevice-writeblocks-durability`. No
  schema/binding change (`writeBlocks @1` and `readBlocks @0` round-trip through
  existing bindings). Future work (not yet implemented): a `writable_fs` /
  `persistent_store` consumer mounted over the NVMe `BlockDevice` write arm, a real
  NVMe FLUSH on `flush @3`, a dedicated I/O-completion `Interrupt` route on the data
  path, and graduating the NVMe data plane out of the per-proof feature into
  always-built production.

- `ddf-nvme-multiprp-blockdevice-window-local-proof` extends the same
  `BlockDevice.writeBlocks @1` / `readBlocks @0` round-trip to a three-page NVMe
  PRP window. Under `cloud_nvme_blockdevice_multiprp_window_proof`,
  `BlockDeviceCap::nvme_write_blocks` and `BlockDeviceCap::nvme_read_blocks` accept
  `count <= 24` for the local proof geometry while the default and older proof
  builds keep the one-page `count <= 8` bound. The shared
  `nvme_brokered_io_sync_command` body resolves primary read/write data pages from
  parked-pool slots 5/6, a manager-owned PRP-list page from slot 7, read extension
  pages from slots 8/9, and write extension pages from slots 10/11. For the
  `writeBlocks(5, 24, data)` and `readBlocks(5, 24)` proof window it authors PRP1 as
  the primary data page and PRP2 as a PRP-list page containing two little-endian
  page pointers, matching the NVMe PRP-list subset in NVMe Base Specification 1.4
  §4.3. The provider still supplies only inline `Data` through the `BlockDevice`
  schema and never sees a host physical address, IOVA, PRP1, PRP2, PRP-list page
  address, SQE byte, doorbell offset, or doorbell value. Requests with zero count,
  count 25, namespace overflow, or length mismatch fail closed before any I/O SQ
  doorbell write. The release marker includes full-transfer FNV-1a hashes for the
  WRITE and read-back records so the kernel-side proof is not limited to the first
  two 16-byte block digests; the userspace smoke also compares all 12 KiB
  byte-for-byte. Proof: `make run-cloud-provider-nvme-blockdevice-multiprp-window`.

- `cloud-prod-nvme-blockdevice-flush-local-proof` arms the `NvmeBrokered` arm's
  `flush @3` (fail-closed until now): it authors a real NVMe FLUSH (NVM command-set
  opcode `0x00`, NSID-scoped, no data transfer) through the brokered sync command
  machinery and proves a `writeBlocks` then `flush` returns CQE status `0` and the
  written block survives the flush. A new parameter-free op
  `nvme_brokered_io_sync_flush_op_for_cap(handle, owner)`
  (`kernel/src/device_manager/stub.rs`) drives the shared single-call body
  `nvme_brokered_io_sync_command` with a FLUSH `SyncIoParams` (`opcode = 0x00`,
  `command_id = 8`, `slba = 0`, `block_count = 0`), rotating the I/O index off a
  kernel-side flush sequence (`next_flush_io_index()` => index 1, after the WRITE at 0
  and before the read-back at 2). The shared body learns the FLUSH shape (gated on the
  opcode): it skips the one-PRP1-page data bound, authors the SQE with NSID only and
  `PRP1 = 0`/`PRP2 = 0`/`CDW10..15 = 0` (no data page touched), and the WRITE/READ
  data-bearing path stays byte-identical for non-FLUSH opcodes. Under
  `cloud_nvme_blockdevice_flush_proof` (implies and supersedes
  `cloud_nvme_blockdevice_writeblocks_proof`, the flush proof's true sibling, so the
  write/read arms and the whole brokered I/O chain are reused unchanged),
  `BlockDeviceCap::nvme_flush` returns `()` when the FLUSH was authored + the SQ
  doorbell rung + the completion consumed + CQE status `0`, failing closed otherwise.
  The proof (`kernel/src/cap/nvme_blockdevice_flush_proof.rs`, a clone of the
  writeblocks module) drives the full bring-up then `writeBlocks(5, 2, data)`,
  `flush()`, and `readBlocks(5, 2)`, comparing the post-flush read-back byte-for-byte
  to the bytes written. It emits one `cloudboot-evidence: provider-nvme-blockdevice-flush
  <token>` marker labeled `flush_path=blockdevice-flush flush_method=3
  nvme_flush_opcode=0x00 flush_cqe_status=0 write_then_flush_ok=true
  flush_data_transfer=none prp1=0 prp2=0 write_readback_after_flush_match=true
  reboot_persistence=deferred durability_proof=flush-completion-only
  virtio_flush_regression=green`. Proof:
  `make run-cloud-provider-nvme-blockdevice-flush`. No schema/binding change
  (`flush @3 () -> ()` round-trips through existing bindings with its empty
  params/result). Future work (not yet implemented): an NVMe reboot-persistence
  pass and crash-consistency where the FLUSH barrier specifically changes the survival
  outcome (a flushed write surviving a forced poweroff an unflushed one would not),
  routing `File.sync` / the writable-fs / persistent-store sync through this FLUSH, a
  dedicated I/O-completion `Interrupt` route on the data path, NLB>1 spanning multiple
  PRP pages with a PRP list, and graduating the NVMe data plane out of the per-proof
  feature into always-built production.

- `cloud-prod-nvme-blockdevice-reboot-persistence-local-proof` closes the
  reboot-persistence gap the flush proof named first: it proves a normally committed
  + FLUSHED write survives a CLEAN reboot through the same `BlockDevice` interface,
  the two-boot analogue of `run-storage-persist` on the NVMe arm. The Makefile recipe
  (`make run-cloud-provider-nvme-blockdevice-reboot-persistence`) creates ONE
  `nvme.raw` image and boots the non-`qemu` cloudboot kernel over it TWICE WITHOUT
  regenerating it between boots. The provider self-selects its boot phase by probing
  the LBA 5..6 window through `readBlocks(5, 2) @0` -- the data window itself is the
  guard sentinel: boot 1 reads back all-zero (fresh namespace), takes the writer
  branch, and issues `writeBlocks(5, 2, data) @1` + a real `flush() @3` (CQE status
  `0`); QEMU restarts against the SAME backing file and boot 2 reads back the known
  payload, takes NO writer branch, and the single read-back verifies persistence. The
  proof reuses the landed `writeBlocks @1` / `flush @3` / `readBlocks @0` arms and the
  brokered sync command machinery unchanged; the only kernel-internal additions are a
  flat per-boot single-call I/O op log (reworked from the flush proof's rigid
  WRITE -> FLUSH -> read-back state machine so the verifier boot can record a READ
  with no prior WRITE in the same boot), the data-window phase select, and the
  cross-boot `phase=1|2` marker labels. The cap-waiter route + headline marker come
  from `kernel/src/cap/nvme_blockdevice_reboot_persistence_proof.rs` (a clone of the
  flush module under `cloud_nvme_blockdevice_reboot_persistence_proof`, which implies
  and supersedes `cloud_nvme_blockdevice_flush_proof`). Each boot emits one
  `cloudboot-evidence: provider-nvme-blockdevice-reboot-persistence <token>` marker
  carrying its `phase` (`phase=1 ... write_then_flush_ok=true flush_cqe_status=0
  boot_role=writer-flush` on boot 1; `phase=2 ... reboot_persistence_match=true
  boot_role=verifier durability_proof=clean-reboot-persistence` on boot 2). The
  reboot-persistence gate is the cross-boot correlation: boot 1's persisted block
  digests equal boot 2's read-back block digests (and both equal the known payload).
  No schema/binding change. Future work (not yet implemented): crash-consistency
  where the FLUSH barrier specifically changes the survival outcome under an induced
  mid-flush crash (the analogue of `run-storage-writable-recovery`), routing
  `File.sync` / the writable-fs / persistent-store sync through this FLUSH, a dedicated
  I/O-completion `Interrupt` route on the data path, NLB>1 spanning multiple PRP pages,
  and graduating the NVMe data plane out of the per-proof feature into always-built
  production.

- `cloud-prod-nvme-blockdevice-flush-crash-consistency-local-proof` covers the
  flushed-write-survives half of crash-consistency: it proves a normally
  committed + FLUSHED write
  survives a FORCED poweroff (an abrupt `kill -9` of the QEMU process AFTER the flush
  barrier completed), the NVMe `BlockDevice` analogue of `run-storage-writable-recovery`.
  The Makefile recipe (`make run-cloud-provider-nvme-blockdevice-flush-crash-consistency`)
  creates ONE `nvme.raw` image, boots the non-`qemu` cloudboot kernel over it in the
  BACKGROUND (boot 1: empty namespace -> writer branch -> `writeBlocks(5, 2, data) @1`
  + real `flush() @3`, CQE status `0`), watches the kernel log for the bounded arming
  marker `[nvme-blockdevice-flush-crash-consistency] kernel: flushed write armed;
  awaiting forced poweroff`, `kill -9`s the QEMU PID (the forced poweroff AFTER the
  flush barrier), then boots a SECOND time over the SAME file WITHOUT regenerating it
  (boot 2: verifier -> single `readBlocks(5, 2) @0` read-back). The proof reuses the
  reboot-persistence predecessor's two-boot phase select, the landed
  `writeBlocks @1` / `flush @3` / `readBlocks @0` arms, and the brokered sync command
  machinery unchanged; the only kernel-internal additions over the predecessor are the
  phase-1 arm-and-spin window after the flush (`on_release` emits the arming marker and
  spins forever so the recipe can `kill -9` at that point) and the forced-poweroff
  marker labels. The cap-waiter route + headline marker come from
  `kernel/src/cap/nvme_blockdevice_flush_crash_consistency_proof.rs` (a clone of the
  reboot-persistence module under `cloud_nvme_blockdevice_flush_crash_consistency_proof`,
  which implies and supersedes `cloud_nvme_blockdevice_reboot_persistence_proof`). Boot
  1 emits one `cloudboot-evidence: provider-nvme-blockdevice-flush-crash-consistency
  <token>` marker carrying `phase=1 ... write_then_flush_ok=true flush_cqe_status=0
  armed_forced_poweroff=true boot_role=writer-flush-arm` before the spin; boot 2 emits
  one carrying `phase=2 ... flush_survives_forced_poweroff=true boot_role=verifier
  durability_proof=flush-survives-forced-poweroff`. The crash-consistency gate is the
  cross-boot correlation: boot 1's persisted block digests equal boot 2's read-back
  block digests (and both equal the known payload), AND boot 1 reached the arm-and-spin
  window (was forcibly killed, did not take the verifier branch). No schema/binding
  change. Scoped honestly: "an unflushed write rolls back" is NOT provable under QEMU's
  `-device nvme` `cache=writeback` model (the host page cache survives `kill -9`), so
  the differential-rollback half is NOT claimed
  (`unflushed_rollback=not-provable-under-qemu-nvme-model`). Future work (not yet implemented): a dedicated I/O-completion `Interrupt` route on the data path,
  NLB>1 spanning multiple PRP pages, and graduating the NVMe data plane out of the
  per-proof feature into always-built production. (Both higher-level consumer FLUSH
  routings are now closed: the writable-fs `File.sync` half by
  `cloud-prod-nvme-consumer-sync-to-flush-local-proof` and the persistent-`Store`
  put-commit half by `cloud-prod-nvme-persistent-store-sync-to-flush-local-proof`,
  both below.)

- `cloud-prod-nvme-dedicated-io-completion-interrupt-local-proof` moves the
  NVMe `BlockDevice.writeBlocks @1` / `readBlocks @0` data-completion handoff
  off the synchronous I/O-CQ poll return path and onto a dedicated data
  `Interrupt` route. The spec basis remains NVMe Base Specification 1.4
  submission/completion queue doorbells and completion entries (§3 controller
  registers, §4 queue management and CQ phase/status handling, §6 NVM WRITE /
  READ commands). The proof keeps table entry 0 for the CREATE I/O CQ/SQ admin
  completions and adds table entry 1 for the data I/O CQ completions; both
  routes are kernel-injected cap-waiter MSI-X routes, not a device-autonomous
  interrupt claim. The implementation entry points are
  `kernel/src/cap/nvme_io_completion_interrupt_proof.rs`
  (`init`, `invoke_wait`, `invoke_acknowledge`,
  `poll_blockdevice_completions`, `emit_marker`),
  `kernel/src/device_manager/stub.rs`
  (`nvme_brokered_io_completion_interrupt_submit_op_for_cap`,
  `nvme_brokered_io_completion_interrupt_complete_op_for_cap`,
  `nvme_io_completion_interrupt_submit_record_buffers_live`), and
  `kernel/src/cap/block_device.rs` (`nvme_interrupt_write_blocks`,
  `nvme_interrupt_read_blocks`, `call_with_context`). The manager still authors
  queue bases and PRP1 from the live `DMAPool` ledger, copies the caller's write
  payload into the parked write-data page, consumes the I/O CQ entry at
  `Interrupt.acknowledge`, advances the I/O CQ head doorbell (`0x100c`), and
  posts the deferred `BlockDevice` completion only after the bounded caller CQ
  has room. The proof (`make run-cloud-provider-nvme-io-completion-interrupt`)
  drives `writeBlocks(5, 2, data)`, waits/acks the data route, observes the
  deferred write completion, then drives `readBlocks(5, 2)`, waits/acks the data
  route, and receives the read bytes through the standard
  `block_device::read_blocks_results` data field. The headline marker
  `cloudboot-evidence: provider-nvme-io-completion-interrupt <token>` pins
  `create.entry.0`, `io.entry.1`, `data_route_distinct_from_create_route=true`,
  four dispatches/four deferred EOIs, both deferred `BlockDevice` completions
  posted, and a byte-for-byte write/read-back match. Scoped honestly: queue-base
  and PRP addresses remain hidden (`host_physical_user_visible=0`,
  `iova_export=disabled-future-only`, `prp_source=manager-ledger`); multi-PRP
  windows (`count > 8`), provider-written PRP/SGL/address lanes, live cloud, a
  second namespace, FUA/DSM, and device-autonomous MSI-X delivery remain future
  work.

- `cloud-prod-readonly-fs-over-nvme-blockdevice-local-proof` provides a
  `readonly_fs`-style consumer over the NVMe `BlockDevice` arm: the read-only
  filesystem mount reads its sectors through the NVMe `BlockDevice` cap instead
  of the kernel-owned virtio-blk free functions. `kernel/src/cap/readonly_fs.rs`
  gains a `BlockSource` seam abstracting the two reads a mount needs (device
  geometry + range read). The always-built `Virtio` variant routes to the same
  `crate::virtio` free functions, so `make run-storage-fs` stays byte-identical; the
  `Nvme` variant (built only under `cloud_readonly_fs_over_nvme_proof`) reads through
  a granted NVMe-backed `BlockDevice` -- geometry from the IDENTIFY Namespace
  claim (see the READ-arm graduation entry), each chunked range read through
  `nvme_brokered_io_sync_read_window_op_for_cap`, one 4 KiB PRP1 page per call.
  Because the brokered controller is brought up by the userspace provider, the NVMe
  root `Directory` (granted via `read_only_fs_root`) defers its mount-parse to the
  first `Directory.open`. The proof
  (`kernel/src/cap/readonly_fs_over_nvme_proof.rs`, a clone of the arbitrary-LBA
  module) drives the full bring-up, then seeds a tiny `CAPOSRO1` image through the
  repurposed `@20` op (one manager-baked sector per call: superblock @ LBA 0, entry
  table @ LBA 1, file data @ LBA 2), mounts the filesystem over the NVMe
  `BlockSource`, opens the one seeded file, reads it, and compares the bytes. It
  emits one `cloudboot-evidence: provider-readonly-fs-over-nvme <token>` marker
  labeled `read_path=readonly-fs-over-blockdevice fs_format=CAPOSRO1
  block_source=nvme-blockdevice file_match=true superblock_via_nvme=true
  entry_table_via_nvme=true extent_via_nvme=true` after the kernel verifies each
  read-back block-0 digest against the baked image. Proof:
  `make run-cloud-provider-readonly-fs-over-nvme`. The malformed-image fail-closed
  paths (bad superblock magic, out-of-range entry-table or file extent) are the
  unchanged shared `mount_root_inner`/`parse_entries` validation in
  `kernel/src/cap/readonly_fs.rs` -- the `BlockSource` seam swaps only the
  block-read backend, so the existing `MountError` checks covered by
  `make run-storage-fs` apply identically over the NVMe `BlockSource`; the NVMe
  arm additionally rejects an over-range range read with the arbitrary-LBA arm's
  fail-closed error. Future work (not yet implemented): a multi-file directory walk / `Directory.list` traversal over NVMe, files
  whose extents span many one-PRP1-page chunks, NVMe write/flush durability through
  `BlockDevice` (the image is seeded via the manager-owned `@20` op, not
  `writeBlocks`), a dedicated I/O-completion `Interrupt` route, and graduating the
  NVMe data plane and the readonly_fs NVMe mount out of the per-proof feature into
  always-built production.
- `cloud-prod-readonly-fs-over-nvme-multifile-dirwalk-local-proof` extends the
  read-only filesystem to multi-file directories: it lists a
  directory with more than one entry and reads two distinct files over the NVMe
  `BlockDevice` cap, one of which spans multiple 4 KiB chunks. The baked image
  (`kernel/src/cap/readonly_fs_over_nvme_multifile_proof.rs`, a clone of the
  single-file module) grows to 12 sectors -- superblock @ LBA 0, a two-record entry
  table @ LBA 1, a one-sector small file @ LBA 2, and a 9-sector large file @
  LBA 3..11 carrying a deterministic position-dependent byte pattern. The large
  `File.read` covers nine sectors, so the `read_range` chunk loop issues TWO
  `BlockDevice.readBlocks @0` calls (an 8-sector chunk @ LBA 3 + a 1-sector chunk @
  LBA 11) -- the multi-chunk path the single-file arm never exercised. The proof
  identifies each recorded read by `(slba, count)` and verifies it byte-for-byte
  with a per-read FNV-1a-64 over the full transfer (computed in `device_manager`
  alongside the block digests), so a dropped trailing chunk fails closed. Because
  per-sector seeding (12) plus the filesystem reads (5) issue 17 single I/O commands
  and the monotonic I/O SQ/CQ index must stay inside one first CQ pass, the build
  raises `device_manager::stub::NVME_IO_QUEUE_DEPTH` from 8 to 32 (create
  `cdw10=0x001f0001`); the change is inert for every other NVMe proof build. It
  emits one `cloudboot-evidence: provider-readonly-fs-over-nvme-multifile <token>`
  marker labeled `dir_entry_count=2 file_count=2 files_distinct=true
  large_file_full_match=true large_file_read_blocks_calls=2 superblock_via_nvme=true
  entry_table_via_nvme=true extents_via_nvme=true` (plus the single-file arm's
  discipline labels). Proof:
  `make run-cloud-provider-readonly-fs-over-nvme-multifile`; the virtio mount path
  stays byte-identical (`make run-storage-fs`). Future work (not yet implemented): NVMe write/flush durability through `BlockDevice`, a dedicated
  I/O-completion `Interrupt` route on the data path, NLB > 1 spanning multiple PRP
  pages in a single call, sub-directory trees, and graduating the NVMe data plane
  and the readonly_fs NVMe mount out of the per-proof feature into always-built
  production.
- `cloud-prod-persistent-store-over-nvme-blockdevice-local-proof` provides a
  writable consumer over the NVMe `BlockDevice` write arm: the disk-backed
  persistent `Store` mounts over the NVMe `BlockDevice` write arm and proves a
  put-then-get durability round-trip. `kernel/src/cap/persistent_store.rs` gains a
  read+write `BlockSource` seam (mirroring `readonly_fs::BlockSource` but with a
  `write_blocks` method). The always-built `Virtio` variant routes to the same
  `crate::virtio` free functions byte-identically (including the
  `data_region_base_lba()` installable-disk offset, folded into the variant), so
  `make run-storage-persist` stays green; the `Nvme` variant (built only under
  `cloud_persistent_store_over_nvme_proof`) reads through
  `nvme_brokered_io_sync_read_window_op_for_cap` and writes through
  `nvme_brokered_io_sync_write_window_op_for_cap`, one 4 KiB PRP1 page per call.
  Because the brokered controller is brought up by the userspace provider, the NVMe
  root `Store` (granted via `persistent_store`) defers its mount-parse to the first
  `Store` call. The proof (`kernel/src/cap/persistent_store_over_nvme_proof.rs`, a
  clone of the writeblocks module) drives the full bring-up, then seeds a `CAPOSST1`
  superblock + empty entry table through the repurposed `@20` op (superblock @
  LBA 0, entry table @ LBA 1), and exercises the granted `Store`: `Store.put` writes
  the data extent (LBA 2), entry-table sector, and superblock through
  `BlockDevice.writeBlocks @1`, and `Store.get` reads the extent back through
  `BlockDevice.readBlocks @0`. The kernel attests the put WRITE and get READ block-0
  digests both equal the payload digest and differ from the pre-put (zero) extent,
  and userspace compares the returned bytes byte-for-byte. It emits one
  `cloudboot-evidence: provider-persistent-store-over-nvme <token>` marker labeled
  `write_path=store-put-over-blockdevice-writeblocks
  read_path=store-get-over-blockdevice-readblocks consumer=persistent-store
  store_iface=Store block_iface=BlockDevice store_format=CAPOSST1 write_method=1
  read_method=0 put_get_roundtrip_match=true durability_attested=true
  virtio_regression=green`. Because the round-trip issues 8 single I/O commands
  (2 seed WRITEs + 2 deferred-mount READs + 3 `Store.put` WRITEs + 1 `Store.get`
  READ) whose last monotonic CQ head reaches 8 -- past the default depth-8 first
  pass -- the build raises `device_manager::stub::NVME_IO_QUEUE_DEPTH` from 8 to 16;
  the change is inert for every other NVMe proof build. Proof:
  `make run-cloud-provider-persistent-store-over-nvme`. No schema/binding change
  (`Store.put`/`get`/`has`/`delete` and `BlockDevice.writeBlocks @1`/`readBlocks @0`
  round-trip through existing bindings). Future work (not yet implemented):
  routing the writable filesystem (`CAPOSWF1`) over the NVMe write arm, a real NVMe
  FLUSH on `flush @3` (stays fail-closed), an NVMe reboot-persistence pass, a
  dedicated I/O-completion `Interrupt` route on the data path, NLB > 1 spanning
  multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature
  into always-built production.

- `cloud-prod-writable-fs-over-nvme-blockdevice-local-proof` mounts the full
  disk-backed writable filesystem over the NVMe arm
  (`kernel/src/cap/writable_fs.rs`, the `CAPOSWF1` node-table tree with `mkdir` /
  `rename` / `remove` and a fail-closed single-writer policy). `writable_fs`
  carries a read+write `BlockSource` seam
  mirroring `persistent_store`'s: the `Virtio` variant (built only in the qemu/
  installable storage builds) routes to the `crate::virtio` free functions
  byte-identically (folding the `data_region_base_lba()` offset), so
  `make run-storage-writable` / `make run-storage-writable-recovery` stay green; the
  `Nvme` variant (built only under `cloud_writable_fs_over_nvme_proof`) reads through
  `nvme_brokered_io_sync_read_window_op_for_cap` and writes through
  `nvme_brokered_io_sync_write_window_op_for_cap`, one 4 KiB PRP1 page per call.
  Because `writable_fs` uses a process-wide singleton volume, the NVMe `writable_fs_root`
  grant stages the live `device_mmio` handle and defers the singleton mount-parse to
  the first `Directory`/`File` call. The proof
  (`kernel/src/cap/writable_fs_over_nvme_proof.rs`, a clone of the persistent-store
  module that supersedes and drops it) seeds a `CAPOSWF1` superblock + root + one
  seeded file through the `@20` op, contiguously from LBA 256 (superblock @256, node
  table @257, seeded file extent @258), then exercises the granted filesystem. READ
  arm: opening the seeded file triggers the deferred mount, which reads the seeded
  extent (@258) back through `BlockDevice.readBlocks @0`; `File.read` returns the RAM
  copy the mount loaded. WRITE arm: `File.write` to a fresh file lands a bump-allocated
  data extent (@259) + node-record + superblock through `BlockDevice.writeBlocks @1`.
  Because `File.read` serves the RAM content cache the mount loaded (not a fresh disk
  read), a same-extent disk re-read of a just-written file -- which needs a remount/
  reboot -- is out of scope; the matching block-0 digests prove the same payload
  traversed both device arms, each device-acked. The single-writer policy is proven
  intact: a `File.write` through a second granted `writable_fs_root` cap fails closed.
  It emits one `cloudboot-evidence: provider-writable-fs-over-nvme <token>` marker
  labeled `write_path=file-write-over-blockdevice-writeblocks
  read_path=file-read-over-blockdevice-readblocks consumer=writable-fs
  fs_iface=Directory file_iface=File block_iface=BlockDevice fs_format=CAPOSWF1
  write_method=1 read_method=0 write_read_roundtrip_match=true durability_attested=true
  single_writer_policy=enforced second_writer_denied=true recovery_over_nvme=deferred
  virtio_regression=green`. The round-trip issues 11 single I/O commands (3 seed
  WRITEs + 3 deferred-mount READs + 5 `File.write` WRITEs); `NVME_IO_QUEUE_DEPTH`
  stays 16. Proof: `make run-cloud-provider-writable-fs-over-nvme`. No schema/binding
  change. Future work (not yet implemented): the unclean-shutdown / forced-
  poweroff recovery window (`recovery_crash_after_record`) over the NVMe arm (the
  analogue of `run-storage-writable-recovery`, proved on virtio here), a real NVMe
  FLUSH on `flush @3`, an NVMe reboot-persistence pass, a dedicated I/O-completion
  `Interrupt` route on the data path, NLB > 1 spanning multiple PRP pages, and
  graduating the NVMe data plane out of the per-proof feature into production.

- `cloud-prod-writable-fs-over-nvme-recovery-local-proof` proves the
  unclean-shutdown / forced-
  poweroff RECOVERY window (`recovery_crash_after_record`, the record-sector-written-
  but-superblock-not-yet-committed window in `kernel/src/cap/writable_fs.rs`)
  over the NVMe `BlockDevice` arm. A new `cloud_writable_fs_over_nvme_recovery_proof`
  feature implies (and supersedes the happy-path proof module/route/init of)
  `cloud_writable_fs_over_nvme_proof` and widens the `storage_writable_recovery`
  crash-window cfg gate so the same `recovery-orphan.txt` sentinel arms an induced
  forced poweroff when the writable filesystem is NVMe-backed. The recovery cap-waiter
  module (`kernel/src/cap/writable_fs_over_nvme_recovery_proof.rs`) reuses the NVMe
  `BlockSource` arm, deferred mount, third `WritableFsRoot` grant arm, window ops, and
  I/O queue create unchanged. Unlike the happy-path proof, the `CAPOSWF1` image is
  HOST-BUILT (`tools/mkstore-image --writable-nvme` lays an empty superblock +
  root-only node table) rather than seeded through `@20`: a two-boot SAME-image
  recovery flow cannot re-seed on pass 2 without clobbering the pass-1 committed state,
  and the read ordering does not depend on a per-boot seed here (mirroring the virtio
  `run-storage-writable-recovery` proof, which also boots a host-built image twice).
  `make run-cloud-provider-writable-fs-over-nvme-recovery` boots QEMU twice with
  `-device nvme` against one shared raw drive file: pass 1 commits a `File.write` +
  sub-directory through `writeBlocks @1`, allocates the sentinel (its record sector
  lands on the namespace), and spins; the harness `kill -9`s QEMU before the superblock
  commit. Pass 2 boots the SAME file, mounts by reading the old superblock + node table
  back through `readBlocks @0`, and asserts the recovered tree omits the orphan slot
  (exactly the committed entries remain), preserves the committed mutation (file size +
  content), accepts a usable post-recovery write, and denies a second-grant write
  (single-writer policy). The userspace smoke emits one
  `cloudboot-evidence: provider-writable-fs-over-nvme-recovery <token>` marker labeled
  `crash_window=record-written-superblock-uncommitted orphan_slot_ignored=true
  committed_mutation_survived=true post_recovery_write_ok=true recovery_over_nvme=true
  single_writer_policy=enforced durability_basis=host-page-cache real_flush=deferred
  reboot_persistence=deferred io_completion=polled interrupt_wait=not-used-on-data-path
  virtio_recovery_regression=green live_cloud=not-attempted`; the kernel proof module's
  `on_release` independently attests the cap-waiter route lifecycle + the two CREATE I/O
  queue dispatch/ack cycles. The two CREATE I/O queue commands keep their production
  `Interrupt.wait`/`acknowledge` cap-waiter cycles; the data path stays polled.
  Bounded-proof caveat: one record-vs-commit window, host-page-cache durability (the two
  passes share one backing file; a `kill -9` preserves the host page cache), NOT media
  crash-consistency and NOT a real NVMe FLUSH barrier. No schema/binding change. Future work (not yet implemented): a real NVMe FLUSH on `flush @3`, an NVMe clean-
  reboot-persistence pass, NLB > 1 spanning multiple PRP pages, a dedicated I/O-completion
  `Interrupt` route on the data path, and graduating the NVMe data plane into production.
  Proof: `make run-cloud-provider-writable-fs-over-nvme-recovery`.

- `cloud-prod-nvme-consumer-sync-to-flush-local-proof` routes a consumer-level
  `File.sync @4` to a real `BlockDevice.flush @3` NVMe FLUSH media barrier
  instead of a write-side no-op. `writable_fs::BlockSource` carries a `flush()`
  arm (the `Virtio` variant returns `Ok(())` -- the driver negotiates no
  `VIRTIO_BLK_F_FLUSH`, so virtio `File.sync` stays a byte-identical no-op and
  `make run-storage-writable` stays green; the `Nvme` variant drives
  `nvme_brokered_io_sync_flush_op_for_cap` with the same success predicate the
  read/write arms apply), and `File.sync @4` (`writable_fs.rs`) routes through it
  AFTER the `claim_writer` gate. The feature `cloud_nvme_consumer_sync_to_flush_proof`
  composes `cloud_nvme_blockdevice_flush_crash_consistency_proof` (arming the real
  `flush @3` op-for-cap) and `cloud_writable_fs_over_nvme_proof` (the consumer arm),
  dropping both predecessors' proof modules. The proof
  (`kernel/src/cap/nvme_consumer_sync_to_flush_proof.rs`) seeds the `CAPOSWF1` image,
  then `Directory.open(CREATE)` + `File.write` (through `writeBlocks @1`), then
  `File.sync()` -- which issues the real NVMe FLUSH (opcode `0x00`, CQE status `0`, no
  data transfer: PRP1 = 0, PRP2 = 0) -- then `File.read` confirming the bytes survive.
  The single-writer policy is shown intact: a `File.sync` through a second granted cap
  fails closed BEFORE any FLUSH is issued (`denied_sync_issues_no_flush=true`), and the
  kernel asserts exactly one consumer-sync FLUSH (status `0`) was recorded. It emits
  one `cloudboot-evidence: provider-nvme-consumer-sync-to-flush <token>` marker labeled
  `consumer_sync_path=File.sync-to-nvme-flush sync_method=4 flush_method=3
  nvme_flush_issued_by_consumer_sync=true nvme_flush_opcode=0x00 flush_cqe_status=0
  write_sync_read_roundtrip_match=true single_writer_policy=enforced
  durability_proof=consumer-sync-issues-real-flush virtio_sync_noop=byte-identical`. The
  round-trip issues 12 single I/O commands (3 seed WRITEs + 3 mount READs + 5
  `File.write` WRITEs + 1 `File.sync` FLUSH); `NVME_IO_QUEUE_DEPTH` stays 16. Proof:
  `make run-cloud-provider-nvme-consumer-sync-to-flush`. No schema/binding change. The
  bounded claim is consumer-sync-issues-real-flush, NOT a power-loss survival
  differential (`unflushed_rollback=not-provable-under-qemu-nvme-model`; the cross-boot
  forced-poweroff differential stays as the crash-consistency proof established it).
  Future work (not yet implemented): routing the persistent `Store`'s commit path
  through the FLUSH, graduating the NVMe data plane into production, NLB > 1 spanning
  multiple PRP pages, and a dedicated I/O-completion `Interrupt` route on the data path.

- `cloud-prod-nvme-persistent-store-sync-to-flush-local-proof` routes the
  persistent `Store`'s put-commit path
  to a real `BlockDevice.flush @3` NVMe FLUSH media barrier. The `Store` has NO
  `sync` schema method (`put @0` / `get @1` / `has @2` / `delete @3` only), so the
  routing point is the existing put-commit path, not a new method:
  `persistent_store::BlockSource` gains a `flush()` arm (the `Virtio` variant returns
  `Ok(())` -- no `VIRTIO_BLK_F_FLUSH`, so the virtio `Store.put` commit stays a
  byte-identical no-op and `make run-storage-persist` stays green; the `Nvme` variant
  drives `nvme_brokered_io_sync_flush_op_for_cap` with the same success predicate the
  read/write arms apply, but only when the flush lineage is composed so the plain
  `make run-cloud-provider-persistent-store-over-nvme` commit stays a no-op), and
  `put_blob` (`persistent_store.rs`) issues it AFTER the `flush_superblock` write (the
  ordering commit point) succeeds. A FLUSH that fails closed rolls back the in-RAM
  `entry_count`/`next_free_sector` so no live index insert occurs. The feature
  `cloud_nvme_persistent_store_sync_to_flush_proof` composes (and drops/supersedes) the
  `cloud_nvme_consumer_sync_to_flush_proof` lineage (transitively the real `flush @3`
  op and the persistent-store-over-NVMe read+write seam). The proof
  (`kernel/src/cap/nvme_persistent_store_sync_to_flush_proof.rs`) seeds the `CAPOSST1`
  image, then `Store.put(data)` -- which writes the data extent / entry sector /
  superblock through `writeBlocks @1`, then issues the real NVMe FLUSH (opcode `0x00`,
  CQE status `0`, no data transfer: PRP1 = 0, PRP2 = 0) after the superblock commit --
  then `Store.get(hash)` confirming the bytes survive. The kernel asserts exactly one
  `Store`-commit FLUSH (status `0`) recorded AFTER the superblock write
  (`superblock_commit_before_flush=true`) and emits one
  `cloudboot-evidence: provider-nvme-persistent-store-sync-to-flush <token>` marker
  labeled `consumer_commit_path=store-put-to-nvme-flush put_method=0 flush_method=3
  nvme_flush_issued_by_store_commit=true nvme_flush_opcode=0x00 flush_cqe_status=0
  superblock_commit_before_flush=true put_get_roundtrip_match=true
  failed_flush_issues_no_live_entry=true virtio_commit_noop=byte-identical
  durability_proof=store-commit-issues-real-flush`. The round-trip issues 9 single I/O
  commands (2 seed WRITEs + 2 mount READs + 3 `Store.put` WRITEs + 1 put-commit FLUSH +
  1 `Store.get` READ). Proof:
  `make run-cloud-provider-nvme-persistent-store-sync-to-flush`. No schema/binding
  change. The bounded claim is store-commit-issues-real-flush, NOT a power-loss
  survival differential (`unflushed_rollback=not-provable-under-qemu-nvme-model`).
  Future work (not yet implemented): graduating the NVMe data plane out of the
  per-proof features into always-built production, with the dedicated
  I/O-completion `Interrupt` route on the data path and NLB > 1 spanning multiple
  PRP pages.
- `cloud-prod-nvme-sync-io-state-seam-always-built-local-proof` extracts the
  brokered-NVMe synchronous-I/O state the shared op body depends on into ONE
  always-built module `device_manager::nvme_sync_io_state`
  (`kernel/src/device_manager/nvme_sync_io_state.rs`), compiled in the default
  no-proof `cargo build` (not behind any `cloud_nvme_*_proof` feature). The seam owns:
  the functional I/O SQ/CQ reservation cursor (`reserve_io_slot()` = a
  queue-depth-bounded first-pass slot reservation, with one live in-flight
  reservation, so a single-call command cannot reuse a stale CQE before the
  created queue wraps); the admissions predicate
  (`sync_{read,write,flush}_admitted` = bounded-ledger-not-full plus no active
  reservation); and the ordered `SyncIoRecord` op-log ledger
  (`record_sync_{read,write,flush}`, `DIGEST_BYTES`, op kind implied by
  `record.opcode` = `0x00` FLUSH / `0x01` WRITE / `0x02` READ). The shared body
  `nvme_brokered_io_sync_command` and the
  `nvme_brokered_io_sync_{read_window,write_window,flush}_op_for_cap` /
  `nvme_brokered_io_sync_read_bytes_op_for_cap` entries
  (`kernel/src/device_manager/stub.rs`) now record/admit/index through this seam
  instead of the per-proof `nvme_io_proof` alias; the alias stays only for the
  genuinely per-proof create/io-phase/seed ledgers (`record_create_*`,
  `record_io_*`, `next_io_phase`, `next_seed_slba`, `seed_image_sector`). All 15
  NVMe `BlockDevice` proof modules (`kernel/src/cap/*_proof.rs`) delegate to the
  seam: each deletes its private `SyncIoRecord`/`SyncHandoff`/record/admit/index
  copy and reconstructs its release-marker view from the seam's ordered op-log
  snapshot, byte-identical. The create-ordering / write-before-read orderings the
  per-proof harnesses formerly folded into their admit predicates are proof-harness
  assertions each proof still re-derives in its release marker; the always-built
  admit keeps only the production-honest bounded-ledger invariant (a real read
  needs no prior write). Code-location refactor only -- no schema/binding change,
  no new device behavior; default `cargo build` and `cargo build --features qemu`
  stay warning-free (the seam carries a module-level `#[allow(dead_code)]` for its
  dormant not-yet-activated entry points). This UNBLOCKS the read-arm graduation:
  the read body's sync-I/O symbols now resolve in the default build, so the
  graduation can make the read body always-built and gate activation behind a
  fail-closed runtime probe while the proof exercises the same always-built seam.
  Proof: every existing NVMe `BlockDevice` proof stays green (`make
  run-cloud-provider-nvme-blockdevice-arbitrary-lba-read` and the rest of the
  chain) and `make run-net` is byte-identical.

These brokered capabilities target the no-IOMMU QEMU/GCP lane, where queue-base
and PRP addresses are materialized by the kernel/device manager from the live
ledger. On a direct-remapping/vIOMMU gate the provider-written validator model
(`nvme-userspace-bind-and-controller-bringup`) applies instead. The PCI
metadata-only discovery summary (`pci: nvme metadata ...`) that also runs on
`make run-pci-nvme` is the separate enumeration-evidence surface in
`kernel/src/pci.rs`.

## 1. Spec basis

- **Device**: NVM Express PCI controller. PCI class `0x01` (mass storage),
  subclass `0x08` (NVM), programming interface `0x02` (NVM Express). Detected by
  `PciDevice::is_nvme_controller` (`kernel/src/pci.rs`,
  `NVME_CLASS_MASS_STORAGE` / `NVME_SUBCLASS_NVM` / `NVME_PROG_IF_NVM_EXPRESS`).
  QEMU instance: `-device nvme,drive=...,serial=...` on the `q35` machine.
- **Authoritative spec**: NVM Express Base Specification (NVMe 1.4 / 2.0). The
  fields the validator relies on:
  - Controller registers `CAP`, `CC` (with `CC.EN` controller enable), `AQA`,
    `ASQ`, `ACQ` (NVMe Base Spec §3.1 controller register map). `ASQ`/`ACQ`
    base addresses have bits 11:0 reserved → 4 KiB page-aligned.
  - Submission/completion queue base addresses and the per-queue doorbell
    registers in the doorbell stride region (§3.1.x, §7.6 queue setup).
  - Physical Region Page (PRP) entries PRP1/PRP2 and the PRP List (§4.3): PRP
    list pages and list-pointer PRP2 entries are page-aligned; a transfer that
    needs more than one PRP list page chains a further list, which this bounded
    subset does not follow.
- **Reference driver** (optional cross-check): the Linux `drivers/nvme/host/`
  queue-setup and PRP-build paths (`nvme_setup_prps`, `nvme_pci_configure_admin_queue`).

## 2. Wire format (validator-relevant subset)

The validator reads only the device-visible addresses a single doorbell newly
publishes, plus the byte extent of the region each names. It does not decode
command opcodes, data payloads, or completion entries. Scanned items are modeled
by `ScanItem` (`kernel/src/cap/nvme_doorbell_validator.rs`).

- **Queue-base registers** (`ScanItem::QueueBase`): the `ASQ`/`ACQ` admin queue
  bases (scanned on the `CC.EN` / queue-arm write, `ScanKind::QueueArm`) and the
  I/O `SQ`/`CQ` bases. The named region is `entries × entry_size` bytes (e.g. an
  admin SQ is `depth × 64`, an admin CQ is `depth × 16`). Required alignment:
  page (4 KiB).
- **SQ entry PRP pointers** (`ScanItem::Prp`): for each NVMe command newly made
  visible by an SQ tail doorbell (`ScanKind::SqTailDoorbell`), the PRP1 data
  pointer, the PRP2 data-or-list pointer, and one level of PRP-list indirection.
  `list_depth` counts indirection already followed (`0` = a PRP carried in the
  SQE, `1` = a pointer inside the single PRP list page); `list_depth > 1` is the
  out-of-subset deeper-chain case and fails closed (`MAX_PRP_LIST_DEPTH`).
  The named region is the transfer length (PRP data) or one page (a PRP list).
  Required alignment: page (4 KiB).

The scan is **on-notify only**: the provider may freely write its own mapped DMA
pages between doorbells; nothing device-reachable happens until a doorbell rings,
which is the single choke point the validator guards. Cost is
O(descriptors published by this doorbell).

## 3. capOS mapping

The validator is the kernel half of the Model B genuine-userspace-driver model
(`docs/proposals/nvme-model-b-doorbell-dma-validator.md`): the provider writes
the device-visible queue-base and PRP addresses itself, and the kernel validates
them on the doorbell path rather than minting them (Model A, the unchanged
virtio-net TX path).

- **Authority gate**: the live doorbell-path hook derives the owning provider's
  identity (`OwnerToken`) and live grant
  generation from the `DeviceMmio` grant record in the device-manager ownership
  ledger, never from provider-supplied bytes. The `cfg(qemu)` self-test resolves
  owner/generation from synthetic windows only.
- **`DeviceMmio`**: the validator is invoked from the pre-write step of the NVMe
  doorbell/queue-arm selected-write `DeviceMmio` claim
  (`kernel/src/cap/device_mmio.rs`; the existing `notifyDoorbell` path is the
  Model A virtio-net claim and does **not** trigger the validator). The scan
  completes — accept or reject — before the doorbell write is allowed to take
  effect, so the device never sees an unvalidated descriptor batch. BAR0 /
  doorbell pages stay device-uncacheable, NX, capability-scoped.
- **`DMAPool` / window descriptor**: for a direct-remapping/vIOMMU lane, a
  `DmaWindow` can name the owner's domain-scoped IOVA range with a live
  generation, and provider-written values can be checked against that range.
  On the current no-IOMMU lane, there is no provider-visible non-host-physical
  device-address namespace; the manager owns the physical bounce pages and must
  materialize queue-base and PRP/SGL fields itself. See
  `docs/dma-isolation-design.md` (Provider-Written Addresses And No-IOMMU
  Brokered Bounce).
- **`Interrupt`**: `completion_wakes_waiter` enforces the stale-completion gate —
  a completion wakes a waiter only if its submission scan was accepted *and* the
  generation it was validated under is still live; an unvalidated or
  retired-generation completion does not wake a waiter.
- **Fail-closed / validation rules** (`ScanReject`, all reject with no doorbell
  write and no waiter wake): `out-of-window`, `host-physical`,
  `cross-owner-alias`, `region-overrun`, `unaligned`, `deep-prp-chain`,
  `stale-generation`, and `invalid-region`. A doorbell rung after
  revoke/reset/regrant against a stale generation fails closed even when the byte
  value would have been in-window for the prior grant.
- **QEMU-emulable vs hardware-only**: the validator mechanism and its hostile-scan
  invariants are end-to-end provable in QEMU (`make run-pci-nvme`, the
  `nvme: validator ...` proof lines). Live controller bring-up over a real NVMe
  controller — admin/I/O queue creation, IDENTIFY, and a bounded read with the
  validator gating the real doorbell — is QEMU-emulable too and is covered by the
  brokered bring-up capabilities (§4-§9), not the validator mechanism itself.

## 4. Userspace bind (read-only controller bring-up)

`nvme-bind-claimed-mmio-read` stands up the userspace storage-provider bind
foundation over the existing DDF driver foundation. It binds the controller
read-only -- no register write, no DMA submission, and no doorbell -- and so
leaves the controller's existing (firmware-initialized) state untouched.

- **Enumerate → claim → BAR0 preseed**: `bind_qemu_nvme_controller`
  (`kernel/src/pci.rs`) runs for the first enumerated NVMe controller after the
  metadata summary. It preseeds the first decoded memory BAR (BAR0 controller
  registers) for brokered reads
  (`devicemmio_grant_source::preseed_read32_for_device`), then claims the
  function and parks it under `DeviceOwner::ManagerGrantSource`. On any staging
  failure it falls back to the `pci: nvme no-authority/no-driver` line
  (fail-closed): a partially-staged authority surface is never advertised.
- **Grant-source staging**: the same device-agnostic
  `{devicemmio,dmapool,interrupt}_grant_source::init_for_device` the virtio path
  uses stage the bootstrap grants against the claimed NVMe handle. The
  virtio-net-specific provider-notify/doorbell selected-write claim is **not**
  staged here — that is the controller-enable path (§6). run-pci-nvme boots with no
  virtio devices, so the singletons are free for NVMe; run-net /
  run-ddf-provider-consumer (no `-device nvme`) keep the virtio bind untouched.
- **Brokered register reads**: the `nvme-bringup-smoke` provider
  (`demos/nvme-bringup-smoke/`) holds the manifest-granted
  `console`/`dmapool`/`device_mmio`/`interrupt` caps and reads `CAP` (`0x0`,
  `0x4`), `VS` (`0x8`), `CC` (`0x14`), and `CSTS` (`0x1c`) through
  `DeviceMmio.read32` (the brokered boot-preseeded mapping in
  `device_manager::read_devicemmio_u32`). It proves the bound claim reaches a
  coherent NVMe BAR0 by requiring a live `CAP` (non-zero, non-floating) and a
  valid `VS` version (NVMe Base Spec §3.1.1/§3.1.2; QEMU reports 1.4.0), and
  reports the observed `CC.EN`/`CSTS.RDY` (§3.1.5/§3.1.6, bit 0 of each).
- **Firmware-initialized controller**: under QEMU's SeaBIOS BIOS boot, the
  NVMe boot-probe enables the controller (`CC.EN=1`, `CSTS.RDY=1`) before init
  runs, so the read-only bind observes a live controller. Bringing it to a
  known reset state (`CC.EN=0`, wait `CSTS.RDY=0`) before re-enabling with
  provider-owned admin queues is the controller-enable path's responsibility
  (§6), not this read-only bind.
- **Proof line**: the userspace `[nvme-bringup-smoke] controller-bind ok ...
  mmio_read=brokered controller_state=firmware-enabled` read proof, asserted by
  `tools/qemu-pci-nvme-smoke.sh`. The kernel bind line advanced from
  `controller_init=read-only-bind` to `controller_init=reset-capable-bind`
  when §5 added the `CC` selected-write claim to the same grant staging; the
  read proof itself is unchanged.

## 5. Userspace controller reset (selected-write `CC` claim)

`nvme-controller-reset-selected-write` is the first genuine userspace NVMe
controller-register *write*: it brings the firmware-enabled controller to a
known reset state. It does **not** enable the controller, program admin/IO
queue bases, submit DMA, or ring a doorbell -- controller enable publishes
admin queue-base addresses and is the validator-gated path in §6.

- **NVMe `CC` selected-write claim**: the NVMe bind stages the `DeviceMmio`
  grant region with a reset-only selected-write claim
  (`DeviceMmioWrite32ClaimProvider::NvmeControllerRegister`,
  `device_manager::nvme_controller_register_grant_region` →
  `proofs::nvme_controller_register_region`), scoped to the `CC` register
  (`0x14`, NVMe Base Spec §3.1.5). `bind_qemu_nvme_controller`
  (`kernel/src/pci.rs`) now calls
  `devicemmio_grant_source::init_nvme_controller_for_device` instead of the
  plain `init_for_device`, so the single granted cap carries both the brokered
  read surface and the `CC` write claim.
- **Value-flexible, scoped**: unlike the virtio variants (which pin an
  exact `(offset, value)` pair), the NVMe claim is offset-scoped and
  value-flexible only for reset in `validate_devicemmio_write32_claim`
  (`kernel/src/device_manager/qemu_full.rs`): it admits any `CC` write whose
  `CC.EN` (bit 0) is clear -- the read-modify-write reset -- directly, fails
  closed on raw `CC.EN=1` writes with
  `devicemmio-nvme-cc-enable-raw-blocked`, and fails closed on any write to a
  non-`CC` offset (`unclaimed-register-write`). Refused writes perform no
  MMIO.
- **Reset sequence**: the `nvme-bringup-smoke` provider reads `CC`, writes it
  back with `CC.EN` cleared through `DeviceMmio.write32` (the volatile MMIO
  write in `device_manager::write_devicemmio_u32`, resolved through the
  boot-preseeded BAR0 mapping), and polls `CSTS` (§3.1.6) until `CSTS.RDY`
  clears. QEMU clears `CSTS.RDY` synchronously on the `CC.EN=1→0` write
  (`nvme_ctrl_reset`).
- **No DMA validator involvement**: a reset write (`CC.EN` clear) publishes no
  queue-base or PRP addresses, so the Model B on-notify validator (§2/§3) is
  not invoked. The validator is invoked on the explicit brokered
  controller-enable queue-arm path (§6).
- **Proof lines** (asserted by `tools/qemu-pci-nvme-smoke.sh`):
  `pci: nvme userspace-bind ... controller_init=reset-capable-bind ...
  cc_selected_write=staged` (kernel),
  `[nvme-bringup-smoke] cc-raw-enable-refused ...`,
  `[nvme-bringup-smoke] non-cc-write-refused ...`, and
  `[nvme-bringup-smoke] controller-reset ok ... csts_rdy_before=1
  csts_rdy_after=0 cc_en_after=0 reset_write=performed ...` (userspace).

## 6. Brokered controller enable (no-IOMMU, manager-authored admin queues)

`nvme-no-iommu-brokered-controller-enable` enables the controller on the
no-IOMMU `make run-pci-nvme` gate without exporting a host physical (== the
device-visible address on the bounce shape) or raw IOVA to the provider. The
provider invokes the explicit no-parameter
`DeviceMmio.brokeredNvmeControllerEnable` verb (schema `@6`); the manager
authors every address-bearing register and the selected `CC` value from its
live DMA ledger. Raw `DeviceMmio.write32(CC, value with CC.EN=1)` fails closed
before any MMIO side effect. The earlier
provider-written Model B enable (`nvme-userspace-bind-and-controller-bringup`)
stays blocked: it would require the provider to author a device-visible
queue-base, which the reviewed `iova_export=disabled-future-only` discipline
forbids on this gate.

- **Brokered admin queue memory**: the provider allocates the admin submission
  and completion queue pages through the device's `DMAPool` authority
  (`DmaPool.allocateBuffer`), maps each read-write to fill it, and unmaps. By
  convention the manager reads the admin SQ from pool slot 0
  (`NVME_ADMIN_SQ_POOL_SLOT`) and the admin CQ from slot 1
  (`NVME_ADMIN_CQ_POOL_SLOT`). The pages stay live in the manager ledger;
  `DmaBuffer.info` continues to report `device_iova=0`,
  `iova_export=disabled-future-only`, `host_physical_user_visible=false`.
- **Manager-authored queue-base registers**: the @6 method dispatches through
  `nvme_brokered_controller_enable_op_for_cap`
  (`kernel/src/device_manager/qemu_full.rs`), which validates the cap against
  the live NVMe controller-register claim and then calls
  `nvme_brokered_admin_queue_enable`. It resolves the admin SQ/CQ pages
  (`record.attached_dmapools[..].proof_buffers[slot].page`), then authors
  `AQA` (`0x24`, zero-based admin queue sizes), `ASQ` (`0x28`/`0x2c`), and `ACQ`
  (`0x30`/`0x34`) from the ledger page physical addresses via
  `nvme_authored_register_write` (volatile writes resolved only through the
  boot-preseeded BAR0 mapping), and finally performs the manager-selected
  `CC.EN | IOSQES=6 | IOCQES=4` write. No provider-supplied controller bits or
  address-bearing value reaches the controller.
- **Validator on the queue-arm path**: before any register write the authored
  ASQ/ACQ bases are passed through the Model B on-notify DMA validator
  (`crate::cap::nvme_doorbell_validator::validate_doorbell_scan`,
  `ScanKind::QueueArm`). On this path the windows and the scanned items both
  derive from the *same* kernel ledger pages, so it is a self-consistency check
  on the kernel-authored bases: it proves page alignment and in-window
  containment of the named queue region (`entries * entry_size` -- the admin SQ
  128 B / CQ 32 B fit the 4 KiB page). The owner-identity, cross-owner-alias,
  host-physical, and stale-generation rejections are structurally unreachable
  here because both sides of each comparison come from the manager (the real
  authority gate against a stale/foreign page is the live-ledger membership
  check below); those hostile rejections are exercised by the bounded
  `cfg(qemu)` self-test (§3). A reject still fails closed before the `CC.EN`
  write.
- **Fail-closed before enable**: raw `write32(CC, CC.EN=1)` returns
  `devicemmio-nvme-cc-enable-raw-blocked` before any MMIO. The explicit
  manager-op's real authority gate is live-ledger membership -- an enable
  request with the admin queue pages unallocated, freed, or in-flight returns
  `nvme-admin-queues-not-armed`
  (`devicemmio-nvme-cc-enable-not-armed`) with no MMIO side effect, covering the
  out-of-order manager operation and the post-free stale re-enable. A validator reject
  returns `devicemmio-nvme-cc-enable-validator-reject`.
- **Teardown under live admin queues**: reset (`CC.EN=0`) quiesces the
  controller (`CSTS.RDY` clears) before the admin queue pages are reused;
  `DmaBuffer.freeBuffer` then scrubs each page before the frame is freed
  (`page_scrubbed_before_frame_free=true`), and a subsequent enable with the
  queue memory gone fails closed. The enable path submits no admin commands, so
  there are no live completions or waiters during teardown; the
  "stale/unvalidated completion does not wake a waiter" property
  (`completion_wakes_waiter`) is proven by the bounded `cfg(qemu)` self-test
  (§3), not by the live admin-queue teardown.
- **Proof lines** (asserted by `tools/qemu-pci-nvme-smoke.sh`):
  `[nvme-bringup-smoke] admin-queue-allocated ...` (userspace),
  `nvme: brokered-enable owner=nvme-storage trigger=manager-op
  admin_sq_slot=0 admin_cq_slot=1 validator=queue-arm scanned_items=2
  aqa=0x00070007 cc=0x00460001 asq_authored=true acq_authored=true
  cc_en_write=performed cc_bits_selected_by=manager
  queue_base_source=manager-ledger host_physical_user_visible=false ...`
  (kernel),
  `[nvme-bringup-smoke] controller-enable ok ... cc_en_after=1
  csts_rdy_after=1 ... brokered_enable_trigger=manager-op ...`,
  `[nvme-bringup-smoke] teardown-reset ok ...
  quiesced=true`, `[nvme-bringup-smoke] admin-queue-freed ...
  page_scrubbed_before_frame_free=true`, and
  `[nvme-bringup-smoke] stale-enable-refused ...
  brokered_enable_trigger=manager-op reason=nvme-admin-queues-not-armed`
  (userspace).
- **Not in scope (this path)**: I/O queues, read/write commands, cloud
  evidence, and host-physical/IOVA export are out of scope for the enable path.
  `hostile_hardware_isolation=not-claimed`; the brokered no-IOMMU enable is not
  hostile-hardware isolation. One brokered `IDENTIFY` admin command is in §7.

## 7. Brokered admin command + IDENTIFY (no-IOMMU)

`nvme-admin-queue-identify` extends the brokered no-IOMMU lane to one admin
command. After the §6 enable, the provider submits a single `IDENTIFY`
(controller) admin command and consumes its completion from its own mapped admin
CQ. As on the enable path, the manager authors every address-bearing field; the
provider supplies only the non-addressing command dwords and the doorbell index.
`nvme-admin-interrupt-delivery` then makes that completion interrupt-driven: the
provider unmasks the admin completion interrupt route and blocks on
`Interrupt.wait`, the kernel wakes the live waiter through the device-interrupt
dispatch path, and only then is the completion consumed from the mapped CQ.

- **Admin command (wire subset)**: a 64-byte submission queue entry (NVMe Base
  Spec §4.2). The provider writes opcode `0x06` (`IDENTIFY`, §5.17) at byte 0, a
  command id at bytes 2:3, `NSID=0`, and `CNS=0x01` (Identify Controller) in
  CDW10 at bytes 40:43 into the mapped admin SQ page. It leaves the
  address-bearing MPTR (bytes 16:23), PRP1 (bytes 24:31), and PRP2 (bytes 32:39)
  zero; the manager overwrites them. The IDENTIFY data structure is 4096 bytes,
  so a single page-aligned PRP1 covers it and PRP2 stays zero.
- **Doorbells**: the same `NvmeControllerRegister` claim covers the admin SQ
  tail doorbell (`0x1000`) and CQ head doorbell (`0x1004`) -- admin queue 0 with
  doorbell stride `CAP.DSTRD=0` (NVMe Base Spec §3.1.24/§3.1.25;
  `nvme_brokered_admin_sq_doorbell` re-reads `CAP.DSTRD` and fails closed on a
  non-zero stride). The doorbell value (the tail/head index) is not
  address-bearing; the manager bounds it to `<= NVME_ADMIN_QUEUE_DEPTH` and
  performs the write.
- **Manager-authored PRP on the submit path**: a write to the SQ tail doorbell
  is routed by `validate_devicemmio_write32_claim`
  (`NvmeBrokeredWriteOp::AdminSqTailDoorbell`) to
  `nvme_brokered_admin_sq_doorbell` (`kernel/src/device_manager/mod.rs`). It
  resolves the live admin SQ (slot 0), CQ (slot 1), and IDENTIFY data (slot 2,
  `NVME_ADMIN_DATA_POOL_SLOT`) ledger pages, authors the SQE's MPTR/PRP1/PRP2
  from the data page physical address through the SQ page's HHDM mapping
  (`nvme_author_admin_sqe_prp`), fences, then rings the doorbell via
  `nvme_authored_register_write`. No provider-supplied address reaches the
  controller.
- **Validator on the SQ-tail path**: before authoring the SQE the data-buffer
  PRP1 is passed through the Model B on-notify DMA validator
  (`validate_doorbell_scan`, `ScanKind::SqTailDoorbell`,
  `ScanItem::Prp { list_depth: 0 }`): page alignment and in-window containment of
  the 4 KiB data region against the data page's own device-visible window. A
  reject returns `devicemmio-nvme-admin-submit-validator-reject` with no doorbell
  write. As on the queue-arm path the window and scanned item are both
  manager-derived, so the hostile owner/host-physical/stale rejections are
  exercised by the `cfg(qemu)` self-test (§3); the live authority gate is
  live-ledger membership.
- **Interrupt-driven completion wake** (`nvme-admin-interrupt-delivery`): after
  ringing the SQ tail doorbell, the provider unmasks the admin completion
  interrupt route and blocks on `Interrupt.wait`. The route is the NVMe
  controller's bootstrap `Interrupt` grant
  (`DeviceOwner::ManagerGrantSource`, MSI-X table entry 0, role
  `GrantSource`); the kernel wakes the live waiter with a real LAPIC dispatch
  routed through the device-interrupt dispatch slot plus the deferred-EOI and
  waiter path -- the same grant-source delivery model proven by
  `make run-interrupt-grant` and used by the virtio-net provider. The wait
  returns `result=interrupt-delivered real_interrupt_delivery=delivered
  wake_blocked=false` with the route's dispatch `delivery_count` incremented.
  This is a kernel-injected dispatch at the route's programmed LAPIC vector, not
  a device-autonomous MSI-X raise: the NVMe MSI-X table is not yet hardware-
  programmed for an external write (`msix_table_programming=not-written`, as on
  the DDF interrupt-grant path), so device-raised MSI-X delivery and MSI-X table
  programming remain a documented next increment.
- **Completion consume**: only after the interrupt-driven wake does the provider
  read completion queue entry 0 in its mapped admin CQ page (NVMe Base Spec
  §4.6): a 16-byte entry whose DW3 (bytes 12:15) carries the command id (bits
  15:0), the phase tag (bit 16), and the status field (bits 31:17). It checks the
  status field is success and the command id matches, confirms the controller
  DMA-wrote the data structure (non-zero PCI Vendor ID at IDENTIFY byte 0; QEMU's
  `nvme` reports `0x1b36`), then advances the CQ head doorbell. The completion is
  thus consumed after an interrupt-driven wake, with the mapped-CQ read as the
  consume step.
- **Stale/post-reset no-wake**: after the IDENTIFY completes, a second live
  `Interrupt.wait` waiter is installed on the (driver-unmasked) route, observed
  to stay pending, and the route is then masked; the live waiter completes
  `result=interrupt-waiter-cancelled reason=route-masked` rather than woken
  (`masked_live_waiter_woke=false`). At the kernel layer the
  stale/unvalidated/retired-generation completion no-wake invariant
  (`completion_wakes_waiter`) remains proven by the `cfg(qemu)` self-test (§3,
  `waiter_wake=none`).
- **Fail-closed before submit**: a SQ tail doorbell with the admin SQ/CQ or data
  page unallocated, freed, or in-flight returns `nvme-admin-command-not-armed`
  (`devicemmio-nvme-admin-submit-not-armed`) with no MMIO side effect; an
  out-of-range doorbell index returns
  `devicemmio-nvme-admin-doorbell-out-of-range`. Teardown frees the data page
  first and proves the post-free re-submit fails closed.
- **Proof lines** (asserted by `tools/qemu-pci-nvme-smoke.sh`):
  `nvme: admin-submit owner=nvme-storage admin_sq_slot=0 admin_data_slot=2
  validator=sq-tail-doorbell scanned_items=1 command=identify-controller
  prp1_authored=true ... doorbell_written=performed
  host_physical_user_visible=false` (kernel),
  `[nvme-bringup-smoke] admin-interrupt-route-unmasked ...
  route_state_after=driver-unmasked`,
  `[nvme-bringup-smoke] admin-interrupt-wake result=interrupt-delivered
  real_interrupt_delivery=delivered wake_blocked=false ...
  interrupt_driven_wake=delivered` (userspace, the interrupt-driven wake),
  `[nvme-bringup-smoke] identify-complete ok command=identify-controller
  cid=0x0042 status=0x0000 phase=1 ... identify_vid=0x... completion_consumed=
  mapped-admin-cq-after-interrupt-wake ...` (userspace),
  `nvme: admin-complete-ack ... cq_head=1 ... address_bearing=false`,
  `[nvme-bringup-smoke] identify-cq-head-advanced ...`,
  `[nvme-bringup-smoke] admin-interrupt-stale-no-wake ...
  result=interrupt-waiter-cancelled ... masked_live_waiter_woke=false`,
  and `[nvme-bringup-smoke] stale-submit-refused ...
  reason=nvme-admin-command-not-armed` (userspace).
- **Not in scope (this path)**: I/O queue pairs, read/write commands, and the
  remaining out-of-scope items below are covered in §8.

## 8. Brokered I/O queue pair + bounded READ (no-IOMMU)

`nvme-io-queue-and-read` extends the brokered no-IOMMU lane to one I/O queue
pair and one bounded read -- the last piece of the userspace NVMe
storage-provider foundation. After the §7 IDENTIFY, the provider creates one
I/O queue pair (queue id 1) through admin commands, then issues one `READ` on
it. As on every brokered path the manager authors each command's
address-bearing PRP1 from a live ledger page; the provider supplies only the
non-addressing dwords and the doorbell index.

- **I/O queue entry sizes**: the re-enable (§6) must program `CC.IOSQES`
  (bits 19:16, log2 of the 64 B SQ entry = 6) and `CC.IOCQES` (bits 23:20,
  log2 of the 16 B CQ entry = 4) before any I/O queue is created (NVMe Base
  Spec §3.1.5); a `CC.EN` 1->0 reset clears all of `CC`, so the provider sets
  them explicitly (resulting `CC = 0x00460001`). Creating an I/O queue with
  `IOCQES`/`IOSQES` unset is refused by the controller (QEMU returns
  command-specific *Invalid Queue Size*).
- **Create I/O queue commands (wire subset)**: `CREATE I/O COMPLETION QUEUE`
  (opcode `0x05`, NVMe Base Spec §5.3) and `CREATE I/O SUBMISSION QUEUE`
  (opcode `0x01`, §5.4) are admin commands submitted on the admin SQ. CDW10
  carries the zero-based queue size (bits 31:16) and queue id (bits 15:0); the
  create-CQ CDW11 sets PC=1 with IEN=0 (no I/O interrupt; the completion is
  polled), the create-SQ CDW11 sets the completion queue id (bits 31:16) and
  PC=1. PRP1 (the queue base) is left zero by the provider and authored by the
  manager from the I/O CQ (slot 3) / I/O SQ (slot 4) ledger page.
- **Opcode-directed PRP authoring**: a write to the admin SQ tail doorbell is
  routed to `nvme_brokered_admin_sq_doorbell`
  (`kernel/src/device_manager/mod.rs`), which reads the opcode the provider
  wrote into the just-published SQE (at SQ index `tail - 1`) and maps it to the
  live ledger page whose device-visible address is that command's PRP1:
  `IDENTIFY` -> IDENTIFY data (slot 2), `CREATE I/O CQ` -> I/O CQ (slot 3),
  `CREATE I/O SQ` -> I/O SQ (slot 4). An unrecognized opcode fails closed
  (`devicemmio-nvme-admin-submit-unknown-opcode`). The opcode is the only
  provider-supplied input consulted and is non-addressing.
- **READ command (wire subset)**: opcode `0x02` (NVM Command Set §3.x), NSID 1,
  starting LBA 0 in CDW10/11, NLB 0 (zero-based, one block) in CDW12. PRP1 (the
  data buffer) is authored by the manager from the read data page (slot 5).
- **I/O doorbells**: the same `NvmeControllerRegister` claim covers the I/O SQ
  tail doorbell (`0x1008`) and I/O CQ head doorbell (`0x100c`) -- queue id 1
  with doorbell stride `CAP.DSTRD=0` (SQ `y` tail at `0x1000 + (2y)*4`, CQ `y`
  head at `0x1000 + (2y+1)*4`). The I/O SQ tail doorbell is routed to
  `nvme_brokered_io_sq_doorbell`, which requires the opcode to be `READ`,
  materializes the READ PRP1 from the live read data page, validates it through
  the Model B on-notify validator (`ScanKind::SqTailDoorbell`), authors the SQE
  PRP, and rings the doorbell. The I/O CQ head doorbell
  (`nvme_brokered_io_cq_head_doorbell`) carries only the consumed-entry head
  index (no address-bearing field).
- **Completion consume**: the create-queue and READ completions are consumed by
  polling the mapped CQ phase tags. The single kernel grant-source injected
  interrupt delivery is spent on the §7 admin IDENTIFY wake
  (`delivery_count_before == 0` gates injection to one delivery per route), so
  an interrupt-driven I/O completion wake awaits the device-autonomous MSI-X
  table-programming increment that §7 already defers. The provider confirms the
  controller DMA-transferred real data by checking the harness-seeded LBA 0
  signature (`0x4f504143` = "CAPO") in its mapped read data page -- proving the
  read moved bytes through the brokered PRP, not a zero page.
- **Fail-closed before submit**: an I/O SQ tail doorbell with the I/O CQ/SQ or
  read data page unallocated, freed, or in-flight returns
  `nvme-io-command-not-armed` (`devicemmio-nvme-io-submit-not-armed`) with no
  MMIO side effect; an out-of-range index returns
  `devicemmio-nvme-io-doorbell-out-of-range`; an opcode other than `READ`
  returns `devicemmio-nvme-io-submit-unknown-opcode`. Teardown frees the read
  data page first and proves the post-free I/O re-submit fails closed.
- **Proof lines** (asserted by `tools/qemu-pci-nvme-smoke.sh`):
  `nvme: admin-submit ... command=create-io-cq ... admin_data_slot=3 sq_tail=2`,
  `[nvme-bringup-smoke] create-io-cq-complete ok ...`,
  `nvme: admin-submit ... command=create-io-sq ... admin_data_slot=4 sq_tail=3`,
  `[nvme-bringup-smoke] create-io-sq-complete ok ...`,
  `nvme: io-submit owner=nvme-storage io_queue_id=1 io_sq_slot=4
  io_read_data_slot=5 ... command=read ... io_sq_tail=1 doorbell_offset=0x1008
  doorbell_written=performed host_physical_user_visible=false` (kernel),
  `[nvme-bringup-smoke] io-read-complete ok command=read cid=0x0053
  status=0x0000 ... io_read_dword0=0x4f504143 ...
  completion_consumed=mapped-io-cq-polled` (userspace, the read data proof),
  `nvme: io-complete-ack ... io_cq_slot=3 cq_head=1 ... address_bearing=false`,
  and `[nvme-bringup-smoke] io-stale-submit-refused ...
  reason=nvme-io-command-not-armed` (teardown).
- **Not in scope**: device-autonomous MSI-X delivery (hardware MSI-X table
  programming, a device-raised I/O completion interrupt, and an interrupt-driven
  I/O completion wake), multi-block / write / scatter-gather (PRP-list) I/O,
  cloud (GCP/AWS/Azure) enumeration or evidence, and host-physical/IOVA export
  remain out of scope. `hostile_hardware_isolation=not-claimed`.

## 9. Production-path cloudboot proofs (non-`qemu` cloud kernel)

This section covers the non-`qemu` cloudboot kernel proofs. The older
`cloud-prod-storage-bound-local-proof` storage-bind path predates the later
production-stub NVMe manager operations: it binds
`DeviceMmio`/`DMAPool`/`Interrupt` surfaces to one NVMe function and
exercises an interrupt-dispatch proxy, but does not attempt controller enable,
admin commands, I/O queues, `IDENTIFY`, `READ`, or a userspace storage provider.

- **Older storage-bind proxy**: `cap::storage_bind_proof::report`
  (`kernel/src/cap/storage_bind_proof.rs`) runs under
  `#[cfg(not(feature = "qemu"))]` during `kernel::run_init`
  (`kernel/src/main.rs`). It selects an NVMe function with
  `PciDevice::is_nvme_controller` (`kernel/src/pci.rs`), stages a readback
  `DeviceMmio` record through `device_manager::stage_bar_readback_region`, a
  parked bounce `DMAPool`/`DMABuffer` through
  `device_manager::stage_bounce_buffer_dmapool_record` and
  `device_manager::issue_manager_attached_dmabuffer_handle_with_request`, and
  one MSI-X interrupt route through `device_interrupt` plus mask-first PCI MSI-X
  table programming. The I/O-completion evidence is a kernel-side proxy:
  `device_interrupt::handle_lapic_delivery` advances the live dispatch slot,
  deferred EOI is acknowledged, masked no-wake is checked, and teardown proves
  stale route/pool/buffer/MMIO handles fail closed. Its marker is
  `cloudboot-evidence: storage-bound <token>`, with summary fields such as
  `nvme_admin_identify=not-attempted`, `nvme_read_command=not-attempted`, and
  `waiter_wake=kernel-side-proxy`.
- **Later production-stub manager ops**: the non-`qemu` cloudboot kernel also
  implements real production-stub NVMe operations for the same local
  QEMU/cloudboot lane. The read-only bind, reset-only `CC.EN=0` selected-write
  claim, parked
  admin SQ/CQ/data `DMABuffer` materialization, brokered controller-enable
  manager operation (`DeviceMmio.brokeredNvmeControllerEnable @6`), and brokered
  admin `IDENTIFY Controller` manager operation
  (`DeviceMmio.brokeredNvmeAdminIdentify @7`) live in
  `kernel/src/device_manager/stub.rs` and the production grant-source modules.
  These operations are not the storage-bind proxy: the manager authors
  `AQA`/`ASQ`/`ACQ`, `CC.EN=1`, the fixed IDENTIFY SQE, PRP1, SQ tail doorbell,
  CQ polling, and CQ head doorbell from its parked ledger. The provider still
  supplies no host-physical address, IOVA, queue base, PRP/SGL, opcode,
  command id, doorbell offset, or doorbell value.
- **Local production provider chain**: the moved
  `cloud-prod-nvme-brokered-userspace-provider-local-proof` parent is closed by
  production-stub child records over the non-`qemu` cloudboot kernel. The local
  QEMU/cloudboot chain reaches split admin completion (`@8` / `@9` plus
  `Interrupt.wait` / `Interrupt.acknowledge`), I/O queue creation (`@10` /
  `@11`), bounded READ/WRITE (`@12`-`@15`), second-LBA/multiblock I/O
  (`@16`-`@19`), synchronous read/write and read-bytes (`@20`-`@22`),
  `BlockDevice.readBlocks` / `writeBlocks` / FLUSH, higher-level filesystem and
  Store consumers, a dedicated data-path completion `Interrupt` route, and
  multi-PRP `BlockDevice` windows. These are manager-authored brokered
  operations in `kernel/src/device_manager/stub.rs` and the production proof
  modules, not provider-authored Model B doorbell writes.
- **READ-arm graduation to always-built production** (`cloud-prod-nvme-storage-graduate-readarm-local-proof`):
  the NVMe `BlockDevice` READ arm is the first capstone piece graduated OUT of the
  per-proof `cloud_nvme_*_proof` features into always-built production code. The
  `BlockDeviceBackend::NvmeBrokered` arm and its arbitrary-window `readBlocks @0`
  body (`kernel/src/cap/block_device.rs`, `cfg(not(qemu))`), the shared read body
  `nvme_brokered_io_sync_command` / `nvme_brokered_io_sync_read_window_op_for_cap`
  and the brokered controller bring-up registers/helpers it reaches
  (`kernel/src/device_manager/stub.rs`), and `live_handle_for_nvme_blockdevice`
  now compile in the default no-proof `cargo build` / `make capos-cloudboot-image`
  kernel -- the GCE-validated production composition. ACTIVATION is fronted by a
  fail-closed runtime capability probe `kernel/src/nvme_storage_backend.rs`
  (`dma_backend.rs`-style atomic verdict + `select_nvme_blockdevice_handle()`
  resolver): the cap is minted only when a staged `device_mmio` grant resolves a
  live brokered-controller handle (recording the verdict), else a typed error is
  returned -- never a panic. The no-NVMe default boot leaves the probe unverified,
  so the `block_device` grant fails closed. `writeBlocks` / `flush` stay
  fail-closed on the graduated arm (named follow-up graduations). The graduated
  data plane is bounded, not a general-purpose driver: every command runs
  through the synchronous single-call seam
  (`kernel/src/device_manager/nvme_sync_io_state.rs`), which admits at most 64
  single-call I/O commands per boot (`MAX_SYNC_OPS`) and permanently rejects
  further commands at the first I/O CQ wrap (no CQ phase-toggle handling) --
  both limits fail closed. Namespace geometry is IDENTIFY-derived, not assumed:
  after the fixed three-command bring-up sequence completes, the first geometry
  consultation issues one manager-authored IDENTIFY Namespace (CNS `0x00`,
  NSID 1, admin SQ index 3 / tail 4, NVMe Base Spec §5.17) through
  `nvme_namespace_geometry_for_cap` (`kernel/src/device_manager/stub.rs`),
  parses NSZE plus the active LBA format (FLBAS + LBAF `LBADS`/`MS`), caches the
  verdict for the boot, and emits `nvme: brokered-identify-namespace ...
  nsze=... flbas=... lbads=... supported=...`. `BlockDevice.info @2` and the
  `readonly_fs`/`persistent_store`/`writable_fs` NVMe `BlockSource::info`
  report this IDENTIFY-derived geometry, and every read/write window bound is
  enforced against it; while the claim is unavailable (bring-up incomplete, a
  failed claim reset the controller, or an unsupported format -- anything other
  than 512 B data blocks with no interleaved metadata) those paths fail closed
  instead of falling back to a fixture constant. Proof:
  `make run-cloud-provider-nvme-blockdevice-read-graduated` emits
  `cloudboot-evidence: provider-nvme-blockdevice-read-graduated <token>`
  (`read_arm=always-built data_plane_feature_gated=false probe_verdict=verified
  nvme_read_roundtrip_match=true`). This is a local QEMU/cloudboot proof; it does
  NOT claim a live cloud NVMe run, direct DMA, IOVA export, or a write/durability
  graduation.
- **Production boundary**: one production-stub NVMe path now has live GCE
  Persistent Disk evidence: `provider-nvme-io-read` completed one brokered
  512-byte READ on run `1780806087-bf69`. Other production-stub NVMe proofs
  remain local QEMU/cloudboot evidence unless their task record explicitly says
  otherwise. The current evidence still does not claim direct DMA, cloud/guest
  IOMMU support, provider-visible device addresses, device-autonomous MSI-X
  delivery, AWS/Azure storage, a reusable storage provider, or full filesystem
  integration. The NVMe `BlockDevice` READ data plane is graduated to
  always-built production (above); other write/FLUSH/filesystem consumers and
  broader windows have local proof coverage but remain bounded by their recorded
  production proof gates unless their task record explicitly says the surface
  was graduated.