Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

NVMe (NVM Express controller)

This is a provenance map for the NVMe controller wire subset the kernel Model B on-notify DMA validator scans on the doorbell/queue-arm path. It cites the spec basis, summarizes only the register and descriptor fields the validator actually reads, and points into the implementation by symbol name. It is not a re-spec.

Maturity caveat. This page documents the DMA validator mechanism, a brokered no-IOMMU bring-up through one bounded I/O read on the local QEMU make run-pci-nvme gate, and one bounded live-GCE Persistent Disk proof for the production provider-nvme-io-read path. It is still not a general production NVMe driver, not broad GCP/AWS/Azure storage readiness, and not a provider-visible address or direct-DMA claim. It also records the 2026-05-27 correction: on the current no-IOMMU gate, provider-written queue-base or PRP addresses would be host physical addresses, so the live no-IOMMU path must be brokered by the kernel/device manager unless a verified IOMMU/vIOMMU or synthetic address namespace is added. The capabilities implemented against make run-pci-nvme and the later production cloudboot gates:

  • nvme-doorbell-dma-validator is the kernel on-notify scan (kernel/src/cap/nvme_doorbell_validator.rs); it proves its invariants with a cfg(qemu) self-test (prove_qemu_on_notify_scan_contract) using synthetic owner windows in place of a live grant ledger.

  • nvme-bind-claimed-mmio-read adds the read-only userspace bind (§4): the kernel claims the enumerated controller, preseeds its BAR0 controller-register page, and stages the DMAPool/DeviceMmio/Interrupt bootstrap grant sources against it, and the userspace nvme-bringup-smoke provider reads CAP/VS/CC/CSTS through the brokered claim, proving the claim reaches a coherent NVMe BAR0 (live CAP, valid VS version). The controller is firmware-initialized under SeaBIOS NVMe boot-probe (CC.EN=1, CSTS.RDY=1), so the provider reports the observed enable/ready state rather than asserting reset.

  • nvme-controller-reset-selected-write adds the userspace controller reset (§5): the DeviceMmio grant now carries a reset-only NVMe controller-register selected-write claim scoped to CC, and the provider drives the firmware-enabled controller to a known reset state (CC.EN=0CSTS.RDY=0). This is the first genuine userspace NVMe controller-register write.

  • nvme-no-iommu-brokered-controller-enable adds the brokered no-IOMMU enable (§6): the kernel authors AQA/ASQ/ACQ from the live DMA ledger and performs the CC.EN write, reaching CSTS.RDY=1 without exposing a queue-base address.

  • nvme-admin-queue-identify + nvme-admin-interrupt-delivery add the brokered admin SQ/CQ doorbell and one interrupt-driven IDENTIFY (§7).

  • nvme-io-queue-and-read adds the brokered I/O queue pair and one bounded READ (§8) – the last piece of the userspace storage-provider foundation.

  • cloud-prod-nvme-userspace-provider-readonly-bind-local-proof ports the same-BDF read-only bind shape onto the non-qemu cloudboot kernel under the cloud_nvme_readonly_bind_proof Cargo feature. The feature constrains the three production grant sources (devicemmio_grant_source_prod, dmapool_grant_source_prod, interrupt_grant_source_prod) to the NVMe function (class 0x01 subclass 0x08); the userspace cloud-nvme-readonly-bind-smoke provider receives a same-BDF DeviceMmio/DMAPool/Interrupt bundle, reads CAP_LO/CAP_HI/VS/CC/ CSTS via brokered DeviceMmio.read32, releases the three caps, and asserts stale-handle rejection on each. Proof: make run-cloud-provider-nvme-readonly-bind. No CC.EN write, no admin or I/O queue, no IDENTIFY, no Interrupt.wait, no DMA, no live cloud.

  • cloud-prod-nvme-controller-reset-selected-write-local-proof layers the reset-only CC selected-write authority on the same-BDF bundle under the cloud_nvme_controller_reset_proof Cargo feature (which implies cloud_nvme_readonly_bind_proof, so the picker constraints are inherited). The kernel admits exactly one brokered DeviceMmio.write32 shape through kernel::device_manager::stub::write_devicemmio_u32: a CC write (offset 0x14) whose CC.EN (bit 0) is cleared. A CC write that sets CC.EN fails closed with devicemmio-nvme-cc-enable-deferred, a write to any non-CC offset fails closed with devicemmio-write32-register-unclaimed, and an out-of-range or unaligned offset fails closed at the range validator, all before any volatile MMIO write touches the BAR. The userspace cloud-nvme-controller-reset-smoke provider receives the same-BDF bundle, reads CAP_LO/CAP_HI/VS/CC/CSTS, exercises the two fail-closed write probes (CC.EN=1, non-CC offset 0x18), performs the admitted CC.EN=0 reset write, polls CSTS until CSTS.RDY=0, re-reads CC to assert CC.EN=0, releases the three caps, and confirms stale-handle rejection. Proof: make run-cloud-provider-nvme-controller-reset. No CC.EN=1 write, no admin or I/O queue, no IDENTIFY, no Interrupt.wait, no DMA, no live cloud.

  • cloud-prod-nvme-admin-queue-materialization-local-proof materializes the admin SQ and admin CQ backing buffers on the same-BDF bundle under the cloud_nvme_admin_queue_materialization_proof Cargo feature (which implies cloud_nvme_controller_reset_proof, which implies cloud_nvme_readonly_bind_proof, so the picker constraints and the reset-only CC.EN=0 claim are inherited). No new kernel admission surface is added: the production device_manager::stub already supports manager-owned bounce-buffer allocation through stage_bounce_buffer_dmapool_record + issue_manager_attached_dmabuffer_handle_with_request (a fresh zeroed-on-alloc kernel frame per buffer), scrub-before-frame-free on detach_dmabuffer_record_for_cap_release, and stale-handle rejection on the parked-slot ledger. The userspace cloud-nvme-admin-queue-materialization-smoke provider receives the same-BDF bundle, sequentially materializes the admin SQ backing buffer and the admin CQ backing buffer through the brokered DMAPool.allocateBuffer + DMABuffer.{info,map,unmap,freeBuffer} path (asserting userspace_dma_buffer=manager-issued-bounce-buffer, iova_export=disabled-future-only, host_physical_user_visible=false, and device_iova=0 on each), writes and reads back a deterministic 256-byte template through the userspace VMA, asserts the freshly-allocated admin CQ frame reads back as zero before the write (scrub-before-reuse, paired with the admin SQ’s scrub-before-frame-free), confirms post-free DMABuffer.map fail-closed on the stale handle, emits one cloudboot-evidence: provider-nvme-admin-queue-materialization <token> marker recording both manager-owned pool/buffer slot/generation identities and the discipline labels, releases the three bundle caps, and confirms stale-handle rejection on each. Proof: make run-cloud-provider-nvme-admin-queue-materialization. No NVMe controller register WRITE on this path (the kernel still admits the reset-only CC.EN=0 claim from the controller-reset sibling, but this smoke never calls DeviceMmio.write32), no AQA/ASQ/ACQ publication, no CC.EN=1, no I/O queue allocation, no IDENTIFY, no PRP/SGL publication, no doorbell write, no Interrupt.wait/ Interrupt.acknowledge, no host-physical or IOVA export, no live cloud.

  • cloud-prod-nvme-brokered-controller-enable-local-proof enables the controller through manager-authored AQA/ASQ/ACQ plus a provider-supplied CC.EN=1 write under the cloud_nvme_controller_enable_proof Cargo feature (which implies the three earlier features). The production device_manager::stub parked-pool slot holds two simultaneously-live bounce-buffer DMABuffers (PARKED_DMAPOOL_LIVE_BUFFER_CAPACITY = 2, PARKED_DMABUFFER_SLOTS = [1, 2]) so the admin SQ and admin CQ can stay parked together; the bounce-buffer grant proof and the virtio-net live-publish proof keep their existing single-buffer behavior (slot 0). The provider-supplied CC.EN=1 write of this path is superseded by cloud-prod-nvme-controller-enable-manager-op-remediation below, which makes raw DeviceMmio.write32(CC, value with CC.EN=1) fail closed before any MMIO side effect and exposes controller enable only through the no-parameter DeviceMmio.brokeredNvmeControllerEnable verb (schema @6). The parked-pool slot capacity, the [1, 2] slot ids, the AQA depth policy, and the four MMIO writes the manager authors all carry over unchanged.

  • cloud-prod-nvme-controller-enable-manager-op-remediation corrects the brokered enable contract. Raw DeviceMmio.write32(CC, value with CC.EN=1) now fails closed with authority_result=devicemmio-nvme-cc-enable-raw-blocked / authority_reason=cc-enable-requires-broker-nvme-controller-enable-op before any volatile MMIO side effect. Controller enable is reachable only through the new no-parameter DeviceMmio.brokeredNvmeControllerEnable verb (schema @6), which carries no offset, value, queue address, queue id, PRP/SGL, or provider-selected controller-bit parameter. The verb routes to the renamed manager-authored nvme_brokered_controller_enable_op_for_cap in kernel/src/device_manager/stub.rs. The manager: (1) validates the cap’s BAR matches the parked region and covers the CC/AQA/ASQ/ACQ register span; (2) resolves the two parked admin queue DMABuffers (slot order: SQ then CQ) and requires both to be live, unmapped, and frame-aligned; (3) selects every controller bit internally – CC.EN | IOSQES=6 | IOCQES=4 (NVMe Base Spec §3.1.5); (4) authors AQA = ((depth-1)<<16) | (depth-1) with depth 8, ASQ low/high from the admin SQ buffer’s phys address, ACQ low/high from the admin CQ buffer’s phys address through the boot-preseeded BAR0 kernel mapping; and (5) performs the manager-selected CC.EN=1 write. The provider supplies no parameters and never observes a host-physical / device-visible queue-base address. The cap dispatch admission carries authority_result=ok, register_write=performed, side_effect=mmio-write-performed, cc_en_write_performed=true, aqa_authored=true, asq_authored=true, acq_authored=true, and queue_base_source=manager-ledger. The kernel diagnostic line is now nvme: brokered-enable owner=cloud-nvme model=cloud-bounce validator=none trigger=manager-op admin_sq_slot=1 admin_cq_slot=2 aqa=0x00070007 cc=0x... asq_authored=true acq_authored=true cc_en_write=performed cc_bits_selected_by=manager queue_base_source=manager-ledger host_physical_user_visible=false proof_result=ok; trigger=manager-op proves the admission entered through @6, not through a raw CC write32. The cloud-nvme-controller-enable-smoke provider proves both the new fail-closed raw CC.EN=1 probe and the manager-op enable, and the headline cloudboot-evidence: provider-nvme-controller-enable <token> marker pins brokered_enable_trigger=manager-op and cc_raw_enable_write=refused. Proof: make run-cloud-provider-nvme-controller-enable. No IDENTIFY, admin or I/O queue command, PRP/SGL publication, doorbell write, Interrupt.wait/Interrupt.acknowledge, host-physical or IOVA export, or live cloud is claimed.

  • cloud-prod-nvme-admin-identify-manager-op-local-proof extends the corrected controller-enable surface with one explicit manager-owned admin-command operation: DeviceMmio.brokeredNvmeAdminIdentify (schema @7). The verb carries no parameters; the cap holder may not supply queue addresses, opcode, command id, NSID, PRP/SGL entries, data-buffer address, doorbell offset, or doorbell value. The production device_manager::stub parked-pool slot capacity was extended from two to three simultaneously-live bounce-buffer DMABuffers (PARKED_DMAPOOL_LIVE_BUFFER_CAPACITY = 3, PARKED_DMABUFFER_SLOTS = [1, 2, 3]) so the admin SQ (slot 1), admin CQ (slot 2), and IDENTIFY data page (slot 3) can stay parked together; the controller-enable sibling, the bounce-buffer grant proof, and the virtio-net live-publish proof keep their existing single- or dual-buffer behavior unchanged. The production grant source’s kernel-mapped BAR window was correspondingly widened from one to two pages (MAPPED_WINDOW_BYTES = 0x2000 under cloud_nvme_admin_identify_proof) so the admin SQ tail (0x1000) and admin CQ head (0x1004) doorbells fall inside the boot-preseeded mapping the manager already uses for CC/AQA/ASQ/ACQ – raw write32 to either doorbell offset still fails closed at the device-manager boundary as devicemmio-write32-register-unclaimed (the offset is outside the reset-only CC selected-write claim), and the brokered admin IDENTIFY verb is the only path that may ring them. The handler nvme_brokered_admin_identify_op_for_cap in kernel/src/device_manager/stub.rs: (1) validates the cap’s BAR matches the parked region and covers both doorbell offsets and the CSTS register; (2) resolves the three parked admin DMABuffers and requires all three to be live, unmapped, and frame-aligned; (3) re-reads CSTS through the boot-preseeded BAR mapping and refuses if CSTS.RDY=0; (4) authors the full submission queue entry at admin SQ index 0 through the HHDM kernel mapping of the SQ page – opcode IDENTIFY (0x06, NVMe Base Spec §5.17), command id 1, NSID 0, MPTR 0, PRP1 = data-page physical address (sourced from the manager’s parked-pool ledger), PRP2 0, CDW10 CNS 0x01 (Controller); (5) issues a SeqCst fence and rings the admin SQ tail doorbell at BAR0 offset 0x1000; (6) polls the admin CQ entry at index 0 through the HHDM kernel mapping of the CQ page for the phase-bit flip (NVMe Base Spec §4.6 CQE DW3 bit 16); (7) inspects the CQE status field (bits 30:17 of DW3) and command-id echo, refusing on either mismatch; (8) parses IDENTIFY Controller VID (offset 0, 2 bytes) and SSVID (offset 2, 2 bytes) through the HHDM kernel mapping of the data page; (9) advances the admin CQ head doorbell at BAR0 offset 0x1004. The provider sees only bounded-status labels, the manager-selected CNS/opcode/command-id echoes, the three parked-slot identities, the parsed VID/SSVID, and the doorbell side-effect labels. The cap-side dispatch admission carries authority_result=ok, result=ok, register_write=performed, side_effect=mmio-write-performed, sq_doorbell_written=true, cq_doorbell_written=true, completion_consumed=true, cq_status=0x0000, prp_source=manager-ledger, and host_physical_user_visible=false. The kernel diagnostic is nvme: brokered-admin-identify owner=cloud-nvme model=cloud-bounce trigger=manager-op admin_sq_slot=1 admin_cq_slot=2 admin_data_slot=3 cns=0x01 opcode=0x06 command_id=0x0001 ... cqe_status=0x0000 cqe_command_id=0x0001 sq_tail=1 cq_head=1 cq_phase=1 identify_vid=0x1b36 identify_ssvid=0x1af4 sq_doorbell_written=performed cq_doorbell_written=performed completion_consumed=true prp_source=manager-ledger host_physical_user_visible=false proof_result=ok (QEMU’s nvme device reports PCI VID 0x1b36 and SSVID 0x1af4, which the harness pins). The cloud-nvme-admin-identify-smoke provider exercises the inherited fail-closed raw-write claims (six in total: AQA/ASQ/ACQ + raw CC.EN=1 + raw admin SQ tail/CQ head doorbells), invokes the controller-enable verb at @6, invokes the admin IDENTIFY verb at @7, and emits one headline cloudboot-evidence: provider-nvme-admin-identify <token> marker plus three supplementary [cloud-nvme-admin-identify-smoke] discipline-* lines that re-anchor the contract within the per-call Console.writeLine bound. Proof: make run-cloud-provider-nvme-admin-identify. Future work (not yet implemented): I/O queue creation, READ/WRITE, Interrupt.wait/Interrupt.acknowledge admin-completion handoff, device-autonomous MSI-X delivery, host-physical/IOVA export, provider-authored SQE/PRP/SGL bytes, provider-authored doorbell offsets/values, and live cloud traffic.

  • cloud-prod-nvme-admin-completion-wait-ack-local-proof moves the admin IDENTIFY completion handoff off manager-internal CQ polling onto the production Interrupt.wait / Interrupt.acknowledge path. The admission-check-only production Interrupt grant (interrupt_grant_source_prod, wait/acknowledge=admission-check-only) is replaced by a fully-programmed cap-waiter MSI-X route on the same NVMe BDF (kernel/src/cap/nvme_admin_completion_wait_ack_proof.rs, table entry 0): its init registers + claims the route under ManagerGrantSource, programs the MSI-X table entry mask-first with the kernel-authored (message_address, message_data), attaches it to the device manager, arms the deferred-LAPIC-EOI gate, and unmasks. The admin IDENTIFY is split into two manager-owned verbs: DeviceMmio.brokeredNvmeAdminSubmit (schema @8) authors the SQE and rings the admin SQ tail doorbell (no CQ consumed, completion_consumed=false), and DeviceMmio.brokeredNvmeAdminComplete (schema @9) polls/consumes the admin CQ (the manager-owned CQ status/CID check is preserved), parses VID/SSVID, and advances the admin CQ head doorbell. Both reuse the shared NvmeBrokeredAdminOpResult schema struct. The handoff state machine is ordered and one-shot: brokeredNvmeAdminSubmit (@8) records the exact live admin SQ/CQ/data slots and generations; Interrupt.wait is admitted once for that submitted state, revalidates those live DMA records, and consumes the wait phase; brokeredNvmeAdminComplete (@9) is admitted only after the wait phase; and Interrupt.acknowledge is admitted once only after the completion phase has been recorded. Hostile complete-before-wait, ack-before-complete, repeat wait, repeat complete, repeat ack, and submit-then-DMABuffer.freeBuffer attempts fail closed before injecting extra dispatch, retiring an extra EOI, or freeing/scrubbing the manager-owned admin pages. Between the two verbs the provider calls Interrupt.wait – which injects exactly one bounded, non-autonomous device_interrupt::handle_lapic_delivery dispatch on the bound route (result=nvme-admin-completion-wait-ack-dispatch-consumed, real_interrupt_delivery=kernel-injected-dispatch, delivery count +1, one deferred LAPIC EOI armed) – then after the completion verb calls Interrupt.acknowledge to retire exactly that deferred EOI (hardware_dispatch_ack_delta=1). The chain is: read-only bind -> reset-only CC.EN=0 -> manager-owned admin buffer materialization -> brokeredNvmeControllerEnable -> brokeredNvmeAdminSubmit (@8) -> admin SQ tail doorbell -> Interrupt.wait wake -> brokeredNvmeAdminComplete (@9) -> admin CQ completion consumed -> admin CQ head doorbell advanced -> Interrupt.acknowledge deferred EOI retired. On Interrupt cap release the kernel requires exactly one observed dispatch, exactly one observed ack, and the terminal acked handoff state, then runs the masked-no-wake + reassign + stale-handle/stale-token assertion chain and emits exactly one headline cloudboot-evidence: provider-nvme-admin-completion-wait-ack <token> marker labeled admin_completion_wake=provider-cap-side-injected device_autonomous_raise=not-claimed. The wake is the same bounded kernel-injected cap-waiter model as make run-cloud-provider-cap-waiter; this proof does not claim a device-autonomously-raised NVMe MSI-X completion interrupt. Proof: make run-cloud-provider-nvme-admin-completion-wait-ack. Future work (not yet implemented): I/O queue creation, READ/WRITE, BlockDevice/filesystem integration, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic.

  • cloud-prod-nvme-io-queue-create-local-proof adds the single I/O queue pair (queue id 1) on top of the admin chain. After the combined poll-based admin IDENTIFY (DeviceMmio.brokeredNvmeAdminIdentify @7, VID 0x1b36), the manager authors the two queue-establishing admin commands behind parameterless per-command verbs: DeviceMmio.brokeredNvmeCreateIoCqSubmit (schema @10, opcode 0x05, CDW10 = queue id 1 | (queue-size-1)<<16, CDW11 PC=1 IEN=0, PRP1 = manager-owned I/O CQ base page) and DeviceMmio.brokeredNvmeCreateIoSqSubmit (schema @11, opcode 0x01, CDW10 = queue id 1 | (queue-size-1)<<16, CDW11 = CQ id 1 | PC<<16, PRP1 = manager-owned I/O SQ base page). The opcode/CDWs are manager-selected; the provider supplies nothing (widening @8 with a command-selector parameter was rejected because it would let a provider author arbitrary admin opcodes). Each SUBMIT verb authors the SQE at the next admin SQ index, rings the admin SQ tail doorbell, and records the in-flight create; the completion of each is consumed through the shared DeviceMmio.brokeredNvmeAdminComplete (@9, now command-aware: it reads the admin CQ entry at the recorded index, checks status/CID, and advances the CQ head doorbell) after one provider-cap-side Interrupt.wait, and the deferred LAPIC EOI is retired by one Interrupt.acknowledge. The cap-waiter route (kernel/src/cap/nvme_io_queue_create_proof.rs) drives two bounded kernel-injected dispatch + deferred-EOI cycles – one per create – and its ordered handoff enforces CREATE I/O CQ before CREATE I/O SQ, one create at a time, with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed. The I/O CQ/SQ base pages are manager-owned brokered bounce buffers (parked-pool slots 3/4, userspace slots 4/5); their PRP1 is never exported. On Interrupt cap release the kernel requires both creates completed (CQE status 0), exactly two observed dispatches, two observed acks, the idle terminal handoff, and the masked-no-wake + reassign + stale-handle chain, then emits one cloudboot-evidence: provider-nvme-io-queue-create <token> marker labeled io_queue_create_wake=provider-cap-side-injected device_autonomous_raise=not-claimed io_command=create-only io_read=not-attempted io_sq_doorbell=not-attempted. Proof: make run-cloud-provider-nvme-io-queue-create. Future work (not yet implemented): the I/O SQ tail doorbell (0x1008), READ/WRITE, the I/O data page, the I/O-completion route, BlockDevice/filesystem integration, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic.

  • cloud-prod-nvme-io-read-local-proof adds one bounded I/O READ (LBA 0, 1 block, NSID 1) on top of the live I/O queue pair. After the two CREATE I/O queue commands, the manager authors the entire READ SQE behind two parameterless per-command verbs: DeviceMmio.brokeredNvmeIoReadSubmit (schema @12) writes CDW0 (opcode 0x02 | command-id<<16), NSID 1, MPTR 0, PRP1 = manager-owned I/O read-data page (parked-pool slot 5), PRP2 0, SLBA 0 (CDW10/CDW11), NLB 0 = “1 block” (CDW12 bits 15:0) at I/O SQ index 0 and rings the I/O SQ tail doorbell (0x1008); DeviceMmio.brokeredNvmeIoReadComplete (schema @13) polls the I/O CQ entry at index 0 for the phase flip, checks status/CID, advances the I/O CQ head doorbell (0x100c), reads the first bytes of the read-data page through the kernel mapping, and surfaces a bounded read-data digest (readDataDigestLo/readDataDigestHi = first 8 bytes, readDataLen = transferred length). The provider supplies no opcode/LBA/PRP/ doorbell (a provider write32(0x1008, ...) path was rejected because it would break the no-provider-authored-command discipline; reusing the create/admin verbs was rejected because they are hardwired to the admin SQ/CQ doorbells and ledger). The cap-waiter route (kernel/src/cap/nvme_io_read_proof.rs) drives three bounded kernel-injected dispatch + deferred-EOI cycles – two creates plus one read – with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed; the read-data page is a manager-owned brokered bounce buffer (parked-pool slot 5, userspace slot 6) whose PRP1 is never exported, and the manager reads the completed block bytes only through the kernel mapping. On Interrupt cap release the kernel requires both creates and the read completed (CQE status 0), the verified block bytes (readDataLen > 0 and a non-zero digest), exactly three observed dispatches, three observed acks, the idle terminal handoffs, and the masked-no-wake + reassign + stale-handle chain, then emits one cloudboot-evidence: provider-nvme-io-read <token> marker labeled io_read_wake=provider-cap-side-injected device_autonomous_raise=not-claimed io_command=read io_read=completed io_sq_doorbell=performed io_cq_completion=polled-io-cq plus io_read_block_bytes=<digest> read_data_len=512. The local QEMU smoke seeds the NVMe backing file’s first sector with a known 16-byte pattern so the digest proves an actual byte transfer, not merely that a CQE arrived. Proof: make run-cloud-provider-nvme-io-read. The same marker shape also passed live on GCE run 1780806087-bf69 (make cloudboot-gcp-storage-nvme-io-read-test) against a Persistent Disk NVMe controller with vendor.1ae0, device.001f, live_cloud=gce-persistent-disk, and a 512-byte READ digest prefix eb3c904c494d494e4520200002000000. Future work (not yet implemented): a dedicated I/O-completion Interrupt route on live cloud, WRITE, multi-block/second-LBA reads, BlockDevice/filesystem integration, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud coverage beyond this one GCE PD read.

  • cloud-prod-nvme-io-write-local-proof adds one bounded I/O WRITE (LBA 0, 1 block, NSID 1) of a fixed manager pattern on top of the live I/O queue pair, proven durable by reading it back. After the two CREATE I/O queue commands, the manager authors the entire WRITE SQE behind two parameterless per-command verbs: DeviceMmio.brokeredNvmeIoWriteSubmit (schema @14) pre-fills the manager-owned I/O write-data page (parked-pool slot 6, userspace slot 7) with the fixed 16-byte signature facefeedcafebabe1122334455667788 repeated across the block, then writes CDW0 (opcode 0x01 | command-id<<16), NSID 1, MPTR 0, PRP1 = that page, PRP2 0, SLBA 0, NLB 0 = “1 block” at I/O SQ index 0 and rings the I/O SQ tail doorbell (0x1008); DeviceMmio.brokeredNvmeIoWriteComplete (schema @15) polls the I/O CQ entry at index 0 for the phase flip, checks status/CID, advances the I/O CQ head doorbell (0x100c), reads the first bytes of the write-data page through the kernel mapping, and surfaces a bounded written-pattern digest (carried in the shared readDataDigestLo/readDataDigestHi/readDataLen fields). The landed I/O READ (@12/@13) is then reused unchanged to read LBA 0 back into the read-data page (slot 5) at the next I/O SQ/CQ index (1, since the WRITE consumed index 0); the provider supplies no opcode/LBA/PRP/pattern/doorbell, and a new schema field was deliberately avoided – the durability match is computed kernel-side by comparing the written-pattern digest with the read-back digest. The cap-waiter route (kernel/src/cap/nvme_io_write_proof.rs) drives four bounded kernel-injected dispatch + deferred-EOI cycles – two creates, one write, one read-back – with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed; the write-data and read-data pages are manager-owned brokered bounce buffers whose PRP1 is never exported. On Interrupt cap release the kernel requires both creates, the write, and the read-back completed (CQE status 0), non-zero digests, exactly four observed dispatches and acks, the idle terminal handoffs, the masked-no-wake + reassign + stale-handle chain, and the read-back digest matching the written pattern, then emits one cloudboot-evidence: provider-nvme-io-write <token> marker labeled io_command=write io_write=completed io_sq_doorbell=performed io_cq_completion=polled-io-cq write_pattern=<digest> write_readback_match=true. The local QEMU smoke seeds the backing file’s first sector with a distinct sentinel so the read-back of the manager pattern proves the WRITE transferred the bytes. Proof: make run-cloud-provider-nvme-io-write. Future work (not yet implemented): a dedicated I/O-completion Interrupt route distinct from the admin/create/read route, multi-block/second-LBA/second-NSID I/O, flush/FUA/DSM, BlockDevice/filesystem integration, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic.

  • cloud-prod-nvme-io-second-lba-local-proof generalizes the manager-authored data path beyond the hardwired SLBA 0: it proves LBA addressing actually selects the block by driving three sequential I/O commands on the live queue pair through two new parameterless verbs (DeviceMmio.brokeredNvmeIoSecondLbaSubmit @16 / brokeredNvmeIoSecondLbaComplete @17), selected by a kernel-owned phase counter: phase 0 reads LBA 0 (the distinctness baseline, read-data slot 5), phase 1 pre-fills the write-data page (slot 6) with a fixed LBA-1-distinct 16-byte pattern 0123456789abcdeffedcba9876543210 and writes it to LBA 1 (opcode 0x01, CDW10 = SLBA low = 1, CDW11 = SLBA high = 0), and phase 2 reads LBA 1 back. Because the landed @12-@15 verbs hardwire SLBA 0 (and their I/O index depends on the io-write feature), this proof implies only cloud_nvme_io_queue_create_proof and authors its own SQEs at I/O SQ indices 0/1/2; the provider supplies no opcode/LBA/PRP/pattern/doorbell. No schema field was added – the LBA-1 read-back match (LBA-1 read digest == LBA-1 write pattern) and the LBA distinctness (LBA-1 read digest != LBA-0 read digest) are computed kernel-side across the three recorded phase digests. The cap-waiter route (kernel/src/cap/nvme_io_second_lba_proof.rs) drives five bounded kernel-injected dispatch + deferred-EOI cycles (two creates + three I/O phases), with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed. On Interrupt cap release the kernel requires both creates and all three phases completed (CQE status 0), non-zero digests, five observed dispatches and acks, the masked-no-wake + reassign + stale-handle chain, second_lba_readback_match=true, and lba_distinct_from_zero=true, then emits one cloudboot-evidence: provider-nvme-io-second-lba <token> marker labeled io_command=second-lba io_second_lba=1 second_lba_readback_match=true lba_distinct_from_zero=true io_sq_doorbell=performed io_cq_completion=polled-io-cq. The local QEMU smoke seeds the backing file’s first sector with the distinct sentinel deadbeefcafebabe0102030405060708 so the LBA-0 read returns content distinct from the LBA-1 pattern. Proof: make run-cloud-provider-nvme-io-second-lba. Future work (not yet implemented): a dedicated I/O-completion Interrupt route, multi-block (NLB > 0) I/O, a third LBA or second namespace, flush/FUA/DSM, BlockDevice/filesystem integration (now unblocked by this LBA parameterization), device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic.

  • cloud-prod-nvme-io-multiblock-local-proof generalizes the manager-authored data path beyond one logical block: it proves the authored SQE drives a transfer larger than a single block by driving two sequential I/O commands on the live queue pair through two new parameterless verbs (DeviceMmio.brokeredNvmeIoMultiblockSubmit @18 / brokeredNvmeIoMultiblockComplete @19), selected by a kernel-owned phase counter: phase 0 pre-fills the write-data page (slot 6) with two distinct 16-byte patterns – block 0 = 112233445566778899aabbccddeeff00, block 1 = f0e1d2c3b4a5968778695a4b3c2d1e0f – over 1024 B and writes both blocks to LBA 2 (opcode 0x01, NLB = block_count - 1 = 1, CDW10 = 2; PRP1 = slot 6, PRP2 = 0, since the 1024 B transfer fits one 4 KiB page), and phase 1 reads LBA 2 back into the read-data page (slot 5). Because the landed @12-@17 verbs hardwire a single block (block_count = 1), this proof implies only cloud_nvme_io_queue_create_proof and authors its own SQEs at I/O SQ indices 0/1; the provider supplies no opcode/LBA/count/PRP/pattern/doorbell. No schema field was added for the second block’s digest – the existing readDataDigestLo/Hi carry block 0’s first 8 bytes for userspace, while the per-block match (read block 0 == written pattern-0, read block 1 == written pattern-1) and the block-distinctness (block 0 digest != block 1 digest, proving the second 512 B block actually transferred) are computed kernel-side across the two recorded phase digest pairs and attested in the headline marker. The cap-waiter route (kernel/src/cap/nvme_io_multiblock_proof.rs) drives four bounded kernel-injected dispatch + deferred-EOI cycles (two creates + two I/O phases), with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed. On Interrupt cap release the kernel requires both creates and both phases completed (CQE status 0), non-zero digests, four observed dispatches and acks, the masked-no-wake + reassign + stale-handle chain, multiblock_block0_match=true, multiblock_block1_match=true, and multiblock_blocks_distinct=true, then emits one cloudboot-evidence: provider-nvme-io-multiblock <token> marker labeled io_command=multiblock io_slba=2 io_nlb=1 io_block_count=2 prp2_zeroed=true multiblock_block0_match=true multiblock_block1_match=true io_sq_doorbell=performed io_cq_completion=polled-io-cq. Proof: make run-cloud-provider-nvme-io-multiblock. Future work (not yet implemented): a dedicated I/O-completion Interrupt route, NLB > 1 requiring a PRP list / second mapped page, a third LBA or second namespace, flush/FUA/DSM, wrapping the brokered READ/WRITE behind a userspace-served BlockDevice cap (now has both LBA selection and >1-block transfer as prerequisites), device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic.

  • cloud-prod-nvme-io-synchronous-poll-read-local-proof collapses the four-call submit/wait/complete/ack NVMe I/O lifecycle into ONE synchronous CapObject::call – the shape BlockDevice.readBlocks @0 requires. It adds two parameterless single-call verbs (DeviceMmio.brokeredNvmeIoSyncWrite @20 / brokeredNvmeIoSyncRead @21), each mirroring the combined brokeredNvmeAdminIdentify @7: the manager pre-fills the write-data page (slot 6) with 112233445566778899aabbccddeeff00 (block 0) and f0e1d2c3b4a5968778695a4b3c2d1e0f (block 1), authors the SQE (WRITE opcode 0x01 at I/O SQ index 0 / READ opcode 0x02 at index 1; NSID 1, SLBA 2, NLB = 1 / two 512 B blocks, PRP1 = data page, PRP2 = 0), rings the I/O SQ tail doorbell (0x1008), polls the I/O CQ entry phase bit to completion within a bounded budget, advances the I/O CQ head doorbell (0x100c), and reads block 0/block 1 back – all inside one cap call, with no Interrupt.wait on the I/O data path. The two CREATE I/O queue commands still complete through the cap-waiter Interrupt.wait/acknowledge path, so the route (kernel/src/cap/nvme_io_sync_read_proof.rs) drives only two bounded kernel-injected dispatch + deferred-EOI cycles. The single-call verbs report sqDoorbellWritten, cqDoorbellWritten, and completionConsumed all true in one result; the block-1 match and block-distinctness are computed kernel-side across the recorded WRITE/READ digest pairs. On Interrupt cap release the kernel requires both creates and both single-call I/O commands completed (CQE status 0), non-zero digests, two observed dispatches and acks, the masked-no-wake + reassign + stale-handle chain, sync_block0_match=true, sync_block1_match=true, and sync_blocks_distinct=true, then emits one cloudboot-evidence: provider-nvme-io-sync-read <token> marker labeled io_command=sync-read io_slba=2 io_nlb=1 io_block_count=2 prp2_zeroed=true sync_block0_match=true sync_block1_match=true sync_blocks_distinct=true io_sq_doorbell=performed io_cq_completion=polled-io-cq-single-call interrupt_wait=not-used. Proof: make run-cloud-provider-nvme-io-sync-read. This closes concern (c) of the BlockDevice-shaped read gap (lifecycle collapse) without touching BlockDeviceCap/crate::virtio (concern a), the manager-op routing into a generic cap (concern b), a dedicated I/O-completion Interrupt route, or a PRP list (NLB > 1). Future work (not yet implemented): introducing an NVMe-backed BlockDeviceCap whose readBlocks @0 arm calls this single-call op, a readonly_fs-style consumer over it, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic.

  • cloud-prod-nvme-io-sync-read-block-bytes-local-proof surfaces the full read-back bytes. It adds one verb, DeviceMmio.brokeredNvmeIoSyncReadBytes @22 () -> NvmeBrokeredAdminOpReadBytesResult, that reuses the landed single-call poll-read body (nvme_brokered_io_sync_command) unchanged but returns the entire 1024 B read-back (block 0 ‖ block 1), read through the kernel mapping, as the inline data :Data field of a new narrow result struct – the full-bytes shape BlockDevice.readBlocks @0 -> (data :Data) requires – instead of folding it to an 8-byte digest. The provider issues @20 (WRITE) then @22 (READ-bytes) as two synchronous cap calls and compares the returned data byte-for-byte to the reconstructed manager-authored page, asserting the two 512 B halves differ; the kernel still attests per-block match and distinctness in the release marker. No host-physical/IOVA address crosses the boundary – only the content bytes the caller already authored. The cap-waiter route (kernel/src/cap/nvme_io_sync_read_bytes_proof.rs) is a clone of the sync-read proof that emits one cloudboot-evidence: provider-nvme-io-sync-read-bytes <token> marker labeled io_command=sync-read-bytes io_slba=2 io_nlb=1 io_block_count=2 read_data_len=1024 data_return=inline-bytes data_block0_match=true data_block1_match=true data_blocks_distinct=true io_cq_completion=polled-io-cq-single-call interrupt_wait=not-used. Proof: make run-cloud-provider-nvme-io-sync-read-bytes. This closes concern (b) of the BlockDevice-shaped read gap (full-bytes return) without touching BlockDeviceCap/crate::virtio (concern a), arbitrary (startLba, count) parameterization, a dedicated I/O-completion Interrupt route, or a PRP list (NLB > 1). Future work (not yet implemented): an NVMe-backed BlockDeviceCap backend enum whose readBlocks @0 arm calls this op, arbitrary-LBA routing, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic.

  • cloud-prod-nvme-blockdevice-fixed-lba-read-arm-local-proof makes the NVMe namespace consumable through the SAME BlockDevice.readBlocks @0 interface a filesystem consumer calls (proof-fixed LBA arm). It replaces BlockDeviceCap’s bare device_index: usize with a BlockDeviceBackend enum (kernel/src/cap/block_device.rs): the always-built Virtio { device_index } arm (behavior-identical to today, verified by make run-storage-fs) and, under cloud_nvme_blockdevice_read_proof, an NvmeBrokered { handle, owner } arm. The NvmeBrokered arm’s readBlocks @0 accepts ONLY startLba == 2 && count == 2 (fails closed with BlockDevice.readBlocks NVMe arm fixed to SLBA 2 NLB 1 on any other window) and drives the landed nvme_brokered_io_sync_read_bytes_op_for_cap (@22) body into a local 1024 B buffer surfaced as inline Data through the same read_blocks_results builder the virtio arm uses; writeBlocks/flush fail closed (read-only namespace) and info returns the fixed geometry. A new bootstrap grant arm mints the NvmeBrokered cap bound to the SAME live device_mmio handle/owner the production grant source staged (devicemmio_grant_source_prod::live_handle_for_nvme_blockdevice), so the device_mmio grant must precede the block_device grant in the manifest cap list. The provider drives the full bring-up (reset → enable → IDENTIFY @7 → CREATE I/O CQ @10 → CREATE I/O SQ @11 → @20 WRITE) and issues the TERMINAL read through BlockDevice.readBlocks(2, 2) instead of a raw DeviceMmio @22. The cap-waiter route (kernel/src/cap/nvme_blockdevice_read_proof.rs, a clone of the sync-read-bytes proof) emits one cloudboot-evidence: provider-nvme-blockdevice-read <token> marker labeled read_path=blockdevice-readblocks read_iface=BlockDevice read_method=0 io_slba=2 io_nlb=1 io_block_count=2 read_data_len=1024 data_return=inline-bytes nvme_arm_fixed_lba=true arbitrary_lba=not-supported. Proof: make run-cloud-provider-nvme-blockdevice-read. This closes concern (a) of the BlockDevice-shaped read gap (the same schema method, not the bespoke @22 verb), restricted to the proof-fixed window. Future work (not yet implemented): arbitrary (startLba, count) parameterization, NVMe write/flush durability through BlockDevice, a readonly_fs-style filesystem mounted over the NVMe BlockDevice cap, a dedicated I/O-completion Interrupt route, NLB > 1 with a PRP list, and graduating the NVMe data plane out of the per-proof feature into always-built production.

  • cloud-prod-nvme-blockdevice-arbitrary-lba-read-local-proof widens the NvmeBrokered arm off the hardwired SLBA 2 / NLB 1: readBlocks @0 now honors an ARBITRARY (startLba, count) window (read-only, bounded to one PRP1 page). The shared single-call body nvme_brokered_io_sync_command (kernel/src/device_manager/stub.rs) gains explicit slba/block_count fields on SyncIoParams and authors CDW10/CDW11 (SLBA) and CDW12 (NLB) from them instead of the module constants; the transfer length is block_count * 512, bounded fail-closed to one 4 KiB PRP1 page (block_count <= 8) so PRP2 stays 0. The existing @20/@21/@22 callers keep passing the proof-fixed SLBA 2 / count 2, so their behavior is byte-identical (regression: make run-cloud-provider-nvme-io-sync-read-bytes). A new parameterized op nvme_brokered_io_sync_read_window_op_for_cap(handle, owner, slba, count, out_data) rotates the I/O SQ/CQ index off the kernel-side read sequence (window 0 at index 1 after the WRITE at index 0, window 1 at index 2) so each completion is polled at the CQ slot the controller actually writes. Under cloud_nvme_blockdevice_arbitrary_lba_proof (implies and supersedes cloud_nvme_blockdevice_read_proof), BlockDeviceCap::nvme_read_blocks admits any 1 <= count <= 8 window with startLba + count <= namespace blocks (the IDENTIFY-derived NSZE reported through info @2 – 16 MiB / 512 = 32768 on the QEMU fixture image; see the READ-arm graduation entry), and fails closed with distinct errors for count == 0 / count > 8 (... count out of range (1..=8)) and a window past the namespace end (... window past namespace end). The proof (kernel/src/cap/nvme_blockdevice_arbitrary_lba_proof.rs, a clone of the fixed-LBA module) drives the full bring-up plus the @20 WRITE (seeding LBA 2 = pattern-0, LBA 3 = pattern-1), then issues TWO distinct readBlocks windows – readBlocks(0, 1) (zero-filled LBA 0) and readBlocks(3, 2) (LBA 3 = pattern-1, LBA 4 = zero-filled) – comparing each returned data byte-for-byte to the manager-authored content and asserting the two windows return distinct content. It emits one cloudboot-evidence: provider-nvme-blockdevice-arbitrary-lba-read <token> marker labeled arbitrary_lba=supported window0_slba=0 window0_count=1 window1_slba=3 window1_count=2 windows_distinct=true prp_pages=single nvme_arm_fixed_lba=false. Proof: make run-cloud-provider-nvme-blockdevice-arbitrary-lba-read. With this, the NVMe namespace is readable through BlockDevice.readBlocks @0 at the LBA the consumer names. Future work (not yet implemented): NLB spanning more than one PRP1 page (count > 8) with a PRP list, NVMe write/flush durability through BlockDevice, a readonly_fs-style filesystem mounted over the NVMe BlockDevice cap, a dedicated I/O-completion Interrupt route, and graduating the NVMe data plane out of the per-proof feature into always-built production.

  • cloud-prod-nvme-blockdevice-writeblocks-durability-arm-local-proof arms the NvmeBrokered arm’s writeBlocks @1 (read-only until now): it drives the brokered NVMe sync WRITE with the caller-supplied (startLba, count, data) and proves write-then-read-back durability. A new parameterized op nvme_brokered_io_sync_write_window_op_for_cap(handle, owner, slba, count, in_data) (kernel/src/device_manager/stub.rs) mirrors the arbitrary-window READ entry but rotates the I/O index off a kernel-side write sequence (next_write_io_index() => index 0, before the read-back at index 1). The shared single-call body nvme_brokered_io_sync_command gains a third fill mode – a write_payload: Option<&[u8]> that copies the caller’s count * 512 bytes into block 0..count of the manager-owned write-data page (slot 6) through the HHDM mapping before the WRITE SQE is authored – beside the fixed prefill_pattern and the readonly_fs seed_image modes (both unchanged: regressions make run-cloud-provider-nvme-blockdevice-arbitrary-lba-read and make run-storage-fs stay green). Under cloud_nvme_blockdevice_writeblocks_proof (implies and supersedes cloud_nvme_blockdevice_arbitrary_lba_proof), BlockDeviceCap::nvme_write_blocks admits any 1 <= count <= 8 window with startLba + count <= namespace blocks and data.len() == count * 512, failing closed with distinct errors for zero count, over-capacity, past-namespace-end, and length mismatch; info @2 reports readOnly = false; flush @3 stays fail-closed (a real NVMe FLUSH, opcode 0x00, is a distinct verb – see the flush @3 capability below). The proof (kernel/src/cap/nvme_blockdevice_writeblocks_proof.rs, a clone of the arbitrary-LBA module) drives the full bring-up then writeBlocks(5, 2, data) with a caller-authored, non-zero, two-distinct-block 1024 B payload, followed by readBlocks(5, 2), comparing the read-back byte-for-byte to the bytes written. It emits one cloudboot-evidence: provider-nvme-blockdevice-writeblocks-durability <token> marker labeled write_path=blockdevice-writeblocks write_method=1 write_slba=5 write_count=2 write_data_len=1024 readback_data_len=1024 write_readback_match=true nvme_arm_read_only=false flush=fail-closed prp_pages=single. Proof: make run-cloud-provider-nvme-blockdevice-writeblocks-durability. No schema/binding change (writeBlocks @1 and readBlocks @0 round-trip through existing bindings). Future work (not yet implemented): a writable_fs / persistent_store consumer mounted over the NVMe BlockDevice write arm, a real NVMe FLUSH on flush @3, a dedicated I/O-completion Interrupt route on the data path, and graduating the NVMe data plane out of the per-proof feature into always-built production.

  • ddf-nvme-multiprp-blockdevice-window-local-proof extends the same BlockDevice.writeBlocks @1 / readBlocks @0 round-trip to a three-page NVMe PRP window. Under cloud_nvme_blockdevice_multiprp_window_proof, BlockDeviceCap::nvme_write_blocks and BlockDeviceCap::nvme_read_blocks accept count <= 24 for the local proof geometry while the default and older proof builds keep the one-page count <= 8 bound. The shared nvme_brokered_io_sync_command body resolves primary read/write data pages from parked-pool slots 5/6, a manager-owned PRP-list page from slot 7, read extension pages from slots 8/9, and write extension pages from slots 10/11. For the writeBlocks(5, 24, data) and readBlocks(5, 24) proof window it authors PRP1 as the primary data page and PRP2 as a PRP-list page containing two little-endian page pointers, matching the NVMe PRP-list subset in NVMe Base Specification 1.4 §4.3. The provider still supplies only inline Data through the BlockDevice schema and never sees a host physical address, IOVA, PRP1, PRP2, PRP-list page address, SQE byte, doorbell offset, or doorbell value. Requests with zero count, count 25, namespace overflow, or length mismatch fail closed before any I/O SQ doorbell write. The release marker includes full-transfer FNV-1a hashes for the WRITE and read-back records so the kernel-side proof is not limited to the first two 16-byte block digests; the userspace smoke also compares all 12 KiB byte-for-byte. Proof: make run-cloud-provider-nvme-blockdevice-multiprp-window.

  • cloud-prod-nvme-blockdevice-flush-local-proof arms the NvmeBrokered arm’s flush @3 (fail-closed until now): it authors a real NVMe FLUSH (NVM command-set opcode 0x00, NSID-scoped, no data transfer) through the brokered sync command machinery and proves a writeBlocks then flush returns CQE status 0 and the written block survives the flush. A new parameter-free op nvme_brokered_io_sync_flush_op_for_cap(handle, owner) (kernel/src/device_manager/stub.rs) drives the shared single-call body nvme_brokered_io_sync_command with a FLUSH SyncIoParams (opcode = 0x00, command_id = 8, slba = 0, block_count = 0), rotating the I/O index off a kernel-side flush sequence (next_flush_io_index() => index 1, after the WRITE at 0 and before the read-back at 2). The shared body learns the FLUSH shape (gated on the opcode): it skips the one-PRP1-page data bound, authors the SQE with NSID only and PRP1 = 0/PRP2 = 0/CDW10..15 = 0 (no data page touched), and the WRITE/READ data-bearing path stays byte-identical for non-FLUSH opcodes. Under cloud_nvme_blockdevice_flush_proof (implies and supersedes cloud_nvme_blockdevice_writeblocks_proof, the flush proof’s true sibling, so the write/read arms and the whole brokered I/O chain are reused unchanged), BlockDeviceCap::nvme_flush returns () when the FLUSH was authored + the SQ doorbell rung + the completion consumed + CQE status 0, failing closed otherwise. The proof (kernel/src/cap/nvme_blockdevice_flush_proof.rs, a clone of the writeblocks module) drives the full bring-up then writeBlocks(5, 2, data), flush(), and readBlocks(5, 2), comparing the post-flush read-back byte-for-byte to the bytes written. It emits one cloudboot-evidence: provider-nvme-blockdevice-flush <token> marker labeled flush_path=blockdevice-flush flush_method=3 nvme_flush_opcode=0x00 flush_cqe_status=0 write_then_flush_ok=true flush_data_transfer=none prp1=0 prp2=0 write_readback_after_flush_match=true reboot_persistence=deferred durability_proof=flush-completion-only virtio_flush_regression=green. Proof: make run-cloud-provider-nvme-blockdevice-flush. No schema/binding change (flush @3 () -> () round-trips through existing bindings with its empty params/result). Future work (not yet implemented): an NVMe reboot-persistence pass and crash-consistency where the FLUSH barrier specifically changes the survival outcome (a flushed write surviving a forced poweroff an unflushed one would not), routing File.sync / the writable-fs / persistent-store sync through this FLUSH, a dedicated I/O-completion Interrupt route on the data path, NLB>1 spanning multiple PRP pages with a PRP list, and graduating the NVMe data plane out of the per-proof feature into always-built production.

  • cloud-prod-nvme-blockdevice-reboot-persistence-local-proof closes the reboot-persistence gap the flush proof named first: it proves a normally committed

    • FLUSHED write survives a CLEAN reboot through the same BlockDevice interface, the two-boot analogue of run-storage-persist on the NVMe arm. The Makefile recipe (make run-cloud-provider-nvme-blockdevice-reboot-persistence) creates ONE nvme.raw image and boots the non-qemu cloudboot kernel over it TWICE WITHOUT regenerating it between boots. The provider self-selects its boot phase by probing the LBA 5..6 window through readBlocks(5, 2) @0 – the data window itself is the guard sentinel: boot 1 reads back all-zero (fresh namespace), takes the writer branch, and issues writeBlocks(5, 2, data) @1 + a real flush() @3 (CQE status 0); QEMU restarts against the SAME backing file and boot 2 reads back the known payload, takes NO writer branch, and the single read-back verifies persistence. The proof reuses the landed writeBlocks @1 / flush @3 / readBlocks @0 arms and the brokered sync command machinery unchanged; the only kernel-internal additions are a flat per-boot single-call I/O op log (reworked from the flush proof’s rigid WRITE -> FLUSH -> read-back state machine so the verifier boot can record a READ with no prior WRITE in the same boot), the data-window phase select, and the cross-boot phase=1|2 marker labels. The cap-waiter route + headline marker come from kernel/src/cap/nvme_blockdevice_reboot_persistence_proof.rs (a clone of the flush module under cloud_nvme_blockdevice_reboot_persistence_proof, which implies and supersedes cloud_nvme_blockdevice_flush_proof). Each boot emits one cloudboot-evidence: provider-nvme-blockdevice-reboot-persistence <token> marker carrying its phase (phase=1 ... write_then_flush_ok=true flush_cqe_status=0 boot_role=writer-flush on boot 1; phase=2 ... reboot_persistence_match=true boot_role=verifier durability_proof=clean-reboot-persistence on boot 2). The reboot-persistence gate is the cross-boot correlation: boot 1’s persisted block digests equal boot 2’s read-back block digests (and both equal the known payload). No schema/binding change. Future work (not yet implemented): crash-consistency where the FLUSH barrier specifically changes the survival outcome under an induced mid-flush crash (the analogue of run-storage-writable-recovery), routing File.sync / the writable-fs / persistent-store sync through this FLUSH, a dedicated I/O-completion Interrupt route on the data path, NLB>1 spanning multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature into always-built production.
  • cloud-prod-nvme-blockdevice-flush-crash-consistency-local-proof covers the flushed-write-survives half of crash-consistency: it proves a normally committed + FLUSHED write survives a FORCED poweroff (an abrupt kill -9 of the QEMU process AFTER the flush barrier completed), the NVMe BlockDevice analogue of run-storage-writable-recovery. The Makefile recipe (make run-cloud-provider-nvme-blockdevice-flush-crash-consistency) creates ONE nvme.raw image, boots the non-qemu cloudboot kernel over it in the BACKGROUND (boot 1: empty namespace -> writer branch -> writeBlocks(5, 2, data) @1

    • real flush() @3, CQE status 0), watches the kernel log for the bounded arming marker [nvme-blockdevice-flush-crash-consistency] kernel: flushed write armed; awaiting forced poweroff, kill -9s the QEMU PID (the forced poweroff AFTER the flush barrier), then boots a SECOND time over the SAME file WITHOUT regenerating it (boot 2: verifier -> single readBlocks(5, 2) @0 read-back). The proof reuses the reboot-persistence predecessor’s two-boot phase select, the landed writeBlocks @1 / flush @3 / readBlocks @0 arms, and the brokered sync command machinery unchanged; the only kernel-internal additions over the predecessor are the phase-1 arm-and-spin window after the flush (on_release emits the arming marker and spins forever so the recipe can kill -9 at that point) and the forced-poweroff marker labels. The cap-waiter route + headline marker come from kernel/src/cap/nvme_blockdevice_flush_crash_consistency_proof.rs (a clone of the reboot-persistence module under cloud_nvme_blockdevice_flush_crash_consistency_proof, which implies and supersedes cloud_nvme_blockdevice_reboot_persistence_proof). Boot 1 emits one cloudboot-evidence: provider-nvme-blockdevice-flush-crash-consistency <token> marker carrying phase=1 ... write_then_flush_ok=true flush_cqe_status=0 armed_forced_poweroff=true boot_role=writer-flush-arm before the spin; boot 2 emits one carrying phase=2 ... flush_survives_forced_poweroff=true boot_role=verifier durability_proof=flush-survives-forced-poweroff. The crash-consistency gate is the cross-boot correlation: boot 1’s persisted block digests equal boot 2’s read-back block digests (and both equal the known payload), AND boot 1 reached the arm-and-spin window (was forcibly killed, did not take the verifier branch). No schema/binding change. Scoped honestly: “an unflushed write rolls back” is NOT provable under QEMU’s -device nvme cache=writeback model (the host page cache survives kill -9), so the differential-rollback half is NOT claimed (unflushed_rollback=not-provable-under-qemu-nvme-model). Future work (not yet implemented): a dedicated I/O-completion Interrupt route on the data path, NLB>1 spanning multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature into always-built production. (Both higher-level consumer FLUSH routings are now closed: the writable-fs File.sync half by cloud-prod-nvme-consumer-sync-to-flush-local-proof and the persistent-Store put-commit half by cloud-prod-nvme-persistent-store-sync-to-flush-local-proof, both below.)
  • cloud-prod-nvme-dedicated-io-completion-interrupt-local-proof moves the NVMe BlockDevice.writeBlocks @1 / readBlocks @0 data-completion handoff off the synchronous I/O-CQ poll return path and onto a dedicated data Interrupt route. The spec basis remains NVMe Base Specification 1.4 submission/completion queue doorbells and completion entries (§3 controller registers, §4 queue management and CQ phase/status handling, §6 NVM WRITE / READ commands). The proof keeps table entry 0 for the CREATE I/O CQ/SQ admin completions and adds table entry 1 for the data I/O CQ completions; both routes are kernel-injected cap-waiter MSI-X routes, not a device-autonomous interrupt claim. The implementation entry points are kernel/src/cap/nvme_io_completion_interrupt_proof.rs (init, invoke_wait, invoke_acknowledge, poll_blockdevice_completions, emit_marker), kernel/src/device_manager/stub.rs (nvme_brokered_io_completion_interrupt_submit_op_for_cap, nvme_brokered_io_completion_interrupt_complete_op_for_cap, nvme_io_completion_interrupt_submit_record_buffers_live), and kernel/src/cap/block_device.rs (nvme_interrupt_write_blocks, nvme_interrupt_read_blocks, call_with_context). The manager still authors queue bases and PRP1 from the live DMAPool ledger, copies the caller’s write payload into the parked write-data page, consumes the I/O CQ entry at Interrupt.acknowledge, advances the I/O CQ head doorbell (0x100c), and posts the deferred BlockDevice completion only after the bounded caller CQ has room. The proof (make run-cloud-provider-nvme-io-completion-interrupt) drives writeBlocks(5, 2, data), waits/acks the data route, observes the deferred write completion, then drives readBlocks(5, 2), waits/acks the data route, and receives the read bytes through the standard block_device::read_blocks_results data field. The headline marker cloudboot-evidence: provider-nvme-io-completion-interrupt <token> pins create.entry.0, io.entry.1, data_route_distinct_from_create_route=true, four dispatches/four deferred EOIs, both deferred BlockDevice completions posted, and a byte-for-byte write/read-back match. Scoped honestly: queue-base and PRP addresses remain hidden (host_physical_user_visible=0, iova_export=disabled-future-only, prp_source=manager-ledger); multi-PRP windows (count > 8), provider-written PRP/SGL/address lanes, live cloud, a second namespace, FUA/DSM, and device-autonomous MSI-X delivery remain future work.

  • cloud-prod-readonly-fs-over-nvme-blockdevice-local-proof provides a readonly_fs-style consumer over the NVMe BlockDevice arm: the read-only filesystem mount reads its sectors through the NVMe BlockDevice cap instead of the kernel-owned virtio-blk free functions. kernel/src/cap/readonly_fs.rs gains a BlockSource seam abstracting the two reads a mount needs (device geometry + range read). The always-built Virtio variant routes to the same crate::virtio free functions, so make run-storage-fs stays byte-identical; the Nvme variant (built only under cloud_readonly_fs_over_nvme_proof) reads through a granted NVMe-backed BlockDevice – geometry from the IDENTIFY Namespace claim (see the READ-arm graduation entry), each chunked range read through nvme_brokered_io_sync_read_window_op_for_cap, one 4 KiB PRP1 page per call. Because the brokered controller is brought up by the userspace provider, the NVMe root Directory (granted via read_only_fs_root) defers its mount-parse to the first Directory.open. The proof (kernel/src/cap/readonly_fs_over_nvme_proof.rs, a clone of the arbitrary-LBA module) drives the full bring-up, then seeds a tiny CAPOSRO1 image through the repurposed @20 op (one manager-baked sector per call: superblock @ LBA 0, entry table @ LBA 1, file data @ LBA 2), mounts the filesystem over the NVMe BlockSource, opens the one seeded file, reads it, and compares the bytes. It emits one cloudboot-evidence: provider-readonly-fs-over-nvme <token> marker labeled read_path=readonly-fs-over-blockdevice fs_format=CAPOSRO1 block_source=nvme-blockdevice file_match=true superblock_via_nvme=true entry_table_via_nvme=true extent_via_nvme=true after the kernel verifies each read-back block-0 digest against the baked image. Proof: make run-cloud-provider-readonly-fs-over-nvme. The malformed-image fail-closed paths (bad superblock magic, out-of-range entry-table or file extent) are the unchanged shared mount_root_inner/parse_entries validation in kernel/src/cap/readonly_fs.rs – the BlockSource seam swaps only the block-read backend, so the existing MountError checks covered by make run-storage-fs apply identically over the NVMe BlockSource; the NVMe arm additionally rejects an over-range range read with the arbitrary-LBA arm’s fail-closed error. Future work (not yet implemented): a multi-file directory walk / Directory.list traversal over NVMe, files whose extents span many one-PRP1-page chunks, NVMe write/flush durability through BlockDevice (the image is seeded via the manager-owned @20 op, not writeBlocks), a dedicated I/O-completion Interrupt route, and graduating the NVMe data plane and the readonly_fs NVMe mount out of the per-proof feature into always-built production.

  • cloud-prod-readonly-fs-over-nvme-multifile-dirwalk-local-proof extends the read-only filesystem to multi-file directories: it lists a directory with more than one entry and reads two distinct files over the NVMe BlockDevice cap, one of which spans multiple 4 KiB chunks. The baked image (kernel/src/cap/readonly_fs_over_nvme_multifile_proof.rs, a clone of the single-file module) grows to 12 sectors – superblock @ LBA 0, a two-record entry table @ LBA 1, a one-sector small file @ LBA 2, and a 9-sector large file @ LBA 3..11 carrying a deterministic position-dependent byte pattern. The large File.read covers nine sectors, so the read_range chunk loop issues TWO BlockDevice.readBlocks @0 calls (an 8-sector chunk @ LBA 3 + a 1-sector chunk @ LBA 11) – the multi-chunk path the single-file arm never exercised. The proof identifies each recorded read by (slba, count) and verifies it byte-for-byte with a per-read FNV-1a-64 over the full transfer (computed in device_manager alongside the block digests), so a dropped trailing chunk fails closed. Because per-sector seeding (12) plus the filesystem reads (5) issue 17 single I/O commands and the monotonic I/O SQ/CQ index must stay inside one first CQ pass, the build raises device_manager::stub::NVME_IO_QUEUE_DEPTH from 8 to 32 (create cdw10=0x001f0001); the change is inert for every other NVMe proof build. It emits one cloudboot-evidence: provider-readonly-fs-over-nvme-multifile <token> marker labeled dir_entry_count=2 file_count=2 files_distinct=true large_file_full_match=true large_file_read_blocks_calls=2 superblock_via_nvme=true entry_table_via_nvme=true extents_via_nvme=true (plus the single-file arm’s discipline labels). Proof: make run-cloud-provider-readonly-fs-over-nvme-multifile; the virtio mount path stays byte-identical (make run-storage-fs). Future work (not yet implemented): NVMe write/flush durability through BlockDevice, a dedicated I/O-completion Interrupt route on the data path, NLB > 1 spanning multiple PRP pages in a single call, sub-directory trees, and graduating the NVMe data plane and the readonly_fs NVMe mount out of the per-proof feature into always-built production.

  • cloud-prod-persistent-store-over-nvme-blockdevice-local-proof provides a writable consumer over the NVMe BlockDevice write arm: the disk-backed persistent Store mounts over the NVMe BlockDevice write arm and proves a put-then-get durability round-trip. kernel/src/cap/persistent_store.rs gains a read+write BlockSource seam (mirroring readonly_fs::BlockSource but with a write_blocks method). The always-built Virtio variant routes to the same crate::virtio free functions byte-identically (including the data_region_base_lba() installable-disk offset, folded into the variant), so make run-storage-persist stays green; the Nvme variant (built only under cloud_persistent_store_over_nvme_proof) reads through nvme_brokered_io_sync_read_window_op_for_cap and writes through nvme_brokered_io_sync_write_window_op_for_cap, one 4 KiB PRP1 page per call. Because the brokered controller is brought up by the userspace provider, the NVMe root Store (granted via persistent_store) defers its mount-parse to the first Store call. The proof (kernel/src/cap/persistent_store_over_nvme_proof.rs, a clone of the writeblocks module) drives the full bring-up, then seeds a CAPOSST1 superblock + empty entry table through the repurposed @20 op (superblock @ LBA 0, entry table @ LBA 1), and exercises the granted Store: Store.put writes the data extent (LBA 2), entry-table sector, and superblock through BlockDevice.writeBlocks @1, and Store.get reads the extent back through BlockDevice.readBlocks @0. The kernel attests the put WRITE and get READ block-0 digests both equal the payload digest and differ from the pre-put (zero) extent, and userspace compares the returned bytes byte-for-byte. It emits one cloudboot-evidence: provider-persistent-store-over-nvme <token> marker labeled write_path=store-put-over-blockdevice-writeblocks read_path=store-get-over-blockdevice-readblocks consumer=persistent-store store_iface=Store block_iface=BlockDevice store_format=CAPOSST1 write_method=1 read_method=0 put_get_roundtrip_match=true durability_attested=true virtio_regression=green. Because the round-trip issues 8 single I/O commands (2 seed WRITEs + 2 deferred-mount READs + 3 Store.put WRITEs + 1 Store.get READ) whose last monotonic CQ head reaches 8 – past the default depth-8 first pass – the build raises device_manager::stub::NVME_IO_QUEUE_DEPTH from 8 to 16; the change is inert for every other NVMe proof build. Proof: make run-cloud-provider-persistent-store-over-nvme. No schema/binding change (Store.put/get/has/delete and BlockDevice.writeBlocks @1/readBlocks @0 round-trip through existing bindings). Future work (not yet implemented): routing the writable filesystem (CAPOSWF1) over the NVMe write arm, a real NVMe FLUSH on flush @3 (stays fail-closed), an NVMe reboot-persistence pass, a dedicated I/O-completion Interrupt route on the data path, NLB > 1 spanning multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature into always-built production.

  • cloud-prod-writable-fs-over-nvme-blockdevice-local-proof mounts the full disk-backed writable filesystem over the NVMe arm (kernel/src/cap/writable_fs.rs, the CAPOSWF1 node-table tree with mkdir / rename / remove and a fail-closed single-writer policy). writable_fs carries a read+write BlockSource seam mirroring persistent_store’s: the Virtio variant (built only in the qemu/ installable storage builds) routes to the crate::virtio free functions byte-identically (folding the data_region_base_lba() offset), so make run-storage-writable / make run-storage-writable-recovery stay green; the Nvme variant (built only under cloud_writable_fs_over_nvme_proof) reads through nvme_brokered_io_sync_read_window_op_for_cap and writes through nvme_brokered_io_sync_write_window_op_for_cap, one 4 KiB PRP1 page per call. Because writable_fs uses a process-wide singleton volume, the NVMe writable_fs_root grant stages the live device_mmio handle and defers the singleton mount-parse to the first Directory/File call. The proof (kernel/src/cap/writable_fs_over_nvme_proof.rs, a clone of the persistent-store module that supersedes and drops it) seeds a CAPOSWF1 superblock + root + one seeded file through the @20 op, contiguously from LBA 256 (superblock @256, node table @257, seeded file extent @258), then exercises the granted filesystem. READ arm: opening the seeded file triggers the deferred mount, which reads the seeded extent (@258) back through BlockDevice.readBlocks @0; File.read returns the RAM copy the mount loaded. WRITE arm: File.write to a fresh file lands a bump-allocated data extent (@259) + node-record + superblock through BlockDevice.writeBlocks @1. Because File.read serves the RAM content cache the mount loaded (not a fresh disk read), a same-extent disk re-read of a just-written file – which needs a remount/ reboot – is out of scope; the matching block-0 digests prove the same payload traversed both device arms, each device-acked. The single-writer policy is proven intact: a File.write through a second granted writable_fs_root cap fails closed. It emits one cloudboot-evidence: provider-writable-fs-over-nvme <token> marker labeled write_path=file-write-over-blockdevice-writeblocks read_path=file-read-over-blockdevice-readblocks consumer=writable-fs fs_iface=Directory file_iface=File block_iface=BlockDevice fs_format=CAPOSWF1 write_method=1 read_method=0 write_read_roundtrip_match=true durability_attested=true single_writer_policy=enforced second_writer_denied=true recovery_over_nvme=deferred virtio_regression=green. The round-trip issues 11 single I/O commands (3 seed WRITEs + 3 deferred-mount READs + 5 File.write WRITEs); NVME_IO_QUEUE_DEPTH stays 16. Proof: make run-cloud-provider-writable-fs-over-nvme. No schema/binding change. Future work (not yet implemented): the unclean-shutdown / forced- poweroff recovery window (recovery_crash_after_record) over the NVMe arm (the analogue of run-storage-writable-recovery, proved on virtio here), a real NVMe FLUSH on flush @3, an NVMe reboot-persistence pass, a dedicated I/O-completion Interrupt route on the data path, NLB > 1 spanning multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature into production.

  • cloud-prod-writable-fs-over-nvme-recovery-local-proof proves the unclean-shutdown / forced- poweroff RECOVERY window (recovery_crash_after_record, the record-sector-written- but-superblock-not-yet-committed window in kernel/src/cap/writable_fs.rs) over the NVMe BlockDevice arm. A new cloud_writable_fs_over_nvme_recovery_proof feature implies (and supersedes the happy-path proof module/route/init of) cloud_writable_fs_over_nvme_proof and widens the storage_writable_recovery crash-window cfg gate so the same recovery-orphan.txt sentinel arms an induced forced poweroff when the writable filesystem is NVMe-backed. The recovery cap-waiter module (kernel/src/cap/writable_fs_over_nvme_recovery_proof.rs) reuses the NVMe BlockSource arm, deferred mount, third WritableFsRoot grant arm, window ops, and I/O queue create unchanged. Unlike the happy-path proof, the CAPOSWF1 image is HOST-BUILT (tools/mkstore-image --writable-nvme lays an empty superblock + root-only node table) rather than seeded through @20: a two-boot SAME-image recovery flow cannot re-seed on pass 2 without clobbering the pass-1 committed state, and the read ordering does not depend on a per-boot seed here (mirroring the virtio run-storage-writable-recovery proof, which also boots a host-built image twice). make run-cloud-provider-writable-fs-over-nvme-recovery boots QEMU twice with -device nvme against one shared raw drive file: pass 1 commits a File.write + sub-directory through writeBlocks @1, allocates the sentinel (its record sector lands on the namespace), and spins; the harness kill -9s QEMU before the superblock commit. Pass 2 boots the SAME file, mounts by reading the old superblock + node table back through readBlocks @0, and asserts the recovered tree omits the orphan slot (exactly the committed entries remain), preserves the committed mutation (file size + content), accepts a usable post-recovery write, and denies a second-grant write (single-writer policy). The userspace smoke emits one cloudboot-evidence: provider-writable-fs-over-nvme-recovery <token> marker labeled crash_window=record-written-superblock-uncommitted orphan_slot_ignored=true committed_mutation_survived=true post_recovery_write_ok=true recovery_over_nvme=true single_writer_policy=enforced durability_basis=host-page-cache real_flush=deferred reboot_persistence=deferred io_completion=polled interrupt_wait=not-used-on-data-path virtio_recovery_regression=green live_cloud=not-attempted; the kernel proof module’s on_release independently attests the cap-waiter route lifecycle + the two CREATE I/O queue dispatch/ack cycles. The two CREATE I/O queue commands keep their production Interrupt.wait/acknowledge cap-waiter cycles; the data path stays polled. Bounded-proof caveat: one record-vs-commit window, host-page-cache durability (the two passes share one backing file; a kill -9 preserves the host page cache), NOT media crash-consistency and NOT a real NVMe FLUSH barrier. No schema/binding change. Future work (not yet implemented): a real NVMe FLUSH on flush @3, an NVMe clean- reboot-persistence pass, NLB > 1 spanning multiple PRP pages, a dedicated I/O-completion Interrupt route on the data path, and graduating the NVMe data plane into production. Proof: make run-cloud-provider-writable-fs-over-nvme-recovery.

  • cloud-prod-nvme-consumer-sync-to-flush-local-proof routes a consumer-level File.sync @4 to a real BlockDevice.flush @3 NVMe FLUSH media barrier instead of a write-side no-op. writable_fs::BlockSource carries a flush() arm (the Virtio variant returns Ok(()) – the driver negotiates no VIRTIO_BLK_F_FLUSH, so virtio File.sync stays a byte-identical no-op and make run-storage-writable stays green; the Nvme variant drives nvme_brokered_io_sync_flush_op_for_cap with the same success predicate the read/write arms apply), and File.sync @4 (writable_fs.rs) routes through it AFTER the claim_writer gate. The feature cloud_nvme_consumer_sync_to_flush_proof composes cloud_nvme_blockdevice_flush_crash_consistency_proof (arming the real flush @3 op-for-cap) and cloud_writable_fs_over_nvme_proof (the consumer arm), dropping both predecessors’ proof modules. The proof (kernel/src/cap/nvme_consumer_sync_to_flush_proof.rs) seeds the CAPOSWF1 image, then Directory.open(CREATE) + File.write (through writeBlocks @1), then File.sync() – which issues the real NVMe FLUSH (opcode 0x00, CQE status 0, no data transfer: PRP1 = 0, PRP2 = 0) – then File.read confirming the bytes survive. The single-writer policy is shown intact: a File.sync through a second granted cap fails closed BEFORE any FLUSH is issued (denied_sync_issues_no_flush=true), and the kernel asserts exactly one consumer-sync FLUSH (status 0) was recorded. It emits one cloudboot-evidence: provider-nvme-consumer-sync-to-flush <token> marker labeled consumer_sync_path=File.sync-to-nvme-flush sync_method=4 flush_method=3 nvme_flush_issued_by_consumer_sync=true nvme_flush_opcode=0x00 flush_cqe_status=0 write_sync_read_roundtrip_match=true single_writer_policy=enforced durability_proof=consumer-sync-issues-real-flush virtio_sync_noop=byte-identical. The round-trip issues 12 single I/O commands (3 seed WRITEs + 3 mount READs + 5 File.write WRITEs + 1 File.sync FLUSH); NVME_IO_QUEUE_DEPTH stays 16. Proof: make run-cloud-provider-nvme-consumer-sync-to-flush. No schema/binding change. The bounded claim is consumer-sync-issues-real-flush, NOT a power-loss survival differential (unflushed_rollback=not-provable-under-qemu-nvme-model; the cross-boot forced-poweroff differential stays as the crash-consistency proof established it). Future work (not yet implemented): routing the persistent Store’s commit path through the FLUSH, graduating the NVMe data plane into production, NLB > 1 spanning multiple PRP pages, and a dedicated I/O-completion Interrupt route on the data path.

  • cloud-prod-nvme-persistent-store-sync-to-flush-local-proof routes the persistent Store’s put-commit path to a real BlockDevice.flush @3 NVMe FLUSH media barrier. The Store has NO sync schema method (put @0 / get @1 / has @2 / delete @3 only), so the routing point is the existing put-commit path, not a new method: persistent_store::BlockSource gains a flush() arm (the Virtio variant returns Ok(()) – no VIRTIO_BLK_F_FLUSH, so the virtio Store.put commit stays a byte-identical no-op and make run-storage-persist stays green; the Nvme variant drives nvme_brokered_io_sync_flush_op_for_cap with the same success predicate the read/write arms apply, but only when the flush lineage is composed so the plain make run-cloud-provider-persistent-store-over-nvme commit stays a no-op), and put_blob (persistent_store.rs) issues it AFTER the flush_superblock write (the ordering commit point) succeeds. A FLUSH that fails closed rolls back the in-RAM entry_count/next_free_sector so no live index insert occurs. The feature cloud_nvme_persistent_store_sync_to_flush_proof composes (and drops/supersedes) the cloud_nvme_consumer_sync_to_flush_proof lineage (transitively the real flush @3 op and the persistent-store-over-NVMe read+write seam). The proof (kernel/src/cap/nvme_persistent_store_sync_to_flush_proof.rs) seeds the CAPOSST1 image, then Store.put(data) – which writes the data extent / entry sector / superblock through writeBlocks @1, then issues the real NVMe FLUSH (opcode 0x00, CQE status 0, no data transfer: PRP1 = 0, PRP2 = 0) after the superblock commit – then Store.get(hash) confirming the bytes survive. The kernel asserts exactly one Store-commit FLUSH (status 0) recorded AFTER the superblock write (superblock_commit_before_flush=true) and emits one cloudboot-evidence: provider-nvme-persistent-store-sync-to-flush <token> marker labeled consumer_commit_path=store-put-to-nvme-flush put_method=0 flush_method=3 nvme_flush_issued_by_store_commit=true nvme_flush_opcode=0x00 flush_cqe_status=0 superblock_commit_before_flush=true put_get_roundtrip_match=true failed_flush_issues_no_live_entry=true virtio_commit_noop=byte-identical durability_proof=store-commit-issues-real-flush. The round-trip issues 9 single I/O commands (2 seed WRITEs + 2 mount READs + 3 Store.put WRITEs + 1 put-commit FLUSH + 1 Store.get READ). Proof: make run-cloud-provider-nvme-persistent-store-sync-to-flush. No schema/binding change. The bounded claim is store-commit-issues-real-flush, NOT a power-loss survival differential (unflushed_rollback=not-provable-under-qemu-nvme-model). Future work (not yet implemented): graduating the NVMe data plane out of the per-proof features into always-built production, with the dedicated I/O-completion Interrupt route on the data path and NLB > 1 spanning multiple PRP pages.

  • cloud-prod-nvme-sync-io-state-seam-always-built-local-proof extracts the brokered-NVMe synchronous-I/O state the shared op body depends on into ONE always-built module device_manager::nvme_sync_io_state (kernel/src/device_manager/nvme_sync_io_state.rs), compiled in the default no-proof cargo build (not behind any cloud_nvme_*_proof feature). The seam owns: the functional I/O SQ/CQ reservation cursor (reserve_io_slot() = a queue-depth-bounded first-pass slot reservation, with one live in-flight reservation, so a single-call command cannot reuse a stale CQE before the created queue wraps); the admissions predicate (sync_{read,write,flush}_admitted = bounded-ledger-not-full plus no active reservation); and the ordered SyncIoRecord op-log ledger (record_sync_{read,write,flush}, DIGEST_BYTES, op kind implied by record.opcode = 0x00 FLUSH / 0x01 WRITE / 0x02 READ). The shared body nvme_brokered_io_sync_command and the nvme_brokered_io_sync_{read_window,write_window,flush}_op_for_cap / nvme_brokered_io_sync_read_bytes_op_for_cap entries (kernel/src/device_manager/stub.rs) now record/admit/index through this seam instead of the per-proof nvme_io_proof alias; the alias stays only for the genuinely per-proof create/io-phase/seed ledgers (record_create_*, record_io_*, next_io_phase, next_seed_slba, seed_image_sector). All 15 NVMe BlockDevice proof modules (kernel/src/cap/*_proof.rs) delegate to the seam: each deletes its private SyncIoRecord/SyncHandoff/record/admit/index copy and reconstructs its release-marker view from the seam’s ordered op-log snapshot, byte-identical. The create-ordering / write-before-read orderings the per-proof harnesses formerly folded into their admit predicates are proof-harness assertions each proof still re-derives in its release marker; the always-built admit keeps only the production-honest bounded-ledger invariant (a real read needs no prior write). Code-location refactor only – no schema/binding change, no new device behavior; default cargo build and cargo build --features qemu stay warning-free (the seam carries a module-level #[allow(dead_code)] for its dormant not-yet-activated entry points). This UNBLOCKS the read-arm graduation: the read body’s sync-I/O symbols now resolve in the default build, so the graduation can make the read body always-built and gate activation behind a fail-closed runtime probe while the proof exercises the same always-built seam. Proof: every existing NVMe BlockDevice proof stays green (make run-cloud-provider-nvme-blockdevice-arbitrary-lba-read and the rest of the chain) and make run-net is byte-identical.

These brokered capabilities target the no-IOMMU QEMU/GCP lane, where queue-base and PRP addresses are materialized by the kernel/device manager from the live ledger. On a direct-remapping/vIOMMU gate the provider-written validator model (nvme-userspace-bind-and-controller-bringup) applies instead. The PCI metadata-only discovery summary (pci: nvme metadata ...) that also runs on make run-pci-nvme is the separate enumeration-evidence surface in kernel/src/pci.rs.

1. Spec basis

  • Device: NVM Express PCI controller. PCI class 0x01 (mass storage), subclass 0x08 (NVM), programming interface 0x02 (NVM Express). Detected by PciDevice::is_nvme_controller (kernel/src/pci.rs, NVME_CLASS_MASS_STORAGE / NVME_SUBCLASS_NVM / NVME_PROG_IF_NVM_EXPRESS). QEMU instance: -device nvme,drive=...,serial=... on the q35 machine.
  • Authoritative spec: NVM Express Base Specification (NVMe 1.4 / 2.0). The fields the validator relies on:
    • Controller registers CAP, CC (with CC.EN controller enable), AQA, ASQ, ACQ (NVMe Base Spec §3.1 controller register map). ASQ/ACQ base addresses have bits 11:0 reserved → 4 KiB page-aligned.
    • Submission/completion queue base addresses and the per-queue doorbell registers in the doorbell stride region (§3.1.x, §7.6 queue setup).
    • Physical Region Page (PRP) entries PRP1/PRP2 and the PRP List (§4.3): PRP list pages and list-pointer PRP2 entries are page-aligned; a transfer that needs more than one PRP list page chains a further list, which this bounded subset does not follow.
  • Reference driver (optional cross-check): the Linux drivers/nvme/host/ queue-setup and PRP-build paths (nvme_setup_prps, nvme_pci_configure_admin_queue).

2. Wire format (validator-relevant subset)

The validator reads only the device-visible addresses a single doorbell newly publishes, plus the byte extent of the region each names. It does not decode command opcodes, data payloads, or completion entries. Scanned items are modeled by ScanItem (kernel/src/cap/nvme_doorbell_validator.rs).

  • Queue-base registers (ScanItem::QueueBase): the ASQ/ACQ admin queue bases (scanned on the CC.EN / queue-arm write, ScanKind::QueueArm) and the I/O SQ/CQ bases. The named region is entries × entry_size bytes (e.g. an admin SQ is depth × 64, an admin CQ is depth × 16). Required alignment: page (4 KiB).
  • SQ entry PRP pointers (ScanItem::Prp): for each NVMe command newly made visible by an SQ tail doorbell (ScanKind::SqTailDoorbell), the PRP1 data pointer, the PRP2 data-or-list pointer, and one level of PRP-list indirection. list_depth counts indirection already followed (0 = a PRP carried in the SQE, 1 = a pointer inside the single PRP list page); list_depth > 1 is the out-of-subset deeper-chain case and fails closed (MAX_PRP_LIST_DEPTH). The named region is the transfer length (PRP data) or one page (a PRP list). Required alignment: page (4 KiB).

The scan is on-notify only: the provider may freely write its own mapped DMA pages between doorbells; nothing device-reachable happens until a doorbell rings, which is the single choke point the validator guards. Cost is O(descriptors published by this doorbell).

3. capOS mapping

The validator is the kernel half of the Model B genuine-userspace-driver model (docs/proposals/nvme-model-b-doorbell-dma-validator.md): the provider writes the device-visible queue-base and PRP addresses itself, and the kernel validates them on the doorbell path rather than minting them (Model A, the unchanged virtio-net TX path).

  • Authority gate: the live doorbell-path hook derives the owning provider’s identity (OwnerToken) and live grant generation from the DeviceMmio grant record in the device-manager ownership ledger, never from provider-supplied bytes. The cfg(qemu) self-test resolves owner/generation from synthetic windows only.
  • DeviceMmio: the validator is invoked from the pre-write step of the NVMe doorbell/queue-arm selected-write DeviceMmio claim (kernel/src/cap/device_mmio.rs; the existing notifyDoorbell path is the Model A virtio-net claim and does not trigger the validator). The scan completes — accept or reject — before the doorbell write is allowed to take effect, so the device never sees an unvalidated descriptor batch. BAR0 / doorbell pages stay device-uncacheable, NX, capability-scoped.
  • DMAPool / window descriptor: for a direct-remapping/vIOMMU lane, a DmaWindow can name the owner’s domain-scoped IOVA range with a live generation, and provider-written values can be checked against that range. On the current no-IOMMU lane, there is no provider-visible non-host-physical device-address namespace; the manager owns the physical bounce pages and must materialize queue-base and PRP/SGL fields itself. See docs/dma-isolation-design.md (Provider-Written Addresses And No-IOMMU Brokered Bounce).
  • Interrupt: completion_wakes_waiter enforces the stale-completion gate — a completion wakes a waiter only if its submission scan was accepted and the generation it was validated under is still live; an unvalidated or retired-generation completion does not wake a waiter.
  • Fail-closed / validation rules (ScanReject, all reject with no doorbell write and no waiter wake): out-of-window, host-physical, cross-owner-alias, region-overrun, unaligned, deep-prp-chain, stale-generation, and invalid-region. A doorbell rung after revoke/reset/regrant against a stale generation fails closed even when the byte value would have been in-window for the prior grant.
  • QEMU-emulable vs hardware-only: the validator mechanism and its hostile-scan invariants are end-to-end provable in QEMU (make run-pci-nvme, the nvme: validator ... proof lines). Live controller bring-up over a real NVMe controller — admin/I/O queue creation, IDENTIFY, and a bounded read with the validator gating the real doorbell — is QEMU-emulable too and is covered by the brokered bring-up capabilities (§4-§9), not the validator mechanism itself.

4. Userspace bind (read-only controller bring-up)

nvme-bind-claimed-mmio-read stands up the userspace storage-provider bind foundation over the existing DDF driver foundation. It binds the controller read-only – no register write, no DMA submission, and no doorbell – and so leaves the controller’s existing (firmware-initialized) state untouched.

  • Enumerate → claim → BAR0 preseed: bind_qemu_nvme_controller (kernel/src/pci.rs) runs for the first enumerated NVMe controller after the metadata summary. It preseeds the first decoded memory BAR (BAR0 controller registers) for brokered reads (devicemmio_grant_source::preseed_read32_for_device), then claims the function and parks it under DeviceOwner::ManagerGrantSource. On any staging failure it falls back to the pci: nvme no-authority/no-driver line (fail-closed): a partially-staged authority surface is never advertised.
  • Grant-source staging: the same device-agnostic {devicemmio,dmapool,interrupt}_grant_source::init_for_device the virtio path uses stage the bootstrap grants against the claimed NVMe handle. The virtio-net-specific provider-notify/doorbell selected-write claim is not staged here — that is the controller-enable path (§6). run-pci-nvme boots with no virtio devices, so the singletons are free for NVMe; run-net / run-ddf-provider-consumer (no -device nvme) keep the virtio bind untouched.
  • Brokered register reads: the nvme-bringup-smoke provider (demos/nvme-bringup-smoke/) holds the manifest-granted console/dmapool/device_mmio/interrupt caps and reads CAP (0x0, 0x4), VS (0x8), CC (0x14), and CSTS (0x1c) through DeviceMmio.read32 (the brokered boot-preseeded mapping in device_manager::read_devicemmio_u32). It proves the bound claim reaches a coherent NVMe BAR0 by requiring a live CAP (non-zero, non-floating) and a valid VS version (NVMe Base Spec §3.1.1/§3.1.2; QEMU reports 1.4.0), and reports the observed CC.EN/CSTS.RDY (§3.1.5/§3.1.6, bit 0 of each).
  • Firmware-initialized controller: under QEMU’s SeaBIOS BIOS boot, the NVMe boot-probe enables the controller (CC.EN=1, CSTS.RDY=1) before init runs, so the read-only bind observes a live controller. Bringing it to a known reset state (CC.EN=0, wait CSTS.RDY=0) before re-enabling with provider-owned admin queues is the controller-enable path’s responsibility (§6), not this read-only bind.
  • Proof line: the userspace [nvme-bringup-smoke] controller-bind ok ... mmio_read=brokered controller_state=firmware-enabled read proof, asserted by tools/qemu-pci-nvme-smoke.sh. The kernel bind line advanced from controller_init=read-only-bind to controller_init=reset-capable-bind when §5 added the CC selected-write claim to the same grant staging; the read proof itself is unchanged.

5. Userspace controller reset (selected-write CC claim)

nvme-controller-reset-selected-write is the first genuine userspace NVMe controller-register write: it brings the firmware-enabled controller to a known reset state. It does not enable the controller, program admin/IO queue bases, submit DMA, or ring a doorbell – controller enable publishes admin queue-base addresses and is the validator-gated path in §6.

  • NVMe CC selected-write claim: the NVMe bind stages the DeviceMmio grant region with a reset-only selected-write claim (DeviceMmioWrite32ClaimProvider::NvmeControllerRegister, device_manager::nvme_controller_register_grant_regionproofs::nvme_controller_register_region), scoped to the CC register (0x14, NVMe Base Spec §3.1.5). bind_qemu_nvme_controller (kernel/src/pci.rs) now calls devicemmio_grant_source::init_nvme_controller_for_device instead of the plain init_for_device, so the single granted cap carries both the brokered read surface and the CC write claim.
  • Value-flexible, scoped: unlike the virtio variants (which pin an exact (offset, value) pair), the NVMe claim is offset-scoped and value-flexible only for reset in validate_devicemmio_write32_claim (kernel/src/device_manager/qemu_full.rs): it admits any CC write whose CC.EN (bit 0) is clear – the read-modify-write reset – directly, fails closed on raw CC.EN=1 writes with devicemmio-nvme-cc-enable-raw-blocked, and fails closed on any write to a non-CC offset (unclaimed-register-write). Refused writes perform no MMIO.
  • Reset sequence: the nvme-bringup-smoke provider reads CC, writes it back with CC.EN cleared through DeviceMmio.write32 (the volatile MMIO write in device_manager::write_devicemmio_u32, resolved through the boot-preseeded BAR0 mapping), and polls CSTS (§3.1.6) until CSTS.RDY clears. QEMU clears CSTS.RDY synchronously on the CC.EN=1→0 write (nvme_ctrl_reset).
  • No DMA validator involvement: a reset write (CC.EN clear) publishes no queue-base or PRP addresses, so the Model B on-notify validator (§2/§3) is not invoked. The validator is invoked on the explicit brokered controller-enable queue-arm path (§6).
  • Proof lines (asserted by tools/qemu-pci-nvme-smoke.sh): pci: nvme userspace-bind ... controller_init=reset-capable-bind ... cc_selected_write=staged (kernel), [nvme-bringup-smoke] cc-raw-enable-refused ..., [nvme-bringup-smoke] non-cc-write-refused ..., and [nvme-bringup-smoke] controller-reset ok ... csts_rdy_before=1 csts_rdy_after=0 cc_en_after=0 reset_write=performed ... (userspace).

6. Brokered controller enable (no-IOMMU, manager-authored admin queues)

nvme-no-iommu-brokered-controller-enable enables the controller on the no-IOMMU make run-pci-nvme gate without exporting a host physical (== the device-visible address on the bounce shape) or raw IOVA to the provider. The provider invokes the explicit no-parameter DeviceMmio.brokeredNvmeControllerEnable verb (schema @6); the manager authors every address-bearing register and the selected CC value from its live DMA ledger. Raw DeviceMmio.write32(CC, value with CC.EN=1) fails closed before any MMIO side effect. The earlier provider-written Model B enable (nvme-userspace-bind-and-controller-bringup) stays blocked: it would require the provider to author a device-visible queue-base, which the reviewed iova_export=disabled-future-only discipline forbids on this gate.

  • Brokered admin queue memory: the provider allocates the admin submission and completion queue pages through the device’s DMAPool authority (DmaPool.allocateBuffer), maps each read-write to fill it, and unmaps. By convention the manager reads the admin SQ from pool slot 0 (NVME_ADMIN_SQ_POOL_SLOT) and the admin CQ from slot 1 (NVME_ADMIN_CQ_POOL_SLOT). The pages stay live in the manager ledger; DmaBuffer.info continues to report device_iova=0, iova_export=disabled-future-only, host_physical_user_visible=false.
  • Manager-authored queue-base registers: the @6 method dispatches through nvme_brokered_controller_enable_op_for_cap (kernel/src/device_manager/qemu_full.rs), which validates the cap against the live NVMe controller-register claim and then calls nvme_brokered_admin_queue_enable. It resolves the admin SQ/CQ pages (record.attached_dmapools[..].proof_buffers[slot].page), then authors AQA (0x24, zero-based admin queue sizes), ASQ (0x28/0x2c), and ACQ (0x30/0x34) from the ledger page physical addresses via nvme_authored_register_write (volatile writes resolved only through the boot-preseeded BAR0 mapping), and finally performs the manager-selected CC.EN | IOSQES=6 | IOCQES=4 write. No provider-supplied controller bits or address-bearing value reaches the controller.
  • Validator on the queue-arm path: before any register write the authored ASQ/ACQ bases are passed through the Model B on-notify DMA validator (crate::cap::nvme_doorbell_validator::validate_doorbell_scan, ScanKind::QueueArm). On this path the windows and the scanned items both derive from the same kernel ledger pages, so it is a self-consistency check on the kernel-authored bases: it proves page alignment and in-window containment of the named queue region (entries * entry_size – the admin SQ 128 B / CQ 32 B fit the 4 KiB page). The owner-identity, cross-owner-alias, host-physical, and stale-generation rejections are structurally unreachable here because both sides of each comparison come from the manager (the real authority gate against a stale/foreign page is the live-ledger membership check below); those hostile rejections are exercised by the bounded cfg(qemu) self-test (§3). A reject still fails closed before the CC.EN write.
  • Fail-closed before enable: raw write32(CC, CC.EN=1) returns devicemmio-nvme-cc-enable-raw-blocked before any MMIO. The explicit manager-op’s real authority gate is live-ledger membership – an enable request with the admin queue pages unallocated, freed, or in-flight returns nvme-admin-queues-not-armed (devicemmio-nvme-cc-enable-not-armed) with no MMIO side effect, covering the out-of-order manager operation and the post-free stale re-enable. A validator reject returns devicemmio-nvme-cc-enable-validator-reject.
  • Teardown under live admin queues: reset (CC.EN=0) quiesces the controller (CSTS.RDY clears) before the admin queue pages are reused; DmaBuffer.freeBuffer then scrubs each page before the frame is freed (page_scrubbed_before_frame_free=true), and a subsequent enable with the queue memory gone fails closed. The enable path submits no admin commands, so there are no live completions or waiters during teardown; the “stale/unvalidated completion does not wake a waiter” property (completion_wakes_waiter) is proven by the bounded cfg(qemu) self-test (§3), not by the live admin-queue teardown.
  • Proof lines (asserted by tools/qemu-pci-nvme-smoke.sh): [nvme-bringup-smoke] admin-queue-allocated ... (userspace), nvme: brokered-enable owner=nvme-storage trigger=manager-op admin_sq_slot=0 admin_cq_slot=1 validator=queue-arm scanned_items=2 aqa=0x00070007 cc=0x00460001 asq_authored=true acq_authored=true cc_en_write=performed cc_bits_selected_by=manager queue_base_source=manager-ledger host_physical_user_visible=false ... (kernel), [nvme-bringup-smoke] controller-enable ok ... cc_en_after=1 csts_rdy_after=1 ... brokered_enable_trigger=manager-op ..., [nvme-bringup-smoke] teardown-reset ok ... quiesced=true, [nvme-bringup-smoke] admin-queue-freed ... page_scrubbed_before_frame_free=true, and [nvme-bringup-smoke] stale-enable-refused ... brokered_enable_trigger=manager-op reason=nvme-admin-queues-not-armed (userspace).
  • Not in scope (this path): I/O queues, read/write commands, cloud evidence, and host-physical/IOVA export are out of scope for the enable path. hostile_hardware_isolation=not-claimed; the brokered no-IOMMU enable is not hostile-hardware isolation. One brokered IDENTIFY admin command is in §7.

7. Brokered admin command + IDENTIFY (no-IOMMU)

nvme-admin-queue-identify extends the brokered no-IOMMU lane to one admin command. After the §6 enable, the provider submits a single IDENTIFY (controller) admin command and consumes its completion from its own mapped admin CQ. As on the enable path, the manager authors every address-bearing field; the provider supplies only the non-addressing command dwords and the doorbell index. nvme-admin-interrupt-delivery then makes that completion interrupt-driven: the provider unmasks the admin completion interrupt route and blocks on Interrupt.wait, the kernel wakes the live waiter through the device-interrupt dispatch path, and only then is the completion consumed from the mapped CQ.

  • Admin command (wire subset): a 64-byte submission queue entry (NVMe Base Spec §4.2). The provider writes opcode 0x06 (IDENTIFY, §5.17) at byte 0, a command id at bytes 2:3, NSID=0, and CNS=0x01 (Identify Controller) in CDW10 at bytes 40:43 into the mapped admin SQ page. It leaves the address-bearing MPTR (bytes 16:23), PRP1 (bytes 24:31), and PRP2 (bytes 32:39) zero; the manager overwrites them. The IDENTIFY data structure is 4096 bytes, so a single page-aligned PRP1 covers it and PRP2 stays zero.
  • Doorbells: the same NvmeControllerRegister claim covers the admin SQ tail doorbell (0x1000) and CQ head doorbell (0x1004) – admin queue 0 with doorbell stride CAP.DSTRD=0 (NVMe Base Spec §3.1.24/§3.1.25; nvme_brokered_admin_sq_doorbell re-reads CAP.DSTRD and fails closed on a non-zero stride). The doorbell value (the tail/head index) is not address-bearing; the manager bounds it to <= NVME_ADMIN_QUEUE_DEPTH and performs the write.
  • Manager-authored PRP on the submit path: a write to the SQ tail doorbell is routed by validate_devicemmio_write32_claim (NvmeBrokeredWriteOp::AdminSqTailDoorbell) to nvme_brokered_admin_sq_doorbell (kernel/src/device_manager/mod.rs). It resolves the live admin SQ (slot 0), CQ (slot 1), and IDENTIFY data (slot 2, NVME_ADMIN_DATA_POOL_SLOT) ledger pages, authors the SQE’s MPTR/PRP1/PRP2 from the data page physical address through the SQ page’s HHDM mapping (nvme_author_admin_sqe_prp), fences, then rings the doorbell via nvme_authored_register_write. No provider-supplied address reaches the controller.
  • Validator on the SQ-tail path: before authoring the SQE the data-buffer PRP1 is passed through the Model B on-notify DMA validator (validate_doorbell_scan, ScanKind::SqTailDoorbell, ScanItem::Prp { list_depth: 0 }): page alignment and in-window containment of the 4 KiB data region against the data page’s own device-visible window. A reject returns devicemmio-nvme-admin-submit-validator-reject with no doorbell write. As on the queue-arm path the window and scanned item are both manager-derived, so the hostile owner/host-physical/stale rejections are exercised by the cfg(qemu) self-test (§3); the live authority gate is live-ledger membership.
  • Interrupt-driven completion wake (nvme-admin-interrupt-delivery): after ringing the SQ tail doorbell, the provider unmasks the admin completion interrupt route and blocks on Interrupt.wait. The route is the NVMe controller’s bootstrap Interrupt grant (DeviceOwner::ManagerGrantSource, MSI-X table entry 0, role GrantSource); the kernel wakes the live waiter with a real LAPIC dispatch routed through the device-interrupt dispatch slot plus the deferred-EOI and waiter path – the same grant-source delivery model proven by make run-interrupt-grant and used by the virtio-net provider. The wait returns result=interrupt-delivered real_interrupt_delivery=delivered wake_blocked=false with the route’s dispatch delivery_count incremented. This is a kernel-injected dispatch at the route’s programmed LAPIC vector, not a device-autonomous MSI-X raise: the NVMe MSI-X table is not yet hardware- programmed for an external write (msix_table_programming=not-written, as on the DDF interrupt-grant path), so device-raised MSI-X delivery and MSI-X table programming remain a documented next increment.
  • Completion consume: only after the interrupt-driven wake does the provider read completion queue entry 0 in its mapped admin CQ page (NVMe Base Spec §4.6): a 16-byte entry whose DW3 (bytes 12:15) carries the command id (bits 15:0), the phase tag (bit 16), and the status field (bits 31:17). It checks the status field is success and the command id matches, confirms the controller DMA-wrote the data structure (non-zero PCI Vendor ID at IDENTIFY byte 0; QEMU’s nvme reports 0x1b36), then advances the CQ head doorbell. The completion is thus consumed after an interrupt-driven wake, with the mapped-CQ read as the consume step.
  • Stale/post-reset no-wake: after the IDENTIFY completes, a second live Interrupt.wait waiter is installed on the (driver-unmasked) route, observed to stay pending, and the route is then masked; the live waiter completes result=interrupt-waiter-cancelled reason=route-masked rather than woken (masked_live_waiter_woke=false). At the kernel layer the stale/unvalidated/retired-generation completion no-wake invariant (completion_wakes_waiter) remains proven by the cfg(qemu) self-test (§3, waiter_wake=none).
  • Fail-closed before submit: a SQ tail doorbell with the admin SQ/CQ or data page unallocated, freed, or in-flight returns nvme-admin-command-not-armed (devicemmio-nvme-admin-submit-not-armed) with no MMIO side effect; an out-of-range doorbell index returns devicemmio-nvme-admin-doorbell-out-of-range. Teardown frees the data page first and proves the post-free re-submit fails closed.
  • Proof lines (asserted by tools/qemu-pci-nvme-smoke.sh): nvme: admin-submit owner=nvme-storage admin_sq_slot=0 admin_data_slot=2 validator=sq-tail-doorbell scanned_items=1 command=identify-controller prp1_authored=true ... doorbell_written=performed host_physical_user_visible=false (kernel), [nvme-bringup-smoke] admin-interrupt-route-unmasked ... route_state_after=driver-unmasked, [nvme-bringup-smoke] admin-interrupt-wake result=interrupt-delivered real_interrupt_delivery=delivered wake_blocked=false ... interrupt_driven_wake=delivered (userspace, the interrupt-driven wake), [nvme-bringup-smoke] identify-complete ok command=identify-controller cid=0x0042 status=0x0000 phase=1 ... identify_vid=0x... completion_consumed= mapped-admin-cq-after-interrupt-wake ... (userspace), nvme: admin-complete-ack ... cq_head=1 ... address_bearing=false, [nvme-bringup-smoke] identify-cq-head-advanced ..., [nvme-bringup-smoke] admin-interrupt-stale-no-wake ... result=interrupt-waiter-cancelled ... masked_live_waiter_woke=false, and [nvme-bringup-smoke] stale-submit-refused ... reason=nvme-admin-command-not-armed (userspace).
  • Not in scope (this path): I/O queue pairs, read/write commands, and the remaining out-of-scope items below are covered in §8.

8. Brokered I/O queue pair + bounded READ (no-IOMMU)

nvme-io-queue-and-read extends the brokered no-IOMMU lane to one I/O queue pair and one bounded read – the last piece of the userspace NVMe storage-provider foundation. After the §7 IDENTIFY, the provider creates one I/O queue pair (queue id 1) through admin commands, then issues one READ on it. As on every brokered path the manager authors each command’s address-bearing PRP1 from a live ledger page; the provider supplies only the non-addressing dwords and the doorbell index.

  • I/O queue entry sizes: the re-enable (§6) must program CC.IOSQES (bits 19:16, log2 of the 64 B SQ entry = 6) and CC.IOCQES (bits 23:20, log2 of the 16 B CQ entry = 4) before any I/O queue is created (NVMe Base Spec §3.1.5); a CC.EN 1->0 reset clears all of CC, so the provider sets them explicitly (resulting CC = 0x00460001). Creating an I/O queue with IOCQES/IOSQES unset is refused by the controller (QEMU returns command-specific Invalid Queue Size).
  • Create I/O queue commands (wire subset): CREATE I/O COMPLETION QUEUE (opcode 0x05, NVMe Base Spec §5.3) and CREATE I/O SUBMISSION QUEUE (opcode 0x01, §5.4) are admin commands submitted on the admin SQ. CDW10 carries the zero-based queue size (bits 31:16) and queue id (bits 15:0); the create-CQ CDW11 sets PC=1 with IEN=0 (no I/O interrupt; the completion is polled), the create-SQ CDW11 sets the completion queue id (bits 31:16) and PC=1. PRP1 (the queue base) is left zero by the provider and authored by the manager from the I/O CQ (slot 3) / I/O SQ (slot 4) ledger page.
  • Opcode-directed PRP authoring: a write to the admin SQ tail doorbell is routed to nvme_brokered_admin_sq_doorbell (kernel/src/device_manager/mod.rs), which reads the opcode the provider wrote into the just-published SQE (at SQ index tail - 1) and maps it to the live ledger page whose device-visible address is that command’s PRP1: IDENTIFY -> IDENTIFY data (slot 2), CREATE I/O CQ -> I/O CQ (slot 3), CREATE I/O SQ -> I/O SQ (slot 4). An unrecognized opcode fails closed (devicemmio-nvme-admin-submit-unknown-opcode). The opcode is the only provider-supplied input consulted and is non-addressing.
  • READ command (wire subset): opcode 0x02 (NVM Command Set §3.x), NSID 1, starting LBA 0 in CDW10/11, NLB 0 (zero-based, one block) in CDW12. PRP1 (the data buffer) is authored by the manager from the read data page (slot 5).
  • I/O doorbells: the same NvmeControllerRegister claim covers the I/O SQ tail doorbell (0x1008) and I/O CQ head doorbell (0x100c) – queue id 1 with doorbell stride CAP.DSTRD=0 (SQ y tail at 0x1000 + (2y)*4, CQ y head at 0x1000 + (2y+1)*4). The I/O SQ tail doorbell is routed to nvme_brokered_io_sq_doorbell, which requires the opcode to be READ, materializes the READ PRP1 from the live read data page, validates it through the Model B on-notify validator (ScanKind::SqTailDoorbell), authors the SQE PRP, and rings the doorbell. The I/O CQ head doorbell (nvme_brokered_io_cq_head_doorbell) carries only the consumed-entry head index (no address-bearing field).
  • Completion consume: the create-queue and READ completions are consumed by polling the mapped CQ phase tags. The single kernel grant-source injected interrupt delivery is spent on the §7 admin IDENTIFY wake (delivery_count_before == 0 gates injection to one delivery per route), so an interrupt-driven I/O completion wake awaits the device-autonomous MSI-X table-programming increment that §7 already defers. The provider confirms the controller DMA-transferred real data by checking the harness-seeded LBA 0 signature (0x4f504143 = “CAPO”) in its mapped read data page – proving the read moved bytes through the brokered PRP, not a zero page.
  • Fail-closed before submit: an I/O SQ tail doorbell with the I/O CQ/SQ or read data page unallocated, freed, or in-flight returns nvme-io-command-not-armed (devicemmio-nvme-io-submit-not-armed) with no MMIO side effect; an out-of-range index returns devicemmio-nvme-io-doorbell-out-of-range; an opcode other than READ returns devicemmio-nvme-io-submit-unknown-opcode. Teardown frees the read data page first and proves the post-free I/O re-submit fails closed.
  • Proof lines (asserted by tools/qemu-pci-nvme-smoke.sh): nvme: admin-submit ... command=create-io-cq ... admin_data_slot=3 sq_tail=2, [nvme-bringup-smoke] create-io-cq-complete ok ..., nvme: admin-submit ... command=create-io-sq ... admin_data_slot=4 sq_tail=3, [nvme-bringup-smoke] create-io-sq-complete ok ..., nvme: io-submit owner=nvme-storage io_queue_id=1 io_sq_slot=4 io_read_data_slot=5 ... command=read ... io_sq_tail=1 doorbell_offset=0x1008 doorbell_written=performed host_physical_user_visible=false (kernel), [nvme-bringup-smoke] io-read-complete ok command=read cid=0x0053 status=0x0000 ... io_read_dword0=0x4f504143 ... completion_consumed=mapped-io-cq-polled (userspace, the read data proof), nvme: io-complete-ack ... io_cq_slot=3 cq_head=1 ... address_bearing=false, and [nvme-bringup-smoke] io-stale-submit-refused ... reason=nvme-io-command-not-armed (teardown).
  • Not in scope: device-autonomous MSI-X delivery (hardware MSI-X table programming, a device-raised I/O completion interrupt, and an interrupt-driven I/O completion wake), multi-block / write / scatter-gather (PRP-list) I/O, cloud (GCP/AWS/Azure) enumeration or evidence, and host-physical/IOVA export remain out of scope. hostile_hardware_isolation=not-claimed.

9. Production-path cloudboot proofs (non-qemu cloud kernel)

This section covers the non-qemu cloudboot kernel proofs. The older cloud-prod-storage-bound-local-proof storage-bind path predates the later production-stub NVMe manager operations: it binds DeviceMmio/DMAPool/Interrupt surfaces to one NVMe function and exercises an interrupt-dispatch proxy, but does not attempt controller enable, admin commands, I/O queues, IDENTIFY, READ, or a userspace storage provider.

  • Older storage-bind proxy: cap::storage_bind_proof::report (kernel/src/cap/storage_bind_proof.rs) runs under #[cfg(not(feature = "qemu"))] during kernel::run_init (kernel/src/main.rs). It selects an NVMe function with PciDevice::is_nvme_controller (kernel/src/pci.rs), stages a readback DeviceMmio record through device_manager::stage_bar_readback_region, a parked bounce DMAPool/DMABuffer through device_manager::stage_bounce_buffer_dmapool_record and device_manager::issue_manager_attached_dmabuffer_handle_with_request, and one MSI-X interrupt route through device_interrupt plus mask-first PCI MSI-X table programming. The I/O-completion evidence is a kernel-side proxy: device_interrupt::handle_lapic_delivery advances the live dispatch slot, deferred EOI is acknowledged, masked no-wake is checked, and teardown proves stale route/pool/buffer/MMIO handles fail closed. Its marker is cloudboot-evidence: storage-bound <token>, with summary fields such as nvme_admin_identify=not-attempted, nvme_read_command=not-attempted, and waiter_wake=kernel-side-proxy.
  • Later production-stub manager ops: the non-qemu cloudboot kernel also implements real production-stub NVMe operations for the same local QEMU/cloudboot lane. The read-only bind, reset-only CC.EN=0 selected-write claim, parked admin SQ/CQ/data DMABuffer materialization, brokered controller-enable manager operation (DeviceMmio.brokeredNvmeControllerEnable @6), and brokered admin IDENTIFY Controller manager operation (DeviceMmio.brokeredNvmeAdminIdentify @7) live in kernel/src/device_manager/stub.rs and the production grant-source modules. These operations are not the storage-bind proxy: the manager authors AQA/ASQ/ACQ, CC.EN=1, the fixed IDENTIFY SQE, PRP1, SQ tail doorbell, CQ polling, and CQ head doorbell from its parked ledger. The provider still supplies no host-physical address, IOVA, queue base, PRP/SGL, opcode, command id, doorbell offset, or doorbell value.
  • Local production provider chain: the moved cloud-prod-nvme-brokered-userspace-provider-local-proof parent is closed by production-stub child records over the non-qemu cloudboot kernel. The local QEMU/cloudboot chain reaches split admin completion (@8 / @9 plus Interrupt.wait / Interrupt.acknowledge), I/O queue creation (@10 / @11), bounded READ/WRITE (@12-@15), second-LBA/multiblock I/O (@16-@19), synchronous read/write and read-bytes (@20-@22), BlockDevice.readBlocks / writeBlocks / FLUSH, higher-level filesystem and Store consumers, a dedicated data-path completion Interrupt route, and multi-PRP BlockDevice windows. These are manager-authored brokered operations in kernel/src/device_manager/stub.rs and the production proof modules, not provider-authored Model B doorbell writes.
  • READ-arm graduation to always-built production (cloud-prod-nvme-storage-graduate-readarm-local-proof): the NVMe BlockDevice READ arm is the first capstone piece graduated OUT of the per-proof cloud_nvme_*_proof features into always-built production code. The BlockDeviceBackend::NvmeBrokered arm and its arbitrary-window readBlocks @0 body (kernel/src/cap/block_device.rs, cfg(not(qemu))), the shared read body nvme_brokered_io_sync_command / nvme_brokered_io_sync_read_window_op_for_cap and the brokered controller bring-up registers/helpers it reaches (kernel/src/device_manager/stub.rs), and live_handle_for_nvme_blockdevice now compile in the default no-proof cargo build / make capos-cloudboot-image kernel – the GCE-validated production composition. ACTIVATION is fronted by a fail-closed runtime capability probe kernel/src/nvme_storage_backend.rs (dma_backend.rs-style atomic verdict + select_nvme_blockdevice_handle() resolver): the cap is minted only when a staged device_mmio grant resolves a live brokered-controller handle (recording the verdict), else a typed error is returned – never a panic. The no-NVMe default boot leaves the probe unverified, so the block_device grant fails closed. writeBlocks / flush stay fail-closed on the graduated arm (named follow-up graduations). The graduated data plane is bounded, not a general-purpose driver: every command runs through the synchronous single-call seam (kernel/src/device_manager/nvme_sync_io_state.rs), which admits at most 64 single-call I/O commands per boot (MAX_SYNC_OPS) and permanently rejects further commands at the first I/O CQ wrap (no CQ phase-toggle handling) – both limits fail closed. Namespace geometry is IDENTIFY-derived, not assumed: after the fixed three-command bring-up sequence completes, the first geometry consultation issues one manager-authored IDENTIFY Namespace (CNS 0x00, NSID 1, admin SQ index 3 / tail 4, NVMe Base Spec §5.17) through nvme_namespace_geometry_for_cap (kernel/src/device_manager/stub.rs), parses NSZE plus the active LBA format (FLBAS + LBAF LBADS/MS), caches the verdict for the boot, and emits nvme: brokered-identify-namespace ... nsze=... flbas=... lbads=... supported=.... BlockDevice.info @2 and the readonly_fs/persistent_store/writable_fs NVMe BlockSource::info report this IDENTIFY-derived geometry, and every read/write window bound is enforced against it; while the claim is unavailable (bring-up incomplete, a failed claim reset the controller, or an unsupported format – anything other than 512 B data blocks with no interleaved metadata) those paths fail closed instead of falling back to a fixture constant. Proof: make run-cloud-provider-nvme-blockdevice-read-graduated emits cloudboot-evidence: provider-nvme-blockdevice-read-graduated <token> (read_arm=always-built data_plane_feature_gated=false probe_verdict=verified nvme_read_roundtrip_match=true). This is a local QEMU/cloudboot proof; it does NOT claim a live cloud NVMe run, direct DMA, IOVA export, or a write/durability graduation.
  • Production boundary: one production-stub NVMe path now has live GCE Persistent Disk evidence: provider-nvme-io-read completed one brokered 512-byte READ on run 1780806087-bf69. Other production-stub NVMe proofs remain local QEMU/cloudboot evidence unless their task record explicitly says otherwise. The current evidence still does not claim direct DMA, cloud/guest IOMMU support, provider-visible device addresses, device-autonomous MSI-X delivery, AWS/Azure storage, a reusable storage provider, or full filesystem integration. The NVMe BlockDevice READ data plane is graduated to always-built production (above); other write/FLUSH/filesystem consumers and broader windows have local proof coverage but remain bounded by their recorded production proof gates unless their task record explicitly says the surface was graduated.