NVMe (NVM Express controller)
This is a provenance map for the NVMe controller wire subset the kernel Model B on-notify DMA validator scans on the doorbell/queue-arm path. It cites the spec basis, summarizes only the register and descriptor fields the validator actually reads, and points into the implementation by symbol name. It is not a re-spec.
Maturity caveat. This page documents the DMA validator mechanism, a
brokered no-IOMMU bring-up through one bounded I/O read on the local QEMU
make run-pci-nvme gate, and one bounded live-GCE Persistent Disk proof for the
production provider-nvme-io-read path. It is still not a general
production NVMe driver, not broad GCP/AWS/Azure storage readiness, and not a
provider-visible address or direct-DMA claim. It also records the 2026-05-27
correction: on the current no-IOMMU gate, provider-written queue-base or PRP
addresses would be host physical addresses, so the live no-IOMMU path must be
brokered by the kernel/device manager unless a verified IOMMU/vIOMMU or
synthetic address namespace is added. The capabilities implemented against
make run-pci-nvme and the later production cloudboot gates:
-
nvme-doorbell-dma-validatoris the kernel on-notify scan (kernel/src/cap/nvme_doorbell_validator.rs); it proves its invariants with acfg(qemu)self-test (prove_qemu_on_notify_scan_contract) using synthetic owner windows in place of a live grant ledger. -
nvme-bind-claimed-mmio-readadds the read-only userspace bind (§4): the kernel claims the enumerated controller, preseeds its BAR0 controller-register page, and stages theDMAPool/DeviceMmio/Interruptbootstrap grant sources against it, and the userspacenvme-bringup-smokeprovider readsCAP/VS/CC/CSTSthrough the brokered claim, proving the claim reaches a coherent NVMe BAR0 (liveCAP, validVSversion). The controller is firmware-initialized under SeaBIOS NVMe boot-probe (CC.EN=1,CSTS.RDY=1), so the provider reports the observed enable/ready state rather than asserting reset. -
nvme-controller-reset-selected-writeadds the userspace controller reset (§5): theDeviceMmiogrant now carries a reset-only NVMe controller-register selected-write claim scoped toCC, and the provider drives the firmware-enabled controller to a known reset state (CC.EN=0→CSTS.RDY=0). This is the first genuine userspace NVMe controller-register write. -
nvme-no-iommu-brokered-controller-enableadds the brokered no-IOMMU enable (§6): the kernel authorsAQA/ASQ/ACQfrom the live DMA ledger and performs theCC.ENwrite, reachingCSTS.RDY=1without exposing a queue-base address. -
nvme-admin-queue-identify+nvme-admin-interrupt-deliveryadd the brokered admin SQ/CQ doorbell and one interrupt-drivenIDENTIFY(§7). -
nvme-io-queue-and-readadds the brokered I/O queue pair and one boundedREAD(§8) – the last piece of the userspace storage-provider foundation. -
cloud-prod-nvme-userspace-provider-readonly-bind-local-proofports the same-BDF read-only bind shape onto the non-qemucloudboot kernel under thecloud_nvme_readonly_bind_proofCargo feature. The feature constrains the three production grant sources (devicemmio_grant_source_prod,dmapool_grant_source_prod,interrupt_grant_source_prod) to the NVMe function (class0x01subclass0x08); the userspacecloud-nvme-readonly-bind-smokeprovider receives a same-BDFDeviceMmio/DMAPool/Interruptbundle, readsCAP_LO/CAP_HI/VS/CC/CSTSvia brokeredDeviceMmio.read32, releases the three caps, and asserts stale-handle rejection on each. Proof:make run-cloud-provider-nvme-readonly-bind. NoCC.ENwrite, no admin or I/O queue, noIDENTIFY, noInterrupt.wait, no DMA, no live cloud. -
cloud-prod-nvme-controller-reset-selected-write-local-prooflayers the reset-onlyCCselected-write authority on the same-BDF bundle under thecloud_nvme_controller_reset_proofCargo feature (which impliescloud_nvme_readonly_bind_proof, so the picker constraints are inherited). The kernel admits exactly one brokeredDeviceMmio.write32shape throughkernel::device_manager::stub::write_devicemmio_u32: aCCwrite (offset0x14) whoseCC.EN(bit 0) is cleared. ACCwrite that setsCC.ENfails closed withdevicemmio-nvme-cc-enable-deferred, a write to any non-CCoffset fails closed withdevicemmio-write32-register-unclaimed, and an out-of-range or unaligned offset fails closed at the range validator, all before any volatile MMIO write touches the BAR. The userspacecloud-nvme-controller-reset-smokeprovider receives the same-BDF bundle, readsCAP_LO/CAP_HI/VS/CC/CSTS, exercises the two fail-closed write probes (CC.EN=1, non-CCoffset0x18), performs the admittedCC.EN=0reset write, pollsCSTSuntilCSTS.RDY=0, re-readsCCto assertCC.EN=0, releases the three caps, and confirms stale-handle rejection. Proof:make run-cloud-provider-nvme-controller-reset. NoCC.EN=1write, no admin or I/O queue, noIDENTIFY, noInterrupt.wait, no DMA, no live cloud. -
cloud-prod-nvme-admin-queue-materialization-local-proofmaterializes the admin SQ and admin CQ backing buffers on the same-BDF bundle under thecloud_nvme_admin_queue_materialization_proofCargo feature (which impliescloud_nvme_controller_reset_proof, which impliescloud_nvme_readonly_bind_proof, so the picker constraints and the reset-onlyCC.EN=0claim are inherited). No new kernel admission surface is added: the productiondevice_manager::stubalready supports manager-owned bounce-buffer allocation throughstage_bounce_buffer_dmapool_record+issue_manager_attached_dmabuffer_handle_with_request(a fresh zeroed-on-alloc kernel frame per buffer), scrub-before-frame-free ondetach_dmabuffer_record_for_cap_release, and stale-handle rejection on the parked-slot ledger. The userspacecloud-nvme-admin-queue-materialization-smokeprovider receives the same-BDF bundle, sequentially materializes the admin SQ backing buffer and the admin CQ backing buffer through the brokeredDMAPool.allocateBuffer+DMABuffer.{info,map,unmap,freeBuffer}path (assertinguserspace_dma_buffer=manager-issued-bounce-buffer,iova_export=disabled-future-only,host_physical_user_visible=false, anddevice_iova=0on each), writes and reads back a deterministic 256-byte template through the userspace VMA, asserts the freshly-allocated admin CQ frame reads back as zero before the write (scrub-before-reuse, paired with the admin SQ’s scrub-before-frame-free), confirms post-freeDMABuffer.mapfail-closed on the stale handle, emits onecloudboot-evidence: provider-nvme-admin-queue-materialization <token>marker recording both manager-owned pool/buffer slot/generation identities and the discipline labels, releases the three bundle caps, and confirms stale-handle rejection on each. Proof:make run-cloud-provider-nvme-admin-queue-materialization. No NVMe controller register WRITE on this path (the kernel still admits the reset-onlyCC.EN=0claim from the controller-reset sibling, but this smoke never callsDeviceMmio.write32), noAQA/ASQ/ACQpublication, noCC.EN=1, no I/O queue allocation, noIDENTIFY, no PRP/SGL publication, no doorbell write, noInterrupt.wait/Interrupt.acknowledge, no host-physical or IOVA export, no live cloud. -
cloud-prod-nvme-brokered-controller-enable-local-proofenables the controller through manager-authoredAQA/ASQ/ACQplus a provider-suppliedCC.EN=1write under thecloud_nvme_controller_enable_proofCargo feature (which implies the three earlier features). The productiondevice_manager::stubparked-pool slot holds two simultaneously-live bounce-bufferDMABuffers (PARKED_DMAPOOL_LIVE_BUFFER_CAPACITY = 2,PARKED_DMABUFFER_SLOTS = [1, 2]) so the admin SQ and admin CQ can stay parked together; the bounce-buffer grant proof and the virtio-net live-publish proof keep their existing single-buffer behavior (slot 0). The provider-suppliedCC.EN=1write of this path is superseded bycloud-prod-nvme-controller-enable-manager-op-remediationbelow, which makes rawDeviceMmio.write32(CC, value with CC.EN=1)fail closed before any MMIO side effect and exposes controller enable only through the no-parameterDeviceMmio.brokeredNvmeControllerEnableverb (schema@6). The parked-pool slot capacity, the[1, 2]slot ids, theAQAdepth policy, and the four MMIO writes the manager authors all carry over unchanged. -
cloud-prod-nvme-controller-enable-manager-op-remediationcorrects the brokered enable contract. RawDeviceMmio.write32(CC, value with CC.EN=1)now fails closed withauthority_result=devicemmio-nvme-cc-enable-raw-blocked / authority_reason=cc-enable-requires-broker-nvme-controller-enable-opbefore any volatile MMIO side effect. Controller enable is reachable only through the new no-parameterDeviceMmio.brokeredNvmeControllerEnableverb (schema@6), which carries nooffset,value, queue address, queue id, PRP/SGL, or provider-selected controller-bit parameter. The verb routes to the renamed manager-authorednvme_brokered_controller_enable_op_for_capinkernel/src/device_manager/stub.rs. The manager: (1) validates the cap’s BAR matches the parked region and covers theCC/AQA/ASQ/ACQregister span; (2) resolves the two parked admin queue DMABuffers (slot order: SQ then CQ) and requires both to be live, unmapped, and frame-aligned; (3) selects every controller bit internally –CC.EN | IOSQES=6 | IOCQES=4(NVMe Base Spec §3.1.5); (4) authorsAQA = ((depth-1)<<16) | (depth-1)with depth8,ASQlow/high from the admin SQ buffer’s phys address,ACQlow/high from the admin CQ buffer’s phys address through the boot-preseeded BAR0 kernel mapping; and (5) performs the manager-selectedCC.EN=1write. The provider supplies no parameters and never observes a host-physical / device-visible queue-base address. The cap dispatch admission carriesauthority_result=ok,register_write=performed,side_effect=mmio-write-performed,cc_en_write_performed=true,aqa_authored=true,asq_authored=true,acq_authored=true, andqueue_base_source=manager-ledger. The kernel diagnostic line is nownvme: brokered-enable owner=cloud-nvme model=cloud-bounce validator=none trigger=manager-op admin_sq_slot=1 admin_cq_slot=2 aqa=0x00070007 cc=0x... asq_authored=true acq_authored=true cc_en_write=performed cc_bits_selected_by=manager queue_base_source=manager-ledger host_physical_user_visible=false proof_result=ok;trigger=manager-opproves the admission entered through@6, not through a rawCCwrite32. Thecloud-nvme-controller-enable-smokeprovider proves both the new fail-closed rawCC.EN=1probe and the manager-op enable, and the headlinecloudboot-evidence: provider-nvme-controller-enable <token>marker pinsbrokered_enable_trigger=manager-opandcc_raw_enable_write=refused. Proof:make run-cloud-provider-nvme-controller-enable. NoIDENTIFY, admin or I/O queue command, PRP/SGL publication, doorbell write,Interrupt.wait/Interrupt.acknowledge, host-physical or IOVA export, or live cloud is claimed. -
cloud-prod-nvme-admin-identify-manager-op-local-proofextends the corrected controller-enable surface with one explicit manager-owned admin-command operation:DeviceMmio.brokeredNvmeAdminIdentify(schema@7). The verb carries no parameters; the cap holder may not supply queue addresses, opcode, command id, NSID, PRP/SGL entries, data-buffer address, doorbell offset, or doorbell value. The productiondevice_manager::stubparked-pool slot capacity was extended from two to three simultaneously-live bounce-buffer DMABuffers (PARKED_DMAPOOL_LIVE_BUFFER_CAPACITY = 3,PARKED_DMABUFFER_SLOTS = [1, 2, 3]) so the admin SQ (slot 1), admin CQ (slot 2), and IDENTIFY data page (slot 3) can stay parked together; the controller-enable sibling, the bounce-buffer grant proof, and the virtio-net live-publish proof keep their existing single- or dual-buffer behavior unchanged. The production grant source’s kernel-mapped BAR window was correspondingly widened from one to two pages (MAPPED_WINDOW_BYTES = 0x2000undercloud_nvme_admin_identify_proof) so the admin SQ tail (0x1000) and admin CQ head (0x1004) doorbells fall inside the boot-preseeded mapping the manager already uses forCC/AQA/ASQ/ACQ– rawwrite32to either doorbell offset still fails closed at the device-manager boundary asdevicemmio-write32-register-unclaimed(the offset is outside the reset-onlyCCselected-write claim), and the brokered admin IDENTIFY verb is the only path that may ring them. The handlernvme_brokered_admin_identify_op_for_capinkernel/src/device_manager/stub.rs: (1) validates the cap’s BAR matches the parked region and covers both doorbell offsets and theCSTSregister; (2) resolves the three parked admin DMABuffers and requires all three to be live, unmapped, and frame-aligned; (3) re-readsCSTSthrough the boot-preseeded BAR mapping and refuses ifCSTS.RDY=0; (4) authors the full submission queue entry at admin SQ index 0 through the HHDM kernel mapping of the SQ page – opcodeIDENTIFY(0x06, NVMe Base Spec §5.17), command id 1, NSID 0, MPTR 0, PRP1 = data-page physical address (sourced from the manager’s parked-pool ledger), PRP2 0, CDW10 CNS0x01(Controller); (5) issues a SeqCst fence and rings the admin SQ tail doorbell at BAR0 offset0x1000; (6) polls the admin CQ entry at index 0 through the HHDM kernel mapping of the CQ page for the phase-bit flip (NVMe Base Spec §4.6 CQE DW3 bit 16); (7) inspects the CQE status field (bits 30:17 of DW3) and command-id echo, refusing on either mismatch; (8) parses IDENTIFY Controller VID (offset 0, 2 bytes) and SSVID (offset 2, 2 bytes) through the HHDM kernel mapping of the data page; (9) advances the admin CQ head doorbell at BAR0 offset0x1004. The provider sees only bounded-status labels, the manager-selected CNS/opcode/command-id echoes, the three parked-slot identities, the parsed VID/SSVID, and the doorbell side-effect labels. The cap-side dispatch admission carriesauthority_result=ok,result=ok,register_write=performed,side_effect=mmio-write-performed,sq_doorbell_written=true,cq_doorbell_written=true,completion_consumed=true,cq_status=0x0000,prp_source=manager-ledger, andhost_physical_user_visible=false. The kernel diagnostic isnvme: brokered-admin-identify owner=cloud-nvme model=cloud-bounce trigger=manager-op admin_sq_slot=1 admin_cq_slot=2 admin_data_slot=3 cns=0x01 opcode=0x06 command_id=0x0001 ... cqe_status=0x0000 cqe_command_id=0x0001 sq_tail=1 cq_head=1 cq_phase=1 identify_vid=0x1b36 identify_ssvid=0x1af4 sq_doorbell_written=performed cq_doorbell_written=performed completion_consumed=true prp_source=manager-ledger host_physical_user_visible=false proof_result=ok(QEMU’snvmedevice reports PCI VID0x1b36and SSVID0x1af4, which the harness pins). Thecloud-nvme-admin-identify-smokeprovider exercises the inherited fail-closed raw-write claims (six in total: AQA/ASQ/ACQ + rawCC.EN=1+ raw admin SQ tail/CQ head doorbells), invokes the controller-enable verb at@6, invokes the admin IDENTIFY verb at@7, and emits one headlinecloudboot-evidence: provider-nvme-admin-identify <token>marker plus three supplementary[cloud-nvme-admin-identify-smoke] discipline-*lines that re-anchor the contract within the per-callConsole.writeLinebound. Proof:make run-cloud-provider-nvme-admin-identify. Future work (not yet implemented): I/O queue creation,READ/WRITE,Interrupt.wait/Interrupt.acknowledgeadmin-completion handoff, device-autonomous MSI-X delivery, host-physical/IOVA export, provider-authored SQE/PRP/SGL bytes, provider-authored doorbell offsets/values, and live cloud traffic. -
cloud-prod-nvme-admin-completion-wait-ack-local-proofmoves the admin IDENTIFY completion handoff off manager-internal CQ polling onto the productionInterrupt.wait/Interrupt.acknowledgepath. The admission-check-only productionInterruptgrant (interrupt_grant_source_prod,wait/acknowledge=admission-check-only) is replaced by a fully-programmed cap-waiter MSI-X route on the same NVMe BDF (kernel/src/cap/nvme_admin_completion_wait_ack_proof.rs, table entry 0): itsinitregisters + claims the route underManagerGrantSource, programs the MSI-X table entry mask-first with the kernel-authored(message_address, message_data), attaches it to the device manager, arms the deferred-LAPIC-EOI gate, and unmasks. The admin IDENTIFY is split into two manager-owned verbs:DeviceMmio.brokeredNvmeAdminSubmit(schema@8) authors the SQE and rings the admin SQ tail doorbell (no CQ consumed,completion_consumed=false), andDeviceMmio.brokeredNvmeAdminComplete(schema@9) polls/consumes the admin CQ (the manager-owned CQ status/CID check is preserved), parses VID/SSVID, and advances the admin CQ head doorbell. Both reuse the sharedNvmeBrokeredAdminOpResultschema struct. The handoff state machine is ordered and one-shot:brokeredNvmeAdminSubmit(@8) records the exact live admin SQ/CQ/data slots and generations;Interrupt.waitis admitted once for that submitted state, revalidates those live DMA records, and consumes the wait phase;brokeredNvmeAdminComplete(@9) is admitted only after the wait phase; andInterrupt.acknowledgeis admitted once only after the completion phase has been recorded. Hostile complete-before-wait, ack-before-complete, repeat wait, repeat complete, repeat ack, and submit-then-DMABuffer.freeBufferattempts fail closed before injecting extra dispatch, retiring an extra EOI, or freeing/scrubbing the manager-owned admin pages. Between the two verbs the provider callsInterrupt.wait– which injects exactly one bounded, non-autonomousdevice_interrupt::handle_lapic_deliverydispatch on the bound route (result=nvme-admin-completion-wait-ack-dispatch-consumed,real_interrupt_delivery=kernel-injected-dispatch, delivery count+1, one deferred LAPIC EOI armed) – then after the completion verb callsInterrupt.acknowledgeto retire exactly that deferred EOI (hardware_dispatch_ack_delta=1). The chain is: read-only bind -> reset-onlyCC.EN=0-> manager-owned admin buffer materialization ->brokeredNvmeControllerEnable->brokeredNvmeAdminSubmit(@8) -> admin SQ tail doorbell ->Interrupt.waitwake ->brokeredNvmeAdminComplete(@9) -> admin CQ completion consumed -> admin CQ head doorbell advanced ->Interrupt.acknowledgedeferred EOI retired. OnInterruptcap release the kernel requires exactly one observed dispatch, exactly one observed ack, and the terminal acked handoff state, then runs the masked-no-wake + reassign + stale-handle/stale-token assertion chain and emits exactly one headlinecloudboot-evidence: provider-nvme-admin-completion-wait-ack <token>marker labeledadmin_completion_wake=provider-cap-side-injected device_autonomous_raise=not-claimed. The wake is the same bounded kernel-injected cap-waiter model asmake run-cloud-provider-cap-waiter; this proof does not claim a device-autonomously-raised NVMe MSI-X completion interrupt. Proof:make run-cloud-provider-nvme-admin-completion-wait-ack. Future work (not yet implemented): I/O queue creation,READ/WRITE,BlockDevice/filesystem integration, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic. -
cloud-prod-nvme-io-queue-create-local-proofadds the single I/O queue pair (queue id 1) on top of the admin chain. After the combined poll-based admin IDENTIFY (DeviceMmio.brokeredNvmeAdminIdentify @7, VID0x1b36), the manager authors the two queue-establishing admin commands behind parameterless per-command verbs:DeviceMmio.brokeredNvmeCreateIoCqSubmit(schema@10, opcode0x05, CDW10 = queue id 1 | (queue-size-1)<<16, CDW11 PC=1 IEN=0, PRP1 = manager-owned I/O CQ base page) andDeviceMmio.brokeredNvmeCreateIoSqSubmit(schema@11, opcode0x01, CDW10 = queue id 1 | (queue-size-1)<<16, CDW11 = CQ id 1 | PC<<16, PRP1 = manager-owned I/O SQ base page). The opcode/CDWs are manager-selected; the provider supplies nothing (widening@8with a command-selector parameter was rejected because it would let a provider author arbitrary admin opcodes). Each SUBMIT verb authors the SQE at the next admin SQ index, rings the admin SQ tail doorbell, and records the in-flight create; the completion of each is consumed through the sharedDeviceMmio.brokeredNvmeAdminComplete(@9, now command-aware: it reads the admin CQ entry at the recorded index, checks status/CID, and advances the CQ head doorbell) after one provider-cap-sideInterrupt.wait, and the deferred LAPIC EOI is retired by oneInterrupt.acknowledge. The cap-waiter route (kernel/src/cap/nvme_io_queue_create_proof.rs) drives two bounded kernel-injected dispatch + deferred-EOI cycles – one per create – and its ordered handoff enforces CREATE I/O CQ before CREATE I/O SQ, one create at a time, with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed. The I/O CQ/SQ base pages are manager-owned brokered bounce buffers (parked-pool slots 3/4, userspace slots 4/5); their PRP1 is never exported. OnInterruptcap release the kernel requires both creates completed (CQE status 0), exactly two observed dispatches, two observed acks, the idle terminal handoff, and the masked-no-wake + reassign + stale-handle chain, then emits onecloudboot-evidence: provider-nvme-io-queue-create <token>marker labeledio_queue_create_wake=provider-cap-side-injected device_autonomous_raise=not-claimed io_command=create-only io_read=not-attempted io_sq_doorbell=not-attempted. Proof:make run-cloud-provider-nvme-io-queue-create. Future work (not yet implemented): the I/O SQ tail doorbell (0x1008),READ/WRITE, the I/O data page, the I/O-completion route,BlockDevice/filesystem integration, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic. -
cloud-prod-nvme-io-read-local-proofadds one bounded I/OREAD(LBA 0, 1 block, NSID 1) on top of the live I/O queue pair. After the two CREATE I/O queue commands, the manager authors the entire READ SQE behind two parameterless per-command verbs:DeviceMmio.brokeredNvmeIoReadSubmit(schema@12) writes CDW0 (opcode0x02| command-id<<16), NSID 1, MPTR 0, PRP1 = manager-owned I/O read-data page (parked-pool slot 5), PRP2 0, SLBA 0 (CDW10/CDW11), NLB 0 = “1 block” (CDW12 bits 15:0) at I/O SQ index 0 and rings the I/O SQ tail doorbell (0x1008);DeviceMmio.brokeredNvmeIoReadComplete(schema@13) polls the I/O CQ entry at index 0 for the phase flip, checks status/CID, advances the I/O CQ head doorbell (0x100c), reads the first bytes of the read-data page through the kernel mapping, and surfaces a bounded read-data digest (readDataDigestLo/readDataDigestHi= first 8 bytes,readDataLen= transferred length). The provider supplies no opcode/LBA/PRP/ doorbell (a providerwrite32(0x1008, ...)path was rejected because it would break the no-provider-authored-command discipline; reusing the create/admin verbs was rejected because they are hardwired to the admin SQ/CQ doorbells and ledger). The cap-waiter route (kernel/src/cap/nvme_io_read_proof.rs) drives three bounded kernel-injected dispatch + deferred-EOI cycles – two creates plus one read – with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed; the read-data page is a manager-owned brokered bounce buffer (parked-pool slot 5, userspace slot 6) whose PRP1 is never exported, and the manager reads the completed block bytes only through the kernel mapping. OnInterruptcap release the kernel requires both creates and the read completed (CQE status 0), the verified block bytes (readDataLen > 0and a non-zero digest), exactly three observed dispatches, three observed acks, the idle terminal handoffs, and the masked-no-wake + reassign + stale-handle chain, then emits onecloudboot-evidence: provider-nvme-io-read <token>marker labeledio_read_wake=provider-cap-side-injected device_autonomous_raise=not-claimed io_command=read io_read=completed io_sq_doorbell=performed io_cq_completion=polled-io-cqplusio_read_block_bytes=<digest> read_data_len=512. The local QEMU smoke seeds the NVMe backing file’s first sector with a known 16-byte pattern so the digest proves an actual byte transfer, not merely that a CQE arrived. Proof:make run-cloud-provider-nvme-io-read. The same marker shape also passed live on GCE run1780806087-bf69(make cloudboot-gcp-storage-nvme-io-read-test) against a Persistent Disk NVMe controller withvendor.1ae0,device.001f,live_cloud=gce-persistent-disk, and a 512-byte READ digest prefixeb3c904c494d494e4520200002000000. Future work (not yet implemented): a dedicated I/O-completionInterruptroute on live cloud,WRITE, multi-block/second-LBA reads,BlockDevice/filesystem integration, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud coverage beyond this one GCE PD read. -
cloud-prod-nvme-io-write-local-proofadds one bounded I/OWRITE(LBA 0, 1 block, NSID 1) of a fixed manager pattern on top of the live I/O queue pair, proven durable by reading it back. After the two CREATE I/O queue commands, the manager authors the entire WRITE SQE behind two parameterless per-command verbs:DeviceMmio.brokeredNvmeIoWriteSubmit(schema@14) pre-fills the manager-owned I/O write-data page (parked-pool slot 6, userspace slot 7) with the fixed 16-byte signaturefacefeedcafebabe1122334455667788repeated across the block, then writes CDW0 (opcode0x01| command-id<<16), NSID 1, MPTR 0, PRP1 = that page, PRP2 0, SLBA 0, NLB 0 = “1 block” at I/O SQ index 0 and rings the I/O SQ tail doorbell (0x1008);DeviceMmio.brokeredNvmeIoWriteComplete(schema@15) polls the I/O CQ entry at index 0 for the phase flip, checks status/CID, advances the I/O CQ head doorbell (0x100c), reads the first bytes of the write-data page through the kernel mapping, and surfaces a bounded written-pattern digest (carried in the sharedreadDataDigestLo/readDataDigestHi/readDataLenfields). The landed I/OREAD(@12/@13) is then reused unchanged to read LBA 0 back into the read-data page (slot 5) at the next I/O SQ/CQ index (1, since the WRITE consumed index 0); the provider supplies no opcode/LBA/PRP/pattern/doorbell, and a new schema field was deliberately avoided – the durability match is computed kernel-side by comparing the written-pattern digest with the read-back digest. The cap-waiter route (kernel/src/cap/nvme_io_write_proof.rs) drives four bounded kernel-injected dispatch + deferred-EOI cycles – two creates, one write, one read-back – with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed; the write-data and read-data pages are manager-owned brokered bounce buffers whose PRP1 is never exported. OnInterruptcap release the kernel requires both creates, the write, and the read-back completed (CQE status 0), non-zero digests, exactly four observed dispatches and acks, the idle terminal handoffs, the masked-no-wake + reassign + stale-handle chain, and the read-back digest matching the written pattern, then emits onecloudboot-evidence: provider-nvme-io-write <token>marker labeledio_command=write io_write=completed io_sq_doorbell=performed io_cq_completion=polled-io-cq write_pattern=<digest> write_readback_match=true. The local QEMU smoke seeds the backing file’s first sector with a distinct sentinel so the read-back of the manager pattern proves the WRITE transferred the bytes. Proof:make run-cloud-provider-nvme-io-write. Future work (not yet implemented): a dedicated I/O-completionInterruptroute distinct from the admin/create/read route, multi-block/second-LBA/second-NSID I/O, flush/FUA/DSM,BlockDevice/filesystem integration, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic. -
cloud-prod-nvme-io-second-lba-local-proofgeneralizes the manager-authored data path beyond the hardwiredSLBA 0: it proves LBA addressing actually selects the block by driving three sequential I/O commands on the live queue pair through two new parameterless verbs (DeviceMmio.brokeredNvmeIoSecondLbaSubmit @16/brokeredNvmeIoSecondLbaComplete @17), selected by a kernel-owned phase counter: phase 0 reads LBA 0 (the distinctness baseline, read-data slot 5), phase 1 pre-fills the write-data page (slot 6) with a fixed LBA-1-distinct 16-byte pattern0123456789abcdeffedcba9876543210and writes it to LBA 1 (opcode0x01, CDW10 = SLBA low = 1, CDW11 = SLBA high = 0), and phase 2 reads LBA 1 back. Because the landed@12-@15verbs hardwire SLBA 0 (and their I/O index depends on the io-write feature), this proof implies onlycloud_nvme_io_queue_create_proofand authors its own SQEs at I/O SQ indices 0/1/2; the provider supplies no opcode/LBA/PRP/pattern/doorbell. No schema field was added – the LBA-1 read-back match (LBA-1 read digest == LBA-1 write pattern) and the LBA distinctness (LBA-1 read digest != LBA-0 read digest) are computed kernel-side across the three recorded phase digests. The cap-waiter route (kernel/src/cap/nvme_io_second_lba_proof.rs) drives five bounded kernel-injected dispatch + deferred-EOI cycles (two creates + three I/O phases), with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed. OnInterruptcap release the kernel requires both creates and all three phases completed (CQE status 0), non-zero digests, five observed dispatches and acks, the masked-no-wake + reassign + stale-handle chain,second_lba_readback_match=true, andlba_distinct_from_zero=true, then emits onecloudboot-evidence: provider-nvme-io-second-lba <token>marker labeledio_command=second-lba io_second_lba=1 second_lba_readback_match=true lba_distinct_from_zero=true io_sq_doorbell=performed io_cq_completion=polled-io-cq. The local QEMU smoke seeds the backing file’s first sector with the distinct sentineldeadbeefcafebabe0102030405060708so the LBA-0 read returns content distinct from the LBA-1 pattern. Proof:make run-cloud-provider-nvme-io-second-lba. Future work (not yet implemented): a dedicated I/O-completionInterruptroute, multi-block (NLB > 0) I/O, a third LBA or second namespace, flush/FUA/DSM,BlockDevice/filesystem integration (now unblocked by this LBA parameterization), device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic. -
cloud-prod-nvme-io-multiblock-local-proofgeneralizes the manager-authored data path beyond one logical block: it proves the authored SQE drives a transfer larger than a single block by driving two sequential I/O commands on the live queue pair through two new parameterless verbs (DeviceMmio.brokeredNvmeIoMultiblockSubmit @18/brokeredNvmeIoMultiblockComplete @19), selected by a kernel-owned phase counter: phase 0 pre-fills the write-data page (slot 6) with two distinct 16-byte patterns – block 0 =112233445566778899aabbccddeeff00, block 1 =f0e1d2c3b4a5968778695a4b3c2d1e0f– over 1024 B and writes both blocks to LBA 2 (opcode0x01, NLB = block_count - 1 = 1, CDW10 = 2; PRP1 = slot 6, PRP2 = 0, since the 1024 B transfer fits one 4 KiB page), and phase 1 reads LBA 2 back into the read-data page (slot 5). Because the landed@12-@17verbs hardwire a single block (block_count = 1), this proof implies onlycloud_nvme_io_queue_create_proofand authors its own SQEs at I/O SQ indices 0/1; the provider supplies no opcode/LBA/count/PRP/pattern/doorbell. No schema field was added for the second block’s digest – the existingreadDataDigestLo/Hicarry block 0’s first 8 bytes for userspace, while the per-block match (read block 0 == written pattern-0, read block 1 == written pattern-1) and the block-distinctness (block 0 digest != block 1 digest, proving the second 512 B block actually transferred) are computed kernel-side across the two recorded phase digest pairs and attested in the headline marker. The cap-waiter route (kernel/src/cap/nvme_io_multiblock_proof.rs) drives four bounded kernel-injected dispatch + deferred-EOI cycles (two creates + two I/O phases), with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed. OnInterruptcap release the kernel requires both creates and both phases completed (CQE status 0), non-zero digests, four observed dispatches and acks, the masked-no-wake + reassign + stale-handle chain,multiblock_block0_match=true,multiblock_block1_match=true, andmultiblock_blocks_distinct=true, then emits onecloudboot-evidence: provider-nvme-io-multiblock <token>marker labeledio_command=multiblock io_slba=2 io_nlb=1 io_block_count=2 prp2_zeroed=true multiblock_block0_match=true multiblock_block1_match=true io_sq_doorbell=performed io_cq_completion=polled-io-cq. Proof:make run-cloud-provider-nvme-io-multiblock. Future work (not yet implemented): a dedicated I/O-completionInterruptroute, NLB > 1 requiring a PRP list / second mapped page, a third LBA or second namespace, flush/FUA/DSM, wrapping the brokered READ/WRITE behind a userspace-servedBlockDevicecap (now has both LBA selection and >1-block transfer as prerequisites), device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic. -
cloud-prod-nvme-io-synchronous-poll-read-local-proofcollapses the four-call submit/wait/complete/ack NVMe I/O lifecycle into ONE synchronousCapObject::call– the shapeBlockDevice.readBlocks @0requires. It adds two parameterless single-call verbs (DeviceMmio.brokeredNvmeIoSyncWrite @20/brokeredNvmeIoSyncRead @21), each mirroring the combinedbrokeredNvmeAdminIdentify @7: the manager pre-fills the write-data page (slot 6) with112233445566778899aabbccddeeff00(block 0) andf0e1d2c3b4a5968778695a4b3c2d1e0f(block 1), authors the SQE (WRITE opcode0x01at I/O SQ index 0 / READ opcode0x02at index 1; NSID 1, SLBA 2, NLB = 1 / two 512 B blocks, PRP1 = data page, PRP2 = 0), rings the I/O SQ tail doorbell (0x1008), polls the I/O CQ entry phase bit to completion within a bounded budget, advances the I/O CQ head doorbell (0x100c), and reads block 0/block 1 back – all inside one cap call, with noInterrupt.waiton the I/O data path. The two CREATE I/O queue commands still complete through the cap-waiterInterrupt.wait/acknowledgepath, so the route (kernel/src/cap/nvme_io_sync_read_proof.rs) drives only two bounded kernel-injected dispatch + deferred-EOI cycles. The single-call verbs reportsqDoorbellWritten,cqDoorbellWritten, andcompletionConsumedall true in one result; the block-1 match and block-distinctness are computed kernel-side across the recorded WRITE/READ digest pairs. OnInterruptcap release the kernel requires both creates and both single-call I/O commands completed (CQE status 0), non-zero digests, two observed dispatches and acks, the masked-no-wake + reassign + stale-handle chain,sync_block0_match=true,sync_block1_match=true, andsync_blocks_distinct=true, then emits onecloudboot-evidence: provider-nvme-io-sync-read <token>marker labeledio_command=sync-read io_slba=2 io_nlb=1 io_block_count=2 prp2_zeroed=true sync_block0_match=true sync_block1_match=true sync_blocks_distinct=true io_sq_doorbell=performed io_cq_completion=polled-io-cq-single-call interrupt_wait=not-used. Proof:make run-cloud-provider-nvme-io-sync-read. This closes concern (c) of theBlockDevice-shaped read gap (lifecycle collapse) without touchingBlockDeviceCap/crate::virtio(concern a), the manager-op routing into a generic cap (concern b), a dedicated I/O-completionInterruptroute, or a PRP list (NLB > 1). Future work (not yet implemented): introducing an NVMe-backedBlockDeviceCapwhosereadBlocks @0arm calls this single-call op, areadonly_fs-style consumer over it, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic. -
cloud-prod-nvme-io-sync-read-block-bytes-local-proofsurfaces the full read-back bytes. It adds one verb,DeviceMmio.brokeredNvmeIoSyncReadBytes @22 () -> NvmeBrokeredAdminOpReadBytesResult, that reuses the landed single-call poll-read body (nvme_brokered_io_sync_command) unchanged but returns the entire 1024 B read-back (block 0 ‖ block 1), read through the kernel mapping, as the inlinedata :Datafield of a new narrow result struct – the full-bytes shapeBlockDevice.readBlocks @0 -> (data :Data)requires – instead of folding it to an 8-byte digest. The provider issues@20(WRITE) then@22(READ-bytes) as two synchronous cap calls and compares the returneddatabyte-for-byte to the reconstructed manager-authored page, asserting the two 512 B halves differ; the kernel still attests per-block match and distinctness in the release marker. No host-physical/IOVA address crosses the boundary – only the content bytes the caller already authored. The cap-waiter route (kernel/src/cap/nvme_io_sync_read_bytes_proof.rs) is a clone of the sync-read proof that emits onecloudboot-evidence: provider-nvme-io-sync-read-bytes <token>marker labeledio_command=sync-read-bytes io_slba=2 io_nlb=1 io_block_count=2 read_data_len=1024 data_return=inline-bytes data_block0_match=true data_block1_match=true data_blocks_distinct=true io_cq_completion=polled-io-cq-single-call interrupt_wait=not-used. Proof:make run-cloud-provider-nvme-io-sync-read-bytes. This closes concern (b) of theBlockDevice-shaped read gap (full-bytes return) without touchingBlockDeviceCap/crate::virtio(concern a), arbitrary(startLba, count)parameterization, a dedicated I/O-completionInterruptroute, or a PRP list (NLB > 1). Future work (not yet implemented): an NVMe-backedBlockDeviceCapbackend enum whosereadBlocks @0arm calls this op, arbitrary-LBA routing, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic. -
cloud-prod-nvme-blockdevice-fixed-lba-read-arm-local-proofmakes the NVMe namespace consumable through the SAMEBlockDevice.readBlocks @0interface a filesystem consumer calls (proof-fixed LBA arm). It replacesBlockDeviceCap’s baredevice_index: usizewith aBlockDeviceBackendenum (kernel/src/cap/block_device.rs): the always-builtVirtio { device_index }arm (behavior-identical to today, verified bymake run-storage-fs) and, undercloud_nvme_blockdevice_read_proof, anNvmeBrokered { handle, owner }arm. TheNvmeBrokeredarm’sreadBlocks @0accepts ONLYstartLba == 2 && count == 2(fails closed withBlockDevice.readBlocks NVMe arm fixed to SLBA 2 NLB 1on any other window) and drives the landednvme_brokered_io_sync_read_bytes_op_for_cap(@22) body into a local 1024 B buffer surfaced as inlineDatathrough the sameread_blocks_resultsbuilder the virtio arm uses;writeBlocks/flushfail closed (read-only namespace) andinforeturns the fixed geometry. A new bootstrap grant arm mints theNvmeBrokeredcap bound to the SAME livedevice_mmiohandle/owner the production grant source staged (devicemmio_grant_source_prod::live_handle_for_nvme_blockdevice), so thedevice_mmiogrant must precede theblock_devicegrant in the manifest cap list. The provider drives the full bring-up (reset → enable → IDENTIFY @7 → CREATE I/O CQ @10 → CREATE I/O SQ @11 →@20WRITE) and issues the TERMINAL read throughBlockDevice.readBlocks(2, 2)instead of a rawDeviceMmio @22. The cap-waiter route (kernel/src/cap/nvme_blockdevice_read_proof.rs, a clone of the sync-read-bytes proof) emits onecloudboot-evidence: provider-nvme-blockdevice-read <token>marker labeledread_path=blockdevice-readblocks read_iface=BlockDevice read_method=0 io_slba=2 io_nlb=1 io_block_count=2 read_data_len=1024 data_return=inline-bytes nvme_arm_fixed_lba=true arbitrary_lba=not-supported. Proof:make run-cloud-provider-nvme-blockdevice-read. This closes concern (a) of theBlockDevice-shaped read gap (the same schema method, not the bespoke@22verb), restricted to the proof-fixed window. Future work (not yet implemented): arbitrary(startLba, count)parameterization, NVMe write/flush durability throughBlockDevice, areadonly_fs-style filesystem mounted over the NVMeBlockDevicecap, a dedicated I/O-completionInterruptroute, NLB > 1 with a PRP list, and graduating the NVMe data plane out of the per-proof feature into always-built production. -
cloud-prod-nvme-blockdevice-arbitrary-lba-read-local-proofwidens theNvmeBrokeredarm off the hardwired SLBA 2 / NLB 1:readBlocks @0now honors an ARBITRARY(startLba, count)window (read-only, bounded to one PRP1 page). The shared single-call bodynvme_brokered_io_sync_command(kernel/src/device_manager/stub.rs) gains explicitslba/block_countfields onSyncIoParamsand authors CDW10/CDW11 (SLBA) and CDW12 (NLB) from them instead of the module constants; the transfer length isblock_count * 512, bounded fail-closed to one 4 KiB PRP1 page (block_count <= 8) so PRP2 stays 0. The existing@20/@21/@22callers keep passing the proof-fixed SLBA 2 / count 2, so their behavior is byte-identical (regression:make run-cloud-provider-nvme-io-sync-read-bytes). A new parameterized opnvme_brokered_io_sync_read_window_op_for_cap(handle, owner, slba, count, out_data)rotates the I/O SQ/CQ index off the kernel-side read sequence (window 0 at index 1 after the WRITE at index 0, window 1 at index 2) so each completion is polled at the CQ slot the controller actually writes. Undercloud_nvme_blockdevice_arbitrary_lba_proof(implies and supersedescloud_nvme_blockdevice_read_proof),BlockDeviceCap::nvme_read_blocksadmits any1 <= count <= 8window withstartLba + count <= namespace blocks(the IDENTIFY-derived NSZE reported throughinfo @2– 16 MiB / 512 = 32768 on the QEMU fixture image; see the READ-arm graduation entry), and fails closed with distinct errors forcount == 0/count > 8(... count out of range (1..=8)) and a window past the namespace end (... window past namespace end). The proof (kernel/src/cap/nvme_blockdevice_arbitrary_lba_proof.rs, a clone of the fixed-LBA module) drives the full bring-up plus the@20WRITE (seeding LBA 2 = pattern-0, LBA 3 = pattern-1), then issues TWO distinctreadBlockswindows –readBlocks(0, 1)(zero-filled LBA 0) andreadBlocks(3, 2)(LBA 3 = pattern-1, LBA 4 = zero-filled) – comparing each returneddatabyte-for-byte to the manager-authored content and asserting the two windows return distinct content. It emits onecloudboot-evidence: provider-nvme-blockdevice-arbitrary-lba-read <token>marker labeledarbitrary_lba=supported window0_slba=0 window0_count=1 window1_slba=3 window1_count=2 windows_distinct=true prp_pages=single nvme_arm_fixed_lba=false. Proof:make run-cloud-provider-nvme-blockdevice-arbitrary-lba-read. With this, the NVMe namespace is readable throughBlockDevice.readBlocks @0at the LBA the consumer names. Future work (not yet implemented): NLB spanning more than one PRP1 page (count > 8) with a PRP list, NVMe write/flush durability throughBlockDevice, areadonly_fs-style filesystem mounted over the NVMeBlockDevicecap, a dedicated I/O-completionInterruptroute, and graduating the NVMe data plane out of the per-proof feature into always-built production. -
cloud-prod-nvme-blockdevice-writeblocks-durability-arm-local-proofarms theNvmeBrokeredarm’swriteBlocks @1(read-only until now): it drives the brokered NVMe sync WRITE with the caller-supplied(startLba, count, data)and proves write-then-read-back durability. A new parameterized opnvme_brokered_io_sync_write_window_op_for_cap(handle, owner, slba, count, in_data)(kernel/src/device_manager/stub.rs) mirrors the arbitrary-window READ entry but rotates the I/O index off a kernel-side write sequence (next_write_io_index()=> index 0, before the read-back at index 1). The shared single-call bodynvme_brokered_io_sync_commandgains a third fill mode – awrite_payload: Option<&[u8]>that copies the caller’scount * 512bytes into block 0..count of the manager-owned write-data page (slot 6) through the HHDM mapping before the WRITE SQE is authored – beside the fixedprefill_patternand the readonly_fsseed_imagemodes (both unchanged: regressionsmake run-cloud-provider-nvme-blockdevice-arbitrary-lba-readandmake run-storage-fsstay green). Undercloud_nvme_blockdevice_writeblocks_proof(implies and supersedescloud_nvme_blockdevice_arbitrary_lba_proof),BlockDeviceCap::nvme_write_blocksadmits any1 <= count <= 8window withstartLba + count <= namespace blocksanddata.len() == count * 512, failing closed with distinct errors for zero count, over-capacity, past-namespace-end, and length mismatch;info @2reportsreadOnly = false;flush @3stays fail-closed (a real NVMe FLUSH, opcode0x00, is a distinct verb – see theflush @3capability below). The proof (kernel/src/cap/nvme_blockdevice_writeblocks_proof.rs, a clone of the arbitrary-LBA module) drives the full bring-up thenwriteBlocks(5, 2, data)with a caller-authored, non-zero, two-distinct-block 1024 B payload, followed byreadBlocks(5, 2), comparing the read-back byte-for-byte to the bytes written. It emits onecloudboot-evidence: provider-nvme-blockdevice-writeblocks-durability <token>marker labeledwrite_path=blockdevice-writeblocks write_method=1 write_slba=5 write_count=2 write_data_len=1024 readback_data_len=1024 write_readback_match=true nvme_arm_read_only=false flush=fail-closed prp_pages=single. Proof:make run-cloud-provider-nvme-blockdevice-writeblocks-durability. No schema/binding change (writeBlocks @1andreadBlocks @0round-trip through existing bindings). Future work (not yet implemented): awritable_fs/persistent_storeconsumer mounted over the NVMeBlockDevicewrite arm, a real NVMe FLUSH onflush @3, a dedicated I/O-completionInterruptroute on the data path, and graduating the NVMe data plane out of the per-proof feature into always-built production. -
ddf-nvme-multiprp-blockdevice-window-local-proofextends the sameBlockDevice.writeBlocks @1/readBlocks @0round-trip to a three-page NVMe PRP window. Undercloud_nvme_blockdevice_multiprp_window_proof,BlockDeviceCap::nvme_write_blocksandBlockDeviceCap::nvme_read_blocksacceptcount <= 24for the local proof geometry while the default and older proof builds keep the one-pagecount <= 8bound. The sharednvme_brokered_io_sync_commandbody resolves primary read/write data pages from parked-pool slots 5/6, a manager-owned PRP-list page from slot 7, read extension pages from slots 8/9, and write extension pages from slots 10/11. For thewriteBlocks(5, 24, data)andreadBlocks(5, 24)proof window it authors PRP1 as the primary data page and PRP2 as a PRP-list page containing two little-endian page pointers, matching the NVMe PRP-list subset in NVMe Base Specification 1.4 §4.3. The provider still supplies only inlineDatathrough theBlockDeviceschema and never sees a host physical address, IOVA, PRP1, PRP2, PRP-list page address, SQE byte, doorbell offset, or doorbell value. Requests with zero count, count 25, namespace overflow, or length mismatch fail closed before any I/O SQ doorbell write. The release marker includes full-transfer FNV-1a hashes for the WRITE and read-back records so the kernel-side proof is not limited to the first two 16-byte block digests; the userspace smoke also compares all 12 KiB byte-for-byte. Proof:make run-cloud-provider-nvme-blockdevice-multiprp-window. -
cloud-prod-nvme-blockdevice-flush-local-proofarms theNvmeBrokeredarm’sflush @3(fail-closed until now): it authors a real NVMe FLUSH (NVM command-set opcode0x00, NSID-scoped, no data transfer) through the brokered sync command machinery and proves awriteBlocksthenflushreturns CQE status0and the written block survives the flush. A new parameter-free opnvme_brokered_io_sync_flush_op_for_cap(handle, owner)(kernel/src/device_manager/stub.rs) drives the shared single-call bodynvme_brokered_io_sync_commandwith a FLUSHSyncIoParams(opcode = 0x00,command_id = 8,slba = 0,block_count = 0), rotating the I/O index off a kernel-side flush sequence (next_flush_io_index()=> index 1, after the WRITE at 0 and before the read-back at 2). The shared body learns the FLUSH shape (gated on the opcode): it skips the one-PRP1-page data bound, authors the SQE with NSID only andPRP1 = 0/PRP2 = 0/CDW10..15 = 0(no data page touched), and the WRITE/READ data-bearing path stays byte-identical for non-FLUSH opcodes. Undercloud_nvme_blockdevice_flush_proof(implies and supersedescloud_nvme_blockdevice_writeblocks_proof, the flush proof’s true sibling, so the write/read arms and the whole brokered I/O chain are reused unchanged),BlockDeviceCap::nvme_flushreturns()when the FLUSH was authored + the SQ doorbell rung + the completion consumed + CQE status0, failing closed otherwise. The proof (kernel/src/cap/nvme_blockdevice_flush_proof.rs, a clone of the writeblocks module) drives the full bring-up thenwriteBlocks(5, 2, data),flush(), andreadBlocks(5, 2), comparing the post-flush read-back byte-for-byte to the bytes written. It emits onecloudboot-evidence: provider-nvme-blockdevice-flush <token>marker labeledflush_path=blockdevice-flush flush_method=3 nvme_flush_opcode=0x00 flush_cqe_status=0 write_then_flush_ok=true flush_data_transfer=none prp1=0 prp2=0 write_readback_after_flush_match=true reboot_persistence=deferred durability_proof=flush-completion-only virtio_flush_regression=green. Proof:make run-cloud-provider-nvme-blockdevice-flush. No schema/binding change (flush @3 () -> ()round-trips through existing bindings with its empty params/result). Future work (not yet implemented): an NVMe reboot-persistence pass and crash-consistency where the FLUSH barrier specifically changes the survival outcome (a flushed write surviving a forced poweroff an unflushed one would not), routingFile.sync/ the writable-fs / persistent-store sync through this FLUSH, a dedicated I/O-completionInterruptroute on the data path, NLB>1 spanning multiple PRP pages with a PRP list, and graduating the NVMe data plane out of the per-proof feature into always-built production. -
cloud-prod-nvme-blockdevice-reboot-persistence-local-proofcloses the reboot-persistence gap the flush proof named first: it proves a normally committed- FLUSHED write survives a CLEAN reboot through the same
BlockDeviceinterface, the two-boot analogue ofrun-storage-persiston the NVMe arm. The Makefile recipe (make run-cloud-provider-nvme-blockdevice-reboot-persistence) creates ONEnvme.rawimage and boots the non-qemucloudboot kernel over it TWICE WITHOUT regenerating it between boots. The provider self-selects its boot phase by probing the LBA 5..6 window throughreadBlocks(5, 2) @0– the data window itself is the guard sentinel: boot 1 reads back all-zero (fresh namespace), takes the writer branch, and issueswriteBlocks(5, 2, data) @1+ a realflush() @3(CQE status0); QEMU restarts against the SAME backing file and boot 2 reads back the known payload, takes NO writer branch, and the single read-back verifies persistence. The proof reuses the landedwriteBlocks @1/flush @3/readBlocks @0arms and the brokered sync command machinery unchanged; the only kernel-internal additions are a flat per-boot single-call I/O op log (reworked from the flush proof’s rigid WRITE -> FLUSH -> read-back state machine so the verifier boot can record a READ with no prior WRITE in the same boot), the data-window phase select, and the cross-bootphase=1|2marker labels. The cap-waiter route + headline marker come fromkernel/src/cap/nvme_blockdevice_reboot_persistence_proof.rs(a clone of the flush module undercloud_nvme_blockdevice_reboot_persistence_proof, which implies and supersedescloud_nvme_blockdevice_flush_proof). Each boot emits onecloudboot-evidence: provider-nvme-blockdevice-reboot-persistence <token>marker carrying itsphase(phase=1 ... write_then_flush_ok=true flush_cqe_status=0 boot_role=writer-flushon boot 1;phase=2 ... reboot_persistence_match=true boot_role=verifier durability_proof=clean-reboot-persistenceon boot 2). The reboot-persistence gate is the cross-boot correlation: boot 1’s persisted block digests equal boot 2’s read-back block digests (and both equal the known payload). No schema/binding change. Future work (not yet implemented): crash-consistency where the FLUSH barrier specifically changes the survival outcome under an induced mid-flush crash (the analogue ofrun-storage-writable-recovery), routingFile.sync/ the writable-fs / persistent-store sync through this FLUSH, a dedicated I/O-completionInterruptroute on the data path, NLB>1 spanning multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature into always-built production.
- FLUSHED write survives a CLEAN reboot through the same
-
cloud-prod-nvme-blockdevice-flush-crash-consistency-local-proofcovers the flushed-write-survives half of crash-consistency: it proves a normally committed + FLUSHED write survives a FORCED poweroff (an abruptkill -9of the QEMU process AFTER the flush barrier completed), the NVMeBlockDeviceanalogue ofrun-storage-writable-recovery. The Makefile recipe (make run-cloud-provider-nvme-blockdevice-flush-crash-consistency) creates ONEnvme.rawimage, boots the non-qemucloudboot kernel over it in the BACKGROUND (boot 1: empty namespace -> writer branch ->writeBlocks(5, 2, data) @1- real
flush() @3, CQE status0), watches the kernel log for the bounded arming marker[nvme-blockdevice-flush-crash-consistency] kernel: flushed write armed; awaiting forced poweroff,kill -9s the QEMU PID (the forced poweroff AFTER the flush barrier), then boots a SECOND time over the SAME file WITHOUT regenerating it (boot 2: verifier -> singlereadBlocks(5, 2) @0read-back). The proof reuses the reboot-persistence predecessor’s two-boot phase select, the landedwriteBlocks @1/flush @3/readBlocks @0arms, and the brokered sync command machinery unchanged; the only kernel-internal additions over the predecessor are the phase-1 arm-and-spin window after the flush (on_releaseemits the arming marker and spins forever so the recipe cankill -9at that point) and the forced-poweroff marker labels. The cap-waiter route + headline marker come fromkernel/src/cap/nvme_blockdevice_flush_crash_consistency_proof.rs(a clone of the reboot-persistence module undercloud_nvme_blockdevice_flush_crash_consistency_proof, which implies and supersedescloud_nvme_blockdevice_reboot_persistence_proof). Boot 1 emits onecloudboot-evidence: provider-nvme-blockdevice-flush-crash-consistency <token>marker carryingphase=1 ... write_then_flush_ok=true flush_cqe_status=0 armed_forced_poweroff=true boot_role=writer-flush-armbefore the spin; boot 2 emits one carryingphase=2 ... flush_survives_forced_poweroff=true boot_role=verifier durability_proof=flush-survives-forced-poweroff. The crash-consistency gate is the cross-boot correlation: boot 1’s persisted block digests equal boot 2’s read-back block digests (and both equal the known payload), AND boot 1 reached the arm-and-spin window (was forcibly killed, did not take the verifier branch). No schema/binding change. Scoped honestly: “an unflushed write rolls back” is NOT provable under QEMU’s-device nvmecache=writebackmodel (the host page cache surviveskill -9), so the differential-rollback half is NOT claimed (unflushed_rollback=not-provable-under-qemu-nvme-model). Future work (not yet implemented): a dedicated I/O-completionInterruptroute on the data path, NLB>1 spanning multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature into always-built production. (Both higher-level consumer FLUSH routings are now closed: the writable-fsFile.synchalf bycloud-prod-nvme-consumer-sync-to-flush-local-proofand the persistent-Storeput-commit half bycloud-prod-nvme-persistent-store-sync-to-flush-local-proof, both below.)
- real
-
cloud-prod-nvme-dedicated-io-completion-interrupt-local-proofmoves the NVMeBlockDevice.writeBlocks @1/readBlocks @0data-completion handoff off the synchronous I/O-CQ poll return path and onto a dedicated dataInterruptroute. The spec basis remains NVMe Base Specification 1.4 submission/completion queue doorbells and completion entries (§3 controller registers, §4 queue management and CQ phase/status handling, §6 NVM WRITE / READ commands). The proof keeps table entry 0 for the CREATE I/O CQ/SQ admin completions and adds table entry 1 for the data I/O CQ completions; both routes are kernel-injected cap-waiter MSI-X routes, not a device-autonomous interrupt claim. The implementation entry points arekernel/src/cap/nvme_io_completion_interrupt_proof.rs(init,invoke_wait,invoke_acknowledge,poll_blockdevice_completions,emit_marker),kernel/src/device_manager/stub.rs(nvme_brokered_io_completion_interrupt_submit_op_for_cap,nvme_brokered_io_completion_interrupt_complete_op_for_cap,nvme_io_completion_interrupt_submit_record_buffers_live), andkernel/src/cap/block_device.rs(nvme_interrupt_write_blocks,nvme_interrupt_read_blocks,call_with_context). The manager still authors queue bases and PRP1 from the liveDMAPoolledger, copies the caller’s write payload into the parked write-data page, consumes the I/O CQ entry atInterrupt.acknowledge, advances the I/O CQ head doorbell (0x100c), and posts the deferredBlockDevicecompletion only after the bounded caller CQ has room. The proof (make run-cloud-provider-nvme-io-completion-interrupt) driveswriteBlocks(5, 2, data), waits/acks the data route, observes the deferred write completion, then drivesreadBlocks(5, 2), waits/acks the data route, and receives the read bytes through the standardblock_device::read_blocks_resultsdata field. The headline markercloudboot-evidence: provider-nvme-io-completion-interrupt <token>pinscreate.entry.0,io.entry.1,data_route_distinct_from_create_route=true, four dispatches/four deferred EOIs, both deferredBlockDevicecompletions posted, and a byte-for-byte write/read-back match. Scoped honestly: queue-base and PRP addresses remain hidden (host_physical_user_visible=0,iova_export=disabled-future-only,prp_source=manager-ledger); multi-PRP windows (count > 8), provider-written PRP/SGL/address lanes, live cloud, a second namespace, FUA/DSM, and device-autonomous MSI-X delivery remain future work. -
cloud-prod-readonly-fs-over-nvme-blockdevice-local-proofprovides areadonly_fs-style consumer over the NVMeBlockDevicearm: the read-only filesystem mount reads its sectors through the NVMeBlockDevicecap instead of the kernel-owned virtio-blk free functions.kernel/src/cap/readonly_fs.rsgains aBlockSourceseam abstracting the two reads a mount needs (device geometry + range read). The always-builtVirtiovariant routes to the samecrate::virtiofree functions, somake run-storage-fsstays byte-identical; theNvmevariant (built only undercloud_readonly_fs_over_nvme_proof) reads through a granted NVMe-backedBlockDevice– geometry from the IDENTIFY Namespace claim (see the READ-arm graduation entry), each chunked range read throughnvme_brokered_io_sync_read_window_op_for_cap, one 4 KiB PRP1 page per call. Because the brokered controller is brought up by the userspace provider, the NVMe rootDirectory(granted viaread_only_fs_root) defers its mount-parse to the firstDirectory.open. The proof (kernel/src/cap/readonly_fs_over_nvme_proof.rs, a clone of the arbitrary-LBA module) drives the full bring-up, then seeds a tinyCAPOSRO1image through the repurposed@20op (one manager-baked sector per call: superblock @ LBA 0, entry table @ LBA 1, file data @ LBA 2), mounts the filesystem over the NVMeBlockSource, opens the one seeded file, reads it, and compares the bytes. It emits onecloudboot-evidence: provider-readonly-fs-over-nvme <token>marker labeledread_path=readonly-fs-over-blockdevice fs_format=CAPOSRO1 block_source=nvme-blockdevice file_match=true superblock_via_nvme=true entry_table_via_nvme=true extent_via_nvme=trueafter the kernel verifies each read-back block-0 digest against the baked image. Proof:make run-cloud-provider-readonly-fs-over-nvme. The malformed-image fail-closed paths (bad superblock magic, out-of-range entry-table or file extent) are the unchanged sharedmount_root_inner/parse_entriesvalidation inkernel/src/cap/readonly_fs.rs– theBlockSourceseam swaps only the block-read backend, so the existingMountErrorchecks covered bymake run-storage-fsapply identically over the NVMeBlockSource; the NVMe arm additionally rejects an over-range range read with the arbitrary-LBA arm’s fail-closed error. Future work (not yet implemented): a multi-file directory walk /Directory.listtraversal over NVMe, files whose extents span many one-PRP1-page chunks, NVMe write/flush durability throughBlockDevice(the image is seeded via the manager-owned@20op, notwriteBlocks), a dedicated I/O-completionInterruptroute, and graduating the NVMe data plane and the readonly_fs NVMe mount out of the per-proof feature into always-built production. -
cloud-prod-readonly-fs-over-nvme-multifile-dirwalk-local-proofextends the read-only filesystem to multi-file directories: it lists a directory with more than one entry and reads two distinct files over the NVMeBlockDevicecap, one of which spans multiple 4 KiB chunks. The baked image (kernel/src/cap/readonly_fs_over_nvme_multifile_proof.rs, a clone of the single-file module) grows to 12 sectors – superblock @ LBA 0, a two-record entry table @ LBA 1, a one-sector small file @ LBA 2, and a 9-sector large file @ LBA 3..11 carrying a deterministic position-dependent byte pattern. The largeFile.readcovers nine sectors, so theread_rangechunk loop issues TWOBlockDevice.readBlocks @0calls (an 8-sector chunk @ LBA 3 + a 1-sector chunk @ LBA 11) – the multi-chunk path the single-file arm never exercised. The proof identifies each recorded read by(slba, count)and verifies it byte-for-byte with a per-read FNV-1a-64 over the full transfer (computed indevice_manageralongside the block digests), so a dropped trailing chunk fails closed. Because per-sector seeding (12) plus the filesystem reads (5) issue 17 single I/O commands and the monotonic I/O SQ/CQ index must stay inside one first CQ pass, the build raisesdevice_manager::stub::NVME_IO_QUEUE_DEPTHfrom 8 to 32 (createcdw10=0x001f0001); the change is inert for every other NVMe proof build. It emits onecloudboot-evidence: provider-readonly-fs-over-nvme-multifile <token>marker labeleddir_entry_count=2 file_count=2 files_distinct=true large_file_full_match=true large_file_read_blocks_calls=2 superblock_via_nvme=true entry_table_via_nvme=true extents_via_nvme=true(plus the single-file arm’s discipline labels). Proof:make run-cloud-provider-readonly-fs-over-nvme-multifile; the virtio mount path stays byte-identical (make run-storage-fs). Future work (not yet implemented): NVMe write/flush durability throughBlockDevice, a dedicated I/O-completionInterruptroute on the data path, NLB > 1 spanning multiple PRP pages in a single call, sub-directory trees, and graduating the NVMe data plane and the readonly_fs NVMe mount out of the per-proof feature into always-built production. -
cloud-prod-persistent-store-over-nvme-blockdevice-local-proofprovides a writable consumer over the NVMeBlockDevicewrite arm: the disk-backed persistentStoremounts over the NVMeBlockDevicewrite arm and proves a put-then-get durability round-trip.kernel/src/cap/persistent_store.rsgains a read+writeBlockSourceseam (mirroringreadonly_fs::BlockSourcebut with awrite_blocksmethod). The always-builtVirtiovariant routes to the samecrate::virtiofree functions byte-identically (including thedata_region_base_lba()installable-disk offset, folded into the variant), somake run-storage-persiststays green; theNvmevariant (built only undercloud_persistent_store_over_nvme_proof) reads throughnvme_brokered_io_sync_read_window_op_for_capand writes throughnvme_brokered_io_sync_write_window_op_for_cap, one 4 KiB PRP1 page per call. Because the brokered controller is brought up by the userspace provider, the NVMe rootStore(granted viapersistent_store) defers its mount-parse to the firstStorecall. The proof (kernel/src/cap/persistent_store_over_nvme_proof.rs, a clone of the writeblocks module) drives the full bring-up, then seeds aCAPOSST1superblock + empty entry table through the repurposed@20op (superblock @ LBA 0, entry table @ LBA 1), and exercises the grantedStore:Store.putwrites the data extent (LBA 2), entry-table sector, and superblock throughBlockDevice.writeBlocks @1, andStore.getreads the extent back throughBlockDevice.readBlocks @0. The kernel attests the put WRITE and get READ block-0 digests both equal the payload digest and differ from the pre-put (zero) extent, and userspace compares the returned bytes byte-for-byte. It emits onecloudboot-evidence: provider-persistent-store-over-nvme <token>marker labeledwrite_path=store-put-over-blockdevice-writeblocks read_path=store-get-over-blockdevice-readblocks consumer=persistent-store store_iface=Store block_iface=BlockDevice store_format=CAPOSST1 write_method=1 read_method=0 put_get_roundtrip_match=true durability_attested=true virtio_regression=green. Because the round-trip issues 8 single I/O commands (2 seed WRITEs + 2 deferred-mount READs + 3Store.putWRITEs + 1Store.getREAD) whose last monotonic CQ head reaches 8 – past the default depth-8 first pass – the build raisesdevice_manager::stub::NVME_IO_QUEUE_DEPTHfrom 8 to 16; the change is inert for every other NVMe proof build. Proof:make run-cloud-provider-persistent-store-over-nvme. No schema/binding change (Store.put/get/has/deleteandBlockDevice.writeBlocks @1/readBlocks @0round-trip through existing bindings). Future work (not yet implemented): routing the writable filesystem (CAPOSWF1) over the NVMe write arm, a real NVMe FLUSH onflush @3(stays fail-closed), an NVMe reboot-persistence pass, a dedicated I/O-completionInterruptroute on the data path, NLB > 1 spanning multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature into always-built production. -
cloud-prod-writable-fs-over-nvme-blockdevice-local-proofmounts the full disk-backed writable filesystem over the NVMe arm (kernel/src/cap/writable_fs.rs, theCAPOSWF1node-table tree withmkdir/rename/removeand a fail-closed single-writer policy).writable_fscarries a read+writeBlockSourceseam mirroringpersistent_store’s: theVirtiovariant (built only in the qemu/ installable storage builds) routes to thecrate::virtiofree functions byte-identically (folding thedata_region_base_lba()offset), somake run-storage-writable/make run-storage-writable-recoverystay green; theNvmevariant (built only undercloud_writable_fs_over_nvme_proof) reads throughnvme_brokered_io_sync_read_window_op_for_capand writes throughnvme_brokered_io_sync_write_window_op_for_cap, one 4 KiB PRP1 page per call. Becausewritable_fsuses a process-wide singleton volume, the NVMewritable_fs_rootgrant stages the livedevice_mmiohandle and defers the singleton mount-parse to the firstDirectory/Filecall. The proof (kernel/src/cap/writable_fs_over_nvme_proof.rs, a clone of the persistent-store module that supersedes and drops it) seeds aCAPOSWF1superblock + root + one seeded file through the@20op, contiguously from LBA 256 (superblock @256, node table @257, seeded file extent @258), then exercises the granted filesystem. READ arm: opening the seeded file triggers the deferred mount, which reads the seeded extent (@258) back throughBlockDevice.readBlocks @0;File.readreturns the RAM copy the mount loaded. WRITE arm:File.writeto a fresh file lands a bump-allocated data extent (@259) + node-record + superblock throughBlockDevice.writeBlocks @1. BecauseFile.readserves the RAM content cache the mount loaded (not a fresh disk read), a same-extent disk re-read of a just-written file – which needs a remount/ reboot – is out of scope; the matching block-0 digests prove the same payload traversed both device arms, each device-acked. The single-writer policy is proven intact: aFile.writethrough a second grantedwritable_fs_rootcap fails closed. It emits onecloudboot-evidence: provider-writable-fs-over-nvme <token>marker labeledwrite_path=file-write-over-blockdevice-writeblocks read_path=file-read-over-blockdevice-readblocks consumer=writable-fs fs_iface=Directory file_iface=File block_iface=BlockDevice fs_format=CAPOSWF1 write_method=1 read_method=0 write_read_roundtrip_match=true durability_attested=true single_writer_policy=enforced second_writer_denied=true recovery_over_nvme=deferred virtio_regression=green. The round-trip issues 11 single I/O commands (3 seed WRITEs + 3 deferred-mount READs + 5File.writeWRITEs);NVME_IO_QUEUE_DEPTHstays 16. Proof:make run-cloud-provider-writable-fs-over-nvme. No schema/binding change. Future work (not yet implemented): the unclean-shutdown / forced- poweroff recovery window (recovery_crash_after_record) over the NVMe arm (the analogue ofrun-storage-writable-recovery, proved on virtio here), a real NVMe FLUSH onflush @3, an NVMe reboot-persistence pass, a dedicated I/O-completionInterruptroute on the data path, NLB > 1 spanning multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature into production. -
cloud-prod-writable-fs-over-nvme-recovery-local-proofproves the unclean-shutdown / forced- poweroff RECOVERY window (recovery_crash_after_record, the record-sector-written- but-superblock-not-yet-committed window inkernel/src/cap/writable_fs.rs) over the NVMeBlockDevicearm. A newcloud_writable_fs_over_nvme_recovery_prooffeature implies (and supersedes the happy-path proof module/route/init of)cloud_writable_fs_over_nvme_proofand widens thestorage_writable_recoverycrash-window cfg gate so the samerecovery-orphan.txtsentinel arms an induced forced poweroff when the writable filesystem is NVMe-backed. The recovery cap-waiter module (kernel/src/cap/writable_fs_over_nvme_recovery_proof.rs) reuses the NVMeBlockSourcearm, deferred mount, thirdWritableFsRootgrant arm, window ops, and I/O queue create unchanged. Unlike the happy-path proof, theCAPOSWF1image is HOST-BUILT (tools/mkstore-image --writable-nvmelays an empty superblock + root-only node table) rather than seeded through@20: a two-boot SAME-image recovery flow cannot re-seed on pass 2 without clobbering the pass-1 committed state, and the read ordering does not depend on a per-boot seed here (mirroring the virtiorun-storage-writable-recoveryproof, which also boots a host-built image twice).make run-cloud-provider-writable-fs-over-nvme-recoveryboots QEMU twice with-device nvmeagainst one shared raw drive file: pass 1 commits aFile.write+ sub-directory throughwriteBlocks @1, allocates the sentinel (its record sector lands on the namespace), and spins; the harnesskill -9s QEMU before the superblock commit. Pass 2 boots the SAME file, mounts by reading the old superblock + node table back throughreadBlocks @0, and asserts the recovered tree omits the orphan slot (exactly the committed entries remain), preserves the committed mutation (file size + content), accepts a usable post-recovery write, and denies a second-grant write (single-writer policy). The userspace smoke emits onecloudboot-evidence: provider-writable-fs-over-nvme-recovery <token>marker labeledcrash_window=record-written-superblock-uncommitted orphan_slot_ignored=true committed_mutation_survived=true post_recovery_write_ok=true recovery_over_nvme=true single_writer_policy=enforced durability_basis=host-page-cache real_flush=deferred reboot_persistence=deferred io_completion=polled interrupt_wait=not-used-on-data-path virtio_recovery_regression=green live_cloud=not-attempted; the kernel proof module’son_releaseindependently attests the cap-waiter route lifecycle + the two CREATE I/O queue dispatch/ack cycles. The two CREATE I/O queue commands keep their productionInterrupt.wait/acknowledgecap-waiter cycles; the data path stays polled. Bounded-proof caveat: one record-vs-commit window, host-page-cache durability (the two passes share one backing file; akill -9preserves the host page cache), NOT media crash-consistency and NOT a real NVMe FLUSH barrier. No schema/binding change. Future work (not yet implemented): a real NVMe FLUSH onflush @3, an NVMe clean- reboot-persistence pass, NLB > 1 spanning multiple PRP pages, a dedicated I/O-completionInterruptroute on the data path, and graduating the NVMe data plane into production. Proof:make run-cloud-provider-writable-fs-over-nvme-recovery. -
cloud-prod-nvme-consumer-sync-to-flush-local-proofroutes a consumer-levelFile.sync @4to a realBlockDevice.flush @3NVMe FLUSH media barrier instead of a write-side no-op.writable_fs::BlockSourcecarries aflush()arm (theVirtiovariant returnsOk(())– the driver negotiates noVIRTIO_BLK_F_FLUSH, so virtioFile.syncstays a byte-identical no-op andmake run-storage-writablestays green; theNvmevariant drivesnvme_brokered_io_sync_flush_op_for_capwith the same success predicate the read/write arms apply), andFile.sync @4(writable_fs.rs) routes through it AFTER theclaim_writergate. The featurecloud_nvme_consumer_sync_to_flush_proofcomposescloud_nvme_blockdevice_flush_crash_consistency_proof(arming the realflush @3op-for-cap) andcloud_writable_fs_over_nvme_proof(the consumer arm), dropping both predecessors’ proof modules. The proof (kernel/src/cap/nvme_consumer_sync_to_flush_proof.rs) seeds theCAPOSWF1image, thenDirectory.open(CREATE)+File.write(throughwriteBlocks @1), thenFile.sync()– which issues the real NVMe FLUSH (opcode0x00, CQE status0, no data transfer: PRP1 = 0, PRP2 = 0) – thenFile.readconfirming the bytes survive. The single-writer policy is shown intact: aFile.syncthrough a second granted cap fails closed BEFORE any FLUSH is issued (denied_sync_issues_no_flush=true), and the kernel asserts exactly one consumer-sync FLUSH (status0) was recorded. It emits onecloudboot-evidence: provider-nvme-consumer-sync-to-flush <token>marker labeledconsumer_sync_path=File.sync-to-nvme-flush sync_method=4 flush_method=3 nvme_flush_issued_by_consumer_sync=true nvme_flush_opcode=0x00 flush_cqe_status=0 write_sync_read_roundtrip_match=true single_writer_policy=enforced durability_proof=consumer-sync-issues-real-flush virtio_sync_noop=byte-identical. The round-trip issues 12 single I/O commands (3 seed WRITEs + 3 mount READs + 5File.writeWRITEs + 1File.syncFLUSH);NVME_IO_QUEUE_DEPTHstays 16. Proof:make run-cloud-provider-nvme-consumer-sync-to-flush. No schema/binding change. The bounded claim is consumer-sync-issues-real-flush, NOT a power-loss survival differential (unflushed_rollback=not-provable-under-qemu-nvme-model; the cross-boot forced-poweroff differential stays as the crash-consistency proof established it). Future work (not yet implemented): routing the persistentStore’s commit path through the FLUSH, graduating the NVMe data plane into production, NLB > 1 spanning multiple PRP pages, and a dedicated I/O-completionInterruptroute on the data path. -
cloud-prod-nvme-persistent-store-sync-to-flush-local-proofroutes the persistentStore’s put-commit path to a realBlockDevice.flush @3NVMe FLUSH media barrier. TheStorehas NOsyncschema method (put @0/get @1/has @2/delete @3only), so the routing point is the existing put-commit path, not a new method:persistent_store::BlockSourcegains aflush()arm (theVirtiovariant returnsOk(())– noVIRTIO_BLK_F_FLUSH, so the virtioStore.putcommit stays a byte-identical no-op andmake run-storage-persiststays green; theNvmevariant drivesnvme_brokered_io_sync_flush_op_for_capwith the same success predicate the read/write arms apply, but only when the flush lineage is composed so the plainmake run-cloud-provider-persistent-store-over-nvmecommit stays a no-op), andput_blob(persistent_store.rs) issues it AFTER theflush_superblockwrite (the ordering commit point) succeeds. A FLUSH that fails closed rolls back the in-RAMentry_count/next_free_sectorso no live index insert occurs. The featurecloud_nvme_persistent_store_sync_to_flush_proofcomposes (and drops/supersedes) thecloud_nvme_consumer_sync_to_flush_prooflineage (transitively the realflush @3op and the persistent-store-over-NVMe read+write seam). The proof (kernel/src/cap/nvme_persistent_store_sync_to_flush_proof.rs) seeds theCAPOSST1image, thenStore.put(data)– which writes the data extent / entry sector / superblock throughwriteBlocks @1, then issues the real NVMe FLUSH (opcode0x00, CQE status0, no data transfer: PRP1 = 0, PRP2 = 0) after the superblock commit – thenStore.get(hash)confirming the bytes survive. The kernel asserts exactly oneStore-commit FLUSH (status0) recorded AFTER the superblock write (superblock_commit_before_flush=true) and emits onecloudboot-evidence: provider-nvme-persistent-store-sync-to-flush <token>marker labeledconsumer_commit_path=store-put-to-nvme-flush put_method=0 flush_method=3 nvme_flush_issued_by_store_commit=true nvme_flush_opcode=0x00 flush_cqe_status=0 superblock_commit_before_flush=true put_get_roundtrip_match=true failed_flush_issues_no_live_entry=true virtio_commit_noop=byte-identical durability_proof=store-commit-issues-real-flush. The round-trip issues 9 single I/O commands (2 seed WRITEs + 2 mount READs + 3Store.putWRITEs + 1 put-commit FLUSH + 1Store.getREAD). Proof:make run-cloud-provider-nvme-persistent-store-sync-to-flush. No schema/binding change. The bounded claim is store-commit-issues-real-flush, NOT a power-loss survival differential (unflushed_rollback=not-provable-under-qemu-nvme-model). Future work (not yet implemented): graduating the NVMe data plane out of the per-proof features into always-built production, with the dedicated I/O-completionInterruptroute on the data path and NLB > 1 spanning multiple PRP pages. -
cloud-prod-nvme-sync-io-state-seam-always-built-local-proofextracts the brokered-NVMe synchronous-I/O state the shared op body depends on into ONE always-built moduledevice_manager::nvme_sync_io_state(kernel/src/device_manager/nvme_sync_io_state.rs), compiled in the default no-proofcargo build(not behind anycloud_nvme_*_prooffeature). The seam owns: the functional I/O SQ/CQ reservation cursor (reserve_io_slot()= a queue-depth-bounded first-pass slot reservation, with one live in-flight reservation, so a single-call command cannot reuse a stale CQE before the created queue wraps); the admissions predicate (sync_{read,write,flush}_admitted= bounded-ledger-not-full plus no active reservation); and the orderedSyncIoRecordop-log ledger (record_sync_{read,write,flush},DIGEST_BYTES, op kind implied byrecord.opcode=0x00FLUSH /0x01WRITE /0x02READ). The shared bodynvme_brokered_io_sync_commandand thenvme_brokered_io_sync_{read_window,write_window,flush}_op_for_cap/nvme_brokered_io_sync_read_bytes_op_for_capentries (kernel/src/device_manager/stub.rs) now record/admit/index through this seam instead of the per-proofnvme_io_proofalias; the alias stays only for the genuinely per-proof create/io-phase/seed ledgers (record_create_*,record_io_*,next_io_phase,next_seed_slba,seed_image_sector). All 15 NVMeBlockDeviceproof modules (kernel/src/cap/*_proof.rs) delegate to the seam: each deletes its privateSyncIoRecord/SyncHandoff/record/admit/index copy and reconstructs its release-marker view from the seam’s ordered op-log snapshot, byte-identical. The create-ordering / write-before-read orderings the per-proof harnesses formerly folded into their admit predicates are proof-harness assertions each proof still re-derives in its release marker; the always-built admit keeps only the production-honest bounded-ledger invariant (a real read needs no prior write). Code-location refactor only – no schema/binding change, no new device behavior; defaultcargo buildandcargo build --features qemustay warning-free (the seam carries a module-level#[allow(dead_code)]for its dormant not-yet-activated entry points). This UNBLOCKS the read-arm graduation: the read body’s sync-I/O symbols now resolve in the default build, so the graduation can make the read body always-built and gate activation behind a fail-closed runtime probe while the proof exercises the same always-built seam. Proof: every existing NVMeBlockDeviceproof stays green (make run-cloud-provider-nvme-blockdevice-arbitrary-lba-readand the rest of the chain) andmake run-netis byte-identical.
These brokered capabilities target the no-IOMMU QEMU/GCP lane, where queue-base
and PRP addresses are materialized by the kernel/device manager from the live
ledger. On a direct-remapping/vIOMMU gate the provider-written validator model
(nvme-userspace-bind-and-controller-bringup) applies instead. The PCI
metadata-only discovery summary (pci: nvme metadata ...) that also runs on
make run-pci-nvme is the separate enumeration-evidence surface in
kernel/src/pci.rs.
1. Spec basis
- Device: NVM Express PCI controller. PCI class
0x01(mass storage), subclass0x08(NVM), programming interface0x02(NVM Express). Detected byPciDevice::is_nvme_controller(kernel/src/pci.rs,NVME_CLASS_MASS_STORAGE/NVME_SUBCLASS_NVM/NVME_PROG_IF_NVM_EXPRESS). QEMU instance:-device nvme,drive=...,serial=...on theq35machine. - Authoritative spec: NVM Express Base Specification (NVMe 1.4 / 2.0). The
fields the validator relies on:
- Controller registers
CAP,CC(withCC.ENcontroller enable),AQA,ASQ,ACQ(NVMe Base Spec §3.1 controller register map).ASQ/ACQbase addresses have bits 11:0 reserved → 4 KiB page-aligned. - Submission/completion queue base addresses and the per-queue doorbell registers in the doorbell stride region (§3.1.x, §7.6 queue setup).
- Physical Region Page (PRP) entries PRP1/PRP2 and the PRP List (§4.3): PRP list pages and list-pointer PRP2 entries are page-aligned; a transfer that needs more than one PRP list page chains a further list, which this bounded subset does not follow.
- Controller registers
- Reference driver (optional cross-check): the Linux
drivers/nvme/host/queue-setup and PRP-build paths (nvme_setup_prps,nvme_pci_configure_admin_queue).
2. Wire format (validator-relevant subset)
The validator reads only the device-visible addresses a single doorbell newly
publishes, plus the byte extent of the region each names. It does not decode
command opcodes, data payloads, or completion entries. Scanned items are modeled
by ScanItem (kernel/src/cap/nvme_doorbell_validator.rs).
- Queue-base registers (
ScanItem::QueueBase): theASQ/ACQadmin queue bases (scanned on theCC.EN/ queue-arm write,ScanKind::QueueArm) and the I/OSQ/CQbases. The named region isentries × entry_sizebytes (e.g. an admin SQ isdepth × 64, an admin CQ isdepth × 16). Required alignment: page (4 KiB). - SQ entry PRP pointers (
ScanItem::Prp): for each NVMe command newly made visible by an SQ tail doorbell (ScanKind::SqTailDoorbell), the PRP1 data pointer, the PRP2 data-or-list pointer, and one level of PRP-list indirection.list_depthcounts indirection already followed (0= a PRP carried in the SQE,1= a pointer inside the single PRP list page);list_depth > 1is the out-of-subset deeper-chain case and fails closed (MAX_PRP_LIST_DEPTH). The named region is the transfer length (PRP data) or one page (a PRP list). Required alignment: page (4 KiB).
The scan is on-notify only: the provider may freely write its own mapped DMA pages between doorbells; nothing device-reachable happens until a doorbell rings, which is the single choke point the validator guards. Cost is O(descriptors published by this doorbell).
3. capOS mapping
The validator is the kernel half of the Model B genuine-userspace-driver model
(docs/proposals/nvme-model-b-doorbell-dma-validator.md): the provider writes
the device-visible queue-base and PRP addresses itself, and the kernel validates
them on the doorbell path rather than minting them (Model A, the unchanged
virtio-net TX path).
- Authority gate: the live doorbell-path hook derives the owning provider’s
identity (
OwnerToken) and live grant generation from theDeviceMmiogrant record in the device-manager ownership ledger, never from provider-supplied bytes. Thecfg(qemu)self-test resolves owner/generation from synthetic windows only. DeviceMmio: the validator is invoked from the pre-write step of the NVMe doorbell/queue-arm selected-writeDeviceMmioclaim (kernel/src/cap/device_mmio.rs; the existingnotifyDoorbellpath is the Model A virtio-net claim and does not trigger the validator). The scan completes — accept or reject — before the doorbell write is allowed to take effect, so the device never sees an unvalidated descriptor batch. BAR0 / doorbell pages stay device-uncacheable, NX, capability-scoped.DMAPool/ window descriptor: for a direct-remapping/vIOMMU lane, aDmaWindowcan name the owner’s domain-scoped IOVA range with a live generation, and provider-written values can be checked against that range. On the current no-IOMMU lane, there is no provider-visible non-host-physical device-address namespace; the manager owns the physical bounce pages and must materialize queue-base and PRP/SGL fields itself. Seedocs/dma-isolation-design.md(Provider-Written Addresses And No-IOMMU Brokered Bounce).Interrupt:completion_wakes_waiterenforces the stale-completion gate — a completion wakes a waiter only if its submission scan was accepted and the generation it was validated under is still live; an unvalidated or retired-generation completion does not wake a waiter.- Fail-closed / validation rules (
ScanReject, all reject with no doorbell write and no waiter wake):out-of-window,host-physical,cross-owner-alias,region-overrun,unaligned,deep-prp-chain,stale-generation, andinvalid-region. A doorbell rung after revoke/reset/regrant against a stale generation fails closed even when the byte value would have been in-window for the prior grant. - QEMU-emulable vs hardware-only: the validator mechanism and its hostile-scan
invariants are end-to-end provable in QEMU (
make run-pci-nvme, thenvme: validator ...proof lines). Live controller bring-up over a real NVMe controller — admin/I/O queue creation, IDENTIFY, and a bounded read with the validator gating the real doorbell — is QEMU-emulable too and is covered by the brokered bring-up capabilities (§4-§9), not the validator mechanism itself.
4. Userspace bind (read-only controller bring-up)
nvme-bind-claimed-mmio-read stands up the userspace storage-provider bind
foundation over the existing DDF driver foundation. It binds the controller
read-only – no register write, no DMA submission, and no doorbell – and so
leaves the controller’s existing (firmware-initialized) state untouched.
- Enumerate → claim → BAR0 preseed:
bind_qemu_nvme_controller(kernel/src/pci.rs) runs for the first enumerated NVMe controller after the metadata summary. It preseeds the first decoded memory BAR (BAR0 controller registers) for brokered reads (devicemmio_grant_source::preseed_read32_for_device), then claims the function and parks it underDeviceOwner::ManagerGrantSource. On any staging failure it falls back to thepci: nvme no-authority/no-driverline (fail-closed): a partially-staged authority surface is never advertised. - Grant-source staging: the same device-agnostic
{devicemmio,dmapool,interrupt}_grant_source::init_for_devicethe virtio path uses stage the bootstrap grants against the claimed NVMe handle. The virtio-net-specific provider-notify/doorbell selected-write claim is not staged here — that is the controller-enable path (§6). run-pci-nvme boots with no virtio devices, so the singletons are free for NVMe; run-net / run-ddf-provider-consumer (no-device nvme) keep the virtio bind untouched. - Brokered register reads: the
nvme-bringup-smokeprovider (demos/nvme-bringup-smoke/) holds the manifest-grantedconsole/dmapool/device_mmio/interruptcaps and readsCAP(0x0,0x4),VS(0x8),CC(0x14), andCSTS(0x1c) throughDeviceMmio.read32(the brokered boot-preseeded mapping indevice_manager::read_devicemmio_u32). It proves the bound claim reaches a coherent NVMe BAR0 by requiring a liveCAP(non-zero, non-floating) and a validVSversion (NVMe Base Spec §3.1.1/§3.1.2; QEMU reports 1.4.0), and reports the observedCC.EN/CSTS.RDY(§3.1.5/§3.1.6, bit 0 of each). - Firmware-initialized controller: under QEMU’s SeaBIOS BIOS boot, the
NVMe boot-probe enables the controller (
CC.EN=1,CSTS.RDY=1) before init runs, so the read-only bind observes a live controller. Bringing it to a known reset state (CC.EN=0, waitCSTS.RDY=0) before re-enabling with provider-owned admin queues is the controller-enable path’s responsibility (§6), not this read-only bind. - Proof line: the userspace
[nvme-bringup-smoke] controller-bind ok ... mmio_read=brokered controller_state=firmware-enabledread proof, asserted bytools/qemu-pci-nvme-smoke.sh. The kernel bind line advanced fromcontroller_init=read-only-bindtocontroller_init=reset-capable-bindwhen §5 added theCCselected-write claim to the same grant staging; the read proof itself is unchanged.
5. Userspace controller reset (selected-write CC claim)
nvme-controller-reset-selected-write is the first genuine userspace NVMe
controller-register write: it brings the firmware-enabled controller to a
known reset state. It does not enable the controller, program admin/IO
queue bases, submit DMA, or ring a doorbell – controller enable publishes
admin queue-base addresses and is the validator-gated path in §6.
- NVMe
CCselected-write claim: the NVMe bind stages theDeviceMmiogrant region with a reset-only selected-write claim (DeviceMmioWrite32ClaimProvider::NvmeControllerRegister,device_manager::nvme_controller_register_grant_region→proofs::nvme_controller_register_region), scoped to theCCregister (0x14, NVMe Base Spec §3.1.5).bind_qemu_nvme_controller(kernel/src/pci.rs) now callsdevicemmio_grant_source::init_nvme_controller_for_deviceinstead of the plaininit_for_device, so the single granted cap carries both the brokered read surface and theCCwrite claim. - Value-flexible, scoped: unlike the virtio variants (which pin an
exact
(offset, value)pair), the NVMe claim is offset-scoped and value-flexible only for reset invalidate_devicemmio_write32_claim(kernel/src/device_manager/qemu_full.rs): it admits anyCCwrite whoseCC.EN(bit 0) is clear – the read-modify-write reset – directly, fails closed on rawCC.EN=1writes withdevicemmio-nvme-cc-enable-raw-blocked, and fails closed on any write to a non-CCoffset (unclaimed-register-write). Refused writes perform no MMIO. - Reset sequence: the
nvme-bringup-smokeprovider readsCC, writes it back withCC.ENcleared throughDeviceMmio.write32(the volatile MMIO write indevice_manager::write_devicemmio_u32, resolved through the boot-preseeded BAR0 mapping), and pollsCSTS(§3.1.6) untilCSTS.RDYclears. QEMU clearsCSTS.RDYsynchronously on theCC.EN=1→0write (nvme_ctrl_reset). - No DMA validator involvement: a reset write (
CC.ENclear) publishes no queue-base or PRP addresses, so the Model B on-notify validator (§2/§3) is not invoked. The validator is invoked on the explicit brokered controller-enable queue-arm path (§6). - Proof lines (asserted by
tools/qemu-pci-nvme-smoke.sh):pci: nvme userspace-bind ... controller_init=reset-capable-bind ... cc_selected_write=staged(kernel),[nvme-bringup-smoke] cc-raw-enable-refused ...,[nvme-bringup-smoke] non-cc-write-refused ..., and[nvme-bringup-smoke] controller-reset ok ... csts_rdy_before=1 csts_rdy_after=0 cc_en_after=0 reset_write=performed ...(userspace).
6. Brokered controller enable (no-IOMMU, manager-authored admin queues)
nvme-no-iommu-brokered-controller-enable enables the controller on the
no-IOMMU make run-pci-nvme gate without exporting a host physical (== the
device-visible address on the bounce shape) or raw IOVA to the provider. The
provider invokes the explicit no-parameter
DeviceMmio.brokeredNvmeControllerEnable verb (schema @6); the manager
authors every address-bearing register and the selected CC value from its
live DMA ledger. Raw DeviceMmio.write32(CC, value with CC.EN=1) fails closed
before any MMIO side effect. The earlier
provider-written Model B enable (nvme-userspace-bind-and-controller-bringup)
stays blocked: it would require the provider to author a device-visible
queue-base, which the reviewed iova_export=disabled-future-only discipline
forbids on this gate.
- Brokered admin queue memory: the provider allocates the admin submission
and completion queue pages through the device’s
DMAPoolauthority (DmaPool.allocateBuffer), maps each read-write to fill it, and unmaps. By convention the manager reads the admin SQ from pool slot 0 (NVME_ADMIN_SQ_POOL_SLOT) and the admin CQ from slot 1 (NVME_ADMIN_CQ_POOL_SLOT). The pages stay live in the manager ledger;DmaBuffer.infocontinues to reportdevice_iova=0,iova_export=disabled-future-only,host_physical_user_visible=false. - Manager-authored queue-base registers: the @6 method dispatches through
nvme_brokered_controller_enable_op_for_cap(kernel/src/device_manager/qemu_full.rs), which validates the cap against the live NVMe controller-register claim and then callsnvme_brokered_admin_queue_enable. It resolves the admin SQ/CQ pages (record.attached_dmapools[..].proof_buffers[slot].page), then authorsAQA(0x24, zero-based admin queue sizes),ASQ(0x28/0x2c), andACQ(0x30/0x34) from the ledger page physical addresses vianvme_authored_register_write(volatile writes resolved only through the boot-preseeded BAR0 mapping), and finally performs the manager-selectedCC.EN | IOSQES=6 | IOCQES=4write. No provider-supplied controller bits or address-bearing value reaches the controller. - Validator on the queue-arm path: before any register write the authored
ASQ/ACQ bases are passed through the Model B on-notify DMA validator
(
crate::cap::nvme_doorbell_validator::validate_doorbell_scan,ScanKind::QueueArm). On this path the windows and the scanned items both derive from the same kernel ledger pages, so it is a self-consistency check on the kernel-authored bases: it proves page alignment and in-window containment of the named queue region (entries * entry_size– the admin SQ 128 B / CQ 32 B fit the 4 KiB page). The owner-identity, cross-owner-alias, host-physical, and stale-generation rejections are structurally unreachable here because both sides of each comparison come from the manager (the real authority gate against a stale/foreign page is the live-ledger membership check below); those hostile rejections are exercised by the boundedcfg(qemu)self-test (§3). A reject still fails closed before theCC.ENwrite. - Fail-closed before enable: raw
write32(CC, CC.EN=1)returnsdevicemmio-nvme-cc-enable-raw-blockedbefore any MMIO. The explicit manager-op’s real authority gate is live-ledger membership – an enable request with the admin queue pages unallocated, freed, or in-flight returnsnvme-admin-queues-not-armed(devicemmio-nvme-cc-enable-not-armed) with no MMIO side effect, covering the out-of-order manager operation and the post-free stale re-enable. A validator reject returnsdevicemmio-nvme-cc-enable-validator-reject. - Teardown under live admin queues: reset (
CC.EN=0) quiesces the controller (CSTS.RDYclears) before the admin queue pages are reused;DmaBuffer.freeBufferthen scrubs each page before the frame is freed (page_scrubbed_before_frame_free=true), and a subsequent enable with the queue memory gone fails closed. The enable path submits no admin commands, so there are no live completions or waiters during teardown; the “stale/unvalidated completion does not wake a waiter” property (completion_wakes_waiter) is proven by the boundedcfg(qemu)self-test (§3), not by the live admin-queue teardown. - Proof lines (asserted by
tools/qemu-pci-nvme-smoke.sh):[nvme-bringup-smoke] admin-queue-allocated ...(userspace),nvme: brokered-enable owner=nvme-storage trigger=manager-op admin_sq_slot=0 admin_cq_slot=1 validator=queue-arm scanned_items=2 aqa=0x00070007 cc=0x00460001 asq_authored=true acq_authored=true cc_en_write=performed cc_bits_selected_by=manager queue_base_source=manager-ledger host_physical_user_visible=false ...(kernel),[nvme-bringup-smoke] controller-enable ok ... cc_en_after=1 csts_rdy_after=1 ... brokered_enable_trigger=manager-op ...,[nvme-bringup-smoke] teardown-reset ok ... quiesced=true,[nvme-bringup-smoke] admin-queue-freed ... page_scrubbed_before_frame_free=true, and[nvme-bringup-smoke] stale-enable-refused ... brokered_enable_trigger=manager-op reason=nvme-admin-queues-not-armed(userspace). - Not in scope (this path): I/O queues, read/write commands, cloud
evidence, and host-physical/IOVA export are out of scope for the enable path.
hostile_hardware_isolation=not-claimed; the brokered no-IOMMU enable is not hostile-hardware isolation. One brokeredIDENTIFYadmin command is in §7.
7. Brokered admin command + IDENTIFY (no-IOMMU)
nvme-admin-queue-identify extends the brokered no-IOMMU lane to one admin
command. After the §6 enable, the provider submits a single IDENTIFY
(controller) admin command and consumes its completion from its own mapped admin
CQ. As on the enable path, the manager authors every address-bearing field; the
provider supplies only the non-addressing command dwords and the doorbell index.
nvme-admin-interrupt-delivery then makes that completion interrupt-driven: the
provider unmasks the admin completion interrupt route and blocks on
Interrupt.wait, the kernel wakes the live waiter through the device-interrupt
dispatch path, and only then is the completion consumed from the mapped CQ.
- Admin command (wire subset): a 64-byte submission queue entry (NVMe Base
Spec §4.2). The provider writes opcode
0x06(IDENTIFY, §5.17) at byte 0, a command id at bytes 2:3,NSID=0, andCNS=0x01(Identify Controller) in CDW10 at bytes 40:43 into the mapped admin SQ page. It leaves the address-bearing MPTR (bytes 16:23), PRP1 (bytes 24:31), and PRP2 (bytes 32:39) zero; the manager overwrites them. The IDENTIFY data structure is 4096 bytes, so a single page-aligned PRP1 covers it and PRP2 stays zero. - Doorbells: the same
NvmeControllerRegisterclaim covers the admin SQ tail doorbell (0x1000) and CQ head doorbell (0x1004) – admin queue 0 with doorbell strideCAP.DSTRD=0(NVMe Base Spec §3.1.24/§3.1.25;nvme_brokered_admin_sq_doorbellre-readsCAP.DSTRDand fails closed on a non-zero stride). The doorbell value (the tail/head index) is not address-bearing; the manager bounds it to<= NVME_ADMIN_QUEUE_DEPTHand performs the write. - Manager-authored PRP on the submit path: a write to the SQ tail doorbell
is routed by
validate_devicemmio_write32_claim(NvmeBrokeredWriteOp::AdminSqTailDoorbell) tonvme_brokered_admin_sq_doorbell(kernel/src/device_manager/mod.rs). It resolves the live admin SQ (slot 0), CQ (slot 1), and IDENTIFY data (slot 2,NVME_ADMIN_DATA_POOL_SLOT) ledger pages, authors the SQE’s MPTR/PRP1/PRP2 from the data page physical address through the SQ page’s HHDM mapping (nvme_author_admin_sqe_prp), fences, then rings the doorbell vianvme_authored_register_write. No provider-supplied address reaches the controller. - Validator on the SQ-tail path: before authoring the SQE the data-buffer
PRP1 is passed through the Model B on-notify DMA validator
(
validate_doorbell_scan,ScanKind::SqTailDoorbell,ScanItem::Prp { list_depth: 0 }): page alignment and in-window containment of the 4 KiB data region against the data page’s own device-visible window. A reject returnsdevicemmio-nvme-admin-submit-validator-rejectwith no doorbell write. As on the queue-arm path the window and scanned item are both manager-derived, so the hostile owner/host-physical/stale rejections are exercised by thecfg(qemu)self-test (§3); the live authority gate is live-ledger membership. - Interrupt-driven completion wake (
nvme-admin-interrupt-delivery): after ringing the SQ tail doorbell, the provider unmasks the admin completion interrupt route and blocks onInterrupt.wait. The route is the NVMe controller’s bootstrapInterruptgrant (DeviceOwner::ManagerGrantSource, MSI-X table entry 0, roleGrantSource); the kernel wakes the live waiter with a real LAPIC dispatch routed through the device-interrupt dispatch slot plus the deferred-EOI and waiter path – the same grant-source delivery model proven bymake run-interrupt-grantand used by the virtio-net provider. The wait returnsresult=interrupt-delivered real_interrupt_delivery=delivered wake_blocked=falsewith the route’s dispatchdelivery_countincremented. This is a kernel-injected dispatch at the route’s programmed LAPIC vector, not a device-autonomous MSI-X raise: the NVMe MSI-X table is not yet hardware- programmed for an external write (msix_table_programming=not-written, as on the DDF interrupt-grant path), so device-raised MSI-X delivery and MSI-X table programming remain a documented next increment. - Completion consume: only after the interrupt-driven wake does the provider
read completion queue entry 0 in its mapped admin CQ page (NVMe Base Spec
§4.6): a 16-byte entry whose DW3 (bytes 12:15) carries the command id (bits
15:0), the phase tag (bit 16), and the status field (bits 31:17). It checks the
status field is success and the command id matches, confirms the controller
DMA-wrote the data structure (non-zero PCI Vendor ID at IDENTIFY byte 0; QEMU’s
nvmereports0x1b36), then advances the CQ head doorbell. The completion is thus consumed after an interrupt-driven wake, with the mapped-CQ read as the consume step. - Stale/post-reset no-wake: after the IDENTIFY completes, a second live
Interrupt.waitwaiter is installed on the (driver-unmasked) route, observed to stay pending, and the route is then masked; the live waiter completesresult=interrupt-waiter-cancelled reason=route-maskedrather than woken (masked_live_waiter_woke=false). At the kernel layer the stale/unvalidated/retired-generation completion no-wake invariant (completion_wakes_waiter) remains proven by thecfg(qemu)self-test (§3,waiter_wake=none). - Fail-closed before submit: a SQ tail doorbell with the admin SQ/CQ or data
page unallocated, freed, or in-flight returns
nvme-admin-command-not-armed(devicemmio-nvme-admin-submit-not-armed) with no MMIO side effect; an out-of-range doorbell index returnsdevicemmio-nvme-admin-doorbell-out-of-range. Teardown frees the data page first and proves the post-free re-submit fails closed. - Proof lines (asserted by
tools/qemu-pci-nvme-smoke.sh):nvme: admin-submit owner=nvme-storage admin_sq_slot=0 admin_data_slot=2 validator=sq-tail-doorbell scanned_items=1 command=identify-controller prp1_authored=true ... doorbell_written=performed host_physical_user_visible=false(kernel),[nvme-bringup-smoke] admin-interrupt-route-unmasked ... route_state_after=driver-unmasked,[nvme-bringup-smoke] admin-interrupt-wake result=interrupt-delivered real_interrupt_delivery=delivered wake_blocked=false ... interrupt_driven_wake=delivered(userspace, the interrupt-driven wake),[nvme-bringup-smoke] identify-complete ok command=identify-controller cid=0x0042 status=0x0000 phase=1 ... identify_vid=0x... completion_consumed= mapped-admin-cq-after-interrupt-wake ...(userspace),nvme: admin-complete-ack ... cq_head=1 ... address_bearing=false,[nvme-bringup-smoke] identify-cq-head-advanced ...,[nvme-bringup-smoke] admin-interrupt-stale-no-wake ... result=interrupt-waiter-cancelled ... masked_live_waiter_woke=false, and[nvme-bringup-smoke] stale-submit-refused ... reason=nvme-admin-command-not-armed(userspace). - Not in scope (this path): I/O queue pairs, read/write commands, and the remaining out-of-scope items below are covered in §8.
8. Brokered I/O queue pair + bounded READ (no-IOMMU)
nvme-io-queue-and-read extends the brokered no-IOMMU lane to one I/O queue
pair and one bounded read – the last piece of the userspace NVMe
storage-provider foundation. After the §7 IDENTIFY, the provider creates one
I/O queue pair (queue id 1) through admin commands, then issues one READ on
it. As on every brokered path the manager authors each command’s
address-bearing PRP1 from a live ledger page; the provider supplies only the
non-addressing dwords and the doorbell index.
- I/O queue entry sizes: the re-enable (§6) must program
CC.IOSQES(bits 19:16, log2 of the 64 B SQ entry = 6) andCC.IOCQES(bits 23:20, log2 of the 16 B CQ entry = 4) before any I/O queue is created (NVMe Base Spec §3.1.5); aCC.EN1->0 reset clears all ofCC, so the provider sets them explicitly (resultingCC = 0x00460001). Creating an I/O queue withIOCQES/IOSQESunset is refused by the controller (QEMU returns command-specific Invalid Queue Size). - Create I/O queue commands (wire subset):
CREATE I/O COMPLETION QUEUE(opcode0x05, NVMe Base Spec §5.3) andCREATE I/O SUBMISSION QUEUE(opcode0x01, §5.4) are admin commands submitted on the admin SQ. CDW10 carries the zero-based queue size (bits 31:16) and queue id (bits 15:0); the create-CQ CDW11 sets PC=1 with IEN=0 (no I/O interrupt; the completion is polled), the create-SQ CDW11 sets the completion queue id (bits 31:16) and PC=1. PRP1 (the queue base) is left zero by the provider and authored by the manager from the I/O CQ (slot 3) / I/O SQ (slot 4) ledger page. - Opcode-directed PRP authoring: a write to the admin SQ tail doorbell is
routed to
nvme_brokered_admin_sq_doorbell(kernel/src/device_manager/mod.rs), which reads the opcode the provider wrote into the just-published SQE (at SQ indextail - 1) and maps it to the live ledger page whose device-visible address is that command’s PRP1:IDENTIFY-> IDENTIFY data (slot 2),CREATE I/O CQ-> I/O CQ (slot 3),CREATE I/O SQ-> I/O SQ (slot 4). An unrecognized opcode fails closed (devicemmio-nvme-admin-submit-unknown-opcode). The opcode is the only provider-supplied input consulted and is non-addressing. - READ command (wire subset): opcode
0x02(NVM Command Set §3.x), NSID 1, starting LBA 0 in CDW10/11, NLB 0 (zero-based, one block) in CDW12. PRP1 (the data buffer) is authored by the manager from the read data page (slot 5). - I/O doorbells: the same
NvmeControllerRegisterclaim covers the I/O SQ tail doorbell (0x1008) and I/O CQ head doorbell (0x100c) – queue id 1 with doorbell strideCAP.DSTRD=0(SQytail at0x1000 + (2y)*4, CQyhead at0x1000 + (2y+1)*4). The I/O SQ tail doorbell is routed tonvme_brokered_io_sq_doorbell, which requires the opcode to beREAD, materializes the READ PRP1 from the live read data page, validates it through the Model B on-notify validator (ScanKind::SqTailDoorbell), authors the SQE PRP, and rings the doorbell. The I/O CQ head doorbell (nvme_brokered_io_cq_head_doorbell) carries only the consumed-entry head index (no address-bearing field). - Completion consume: the create-queue and READ completions are consumed by
polling the mapped CQ phase tags. The single kernel grant-source injected
interrupt delivery is spent on the §7 admin IDENTIFY wake
(
delivery_count_before == 0gates injection to one delivery per route), so an interrupt-driven I/O completion wake awaits the device-autonomous MSI-X table-programming increment that §7 already defers. The provider confirms the controller DMA-transferred real data by checking the harness-seeded LBA 0 signature (0x4f504143= “CAPO”) in its mapped read data page – proving the read moved bytes through the brokered PRP, not a zero page. - Fail-closed before submit: an I/O SQ tail doorbell with the I/O CQ/SQ or
read data page unallocated, freed, or in-flight returns
nvme-io-command-not-armed(devicemmio-nvme-io-submit-not-armed) with no MMIO side effect; an out-of-range index returnsdevicemmio-nvme-io-doorbell-out-of-range; an opcode other thanREADreturnsdevicemmio-nvme-io-submit-unknown-opcode. Teardown frees the read data page first and proves the post-free I/O re-submit fails closed. - Proof lines (asserted by
tools/qemu-pci-nvme-smoke.sh):nvme: admin-submit ... command=create-io-cq ... admin_data_slot=3 sq_tail=2,[nvme-bringup-smoke] create-io-cq-complete ok ...,nvme: admin-submit ... command=create-io-sq ... admin_data_slot=4 sq_tail=3,[nvme-bringup-smoke] create-io-sq-complete ok ...,nvme: io-submit owner=nvme-storage io_queue_id=1 io_sq_slot=4 io_read_data_slot=5 ... command=read ... io_sq_tail=1 doorbell_offset=0x1008 doorbell_written=performed host_physical_user_visible=false(kernel),[nvme-bringup-smoke] io-read-complete ok command=read cid=0x0053 status=0x0000 ... io_read_dword0=0x4f504143 ... completion_consumed=mapped-io-cq-polled(userspace, the read data proof),nvme: io-complete-ack ... io_cq_slot=3 cq_head=1 ... address_bearing=false, and[nvme-bringup-smoke] io-stale-submit-refused ... reason=nvme-io-command-not-armed(teardown). - Not in scope: device-autonomous MSI-X delivery (hardware MSI-X table
programming, a device-raised I/O completion interrupt, and an interrupt-driven
I/O completion wake), multi-block / write / scatter-gather (PRP-list) I/O,
cloud (GCP/AWS/Azure) enumeration or evidence, and host-physical/IOVA export
remain out of scope.
hostile_hardware_isolation=not-claimed.
9. Production-path cloudboot proofs (non-qemu cloud kernel)
This section covers the non-qemu cloudboot kernel proofs. The older
cloud-prod-storage-bound-local-proof storage-bind path predates the later
production-stub NVMe manager operations: it binds
DeviceMmio/DMAPool/Interrupt surfaces to one NVMe function and
exercises an interrupt-dispatch proxy, but does not attempt controller enable,
admin commands, I/O queues, IDENTIFY, READ, or a userspace storage provider.
- Older storage-bind proxy:
cap::storage_bind_proof::report(kernel/src/cap/storage_bind_proof.rs) runs under#[cfg(not(feature = "qemu"))]duringkernel::run_init(kernel/src/main.rs). It selects an NVMe function withPciDevice::is_nvme_controller(kernel/src/pci.rs), stages a readbackDeviceMmiorecord throughdevice_manager::stage_bar_readback_region, a parked bounceDMAPool/DMABufferthroughdevice_manager::stage_bounce_buffer_dmapool_recordanddevice_manager::issue_manager_attached_dmabuffer_handle_with_request, and one MSI-X interrupt route throughdevice_interruptplus mask-first PCI MSI-X table programming. The I/O-completion evidence is a kernel-side proxy:device_interrupt::handle_lapic_deliveryadvances the live dispatch slot, deferred EOI is acknowledged, masked no-wake is checked, and teardown proves stale route/pool/buffer/MMIO handles fail closed. Its marker iscloudboot-evidence: storage-bound <token>, with summary fields such asnvme_admin_identify=not-attempted,nvme_read_command=not-attempted, andwaiter_wake=kernel-side-proxy. - Later production-stub manager ops: the non-
qemucloudboot kernel also implements real production-stub NVMe operations for the same local QEMU/cloudboot lane. The read-only bind, reset-onlyCC.EN=0selected-write claim, parked admin SQ/CQ/dataDMABuffermaterialization, brokered controller-enable manager operation (DeviceMmio.brokeredNvmeControllerEnable @6), and brokered adminIDENTIFY Controllermanager operation (DeviceMmio.brokeredNvmeAdminIdentify @7) live inkernel/src/device_manager/stub.rsand the production grant-source modules. These operations are not the storage-bind proxy: the manager authorsAQA/ASQ/ACQ,CC.EN=1, the fixed IDENTIFY SQE, PRP1, SQ tail doorbell, CQ polling, and CQ head doorbell from its parked ledger. The provider still supplies no host-physical address, IOVA, queue base, PRP/SGL, opcode, command id, doorbell offset, or doorbell value. - Local production provider chain: the moved
cloud-prod-nvme-brokered-userspace-provider-local-proofparent is closed by production-stub child records over the non-qemucloudboot kernel. The local QEMU/cloudboot chain reaches split admin completion (@8/@9plusInterrupt.wait/Interrupt.acknowledge), I/O queue creation (@10/@11), bounded READ/WRITE (@12-@15), second-LBA/multiblock I/O (@16-@19), synchronous read/write and read-bytes (@20-@22),BlockDevice.readBlocks/writeBlocks/ FLUSH, higher-level filesystem and Store consumers, a dedicated data-path completionInterruptroute, and multi-PRPBlockDevicewindows. These are manager-authored brokered operations inkernel/src/device_manager/stub.rsand the production proof modules, not provider-authored Model B doorbell writes. - READ-arm graduation to always-built production (
cloud-prod-nvme-storage-graduate-readarm-local-proof): the NVMeBlockDeviceREAD arm is the first capstone piece graduated OUT of the per-proofcloud_nvme_*_prooffeatures into always-built production code. TheBlockDeviceBackend::NvmeBrokeredarm and its arbitrary-windowreadBlocks @0body (kernel/src/cap/block_device.rs,cfg(not(qemu))), the shared read bodynvme_brokered_io_sync_command/nvme_brokered_io_sync_read_window_op_for_capand the brokered controller bring-up registers/helpers it reaches (kernel/src/device_manager/stub.rs), andlive_handle_for_nvme_blockdevicenow compile in the default no-proofcargo build/make capos-cloudboot-imagekernel – the GCE-validated production composition. ACTIVATION is fronted by a fail-closed runtime capability probekernel/src/nvme_storage_backend.rs(dma_backend.rs-style atomic verdict +select_nvme_blockdevice_handle()resolver): the cap is minted only when a stageddevice_mmiogrant resolves a live brokered-controller handle (recording the verdict), else a typed error is returned – never a panic. The no-NVMe default boot leaves the probe unverified, so theblock_devicegrant fails closed.writeBlocks/flushstay fail-closed on the graduated arm (named follow-up graduations). The graduated data plane is bounded, not a general-purpose driver: every command runs through the synchronous single-call seam (kernel/src/device_manager/nvme_sync_io_state.rs), which admits at most 64 single-call I/O commands per boot (MAX_SYNC_OPS) and permanently rejects further commands at the first I/O CQ wrap (no CQ phase-toggle handling) – both limits fail closed. Namespace geometry is IDENTIFY-derived, not assumed: after the fixed three-command bring-up sequence completes, the first geometry consultation issues one manager-authored IDENTIFY Namespace (CNS0x00, NSID 1, admin SQ index 3 / tail 4, NVMe Base Spec §5.17) throughnvme_namespace_geometry_for_cap(kernel/src/device_manager/stub.rs), parses NSZE plus the active LBA format (FLBAS + LBAFLBADS/MS), caches the verdict for the boot, and emitsnvme: brokered-identify-namespace ... nsze=... flbas=... lbads=... supported=....BlockDevice.info @2and thereadonly_fs/persistent_store/writable_fsNVMeBlockSource::inforeport this IDENTIFY-derived geometry, and every read/write window bound is enforced against it; while the claim is unavailable (bring-up incomplete, a failed claim reset the controller, or an unsupported format – anything other than 512 B data blocks with no interleaved metadata) those paths fail closed instead of falling back to a fixture constant. Proof:make run-cloud-provider-nvme-blockdevice-read-graduatedemitscloudboot-evidence: provider-nvme-blockdevice-read-graduated <token>(read_arm=always-built data_plane_feature_gated=false probe_verdict=verified nvme_read_roundtrip_match=true). This is a local QEMU/cloudboot proof; it does NOT claim a live cloud NVMe run, direct DMA, IOVA export, or a write/durability graduation. - Production boundary: one production-stub NVMe path now has live GCE
Persistent Disk evidence:
provider-nvme-io-readcompleted one brokered 512-byte READ on run1780806087-bf69. Other production-stub NVMe proofs remain local QEMU/cloudboot evidence unless their task record explicitly says otherwise. The current evidence still does not claim direct DMA, cloud/guest IOMMU support, provider-visible device addresses, device-autonomous MSI-X delivery, AWS/Azure storage, a reusable storage provider, or full filesystem integration. The NVMeBlockDeviceREAD data plane is graduated to always-built production (above); other write/FLUSH/filesystem consumers and broader windows have local proof coverage but remain bounded by their recorded production proof gates unless their task record explicitly says the surface was graduated.