virtio-net (modern PCI NIC)
This is a provenance map for the in-tree virtio-net driver: it cites the spec, summarizes only the wire-format subset the code actually implements, and points into the implementation. It is not a re-spec – where the spec is implemented unchanged, it links rather than transcribes. The driver is mature and transitional (in-kernel today, slated to move to a userspace network-stack process), so the treatment is a concise map rather than exhaustive register tables.
1. Spec basis
- Device: virtio network device, modern (virtio 1.x) PCI transport.
PCI vendor
0x1af4; device0x1041(modern) /0x1000(transitional). IDs atkernel/src/pci.rs(VIRTIO_VENDOR_ID,VIRTIO_NET_MODERN_DEVICE_ID,VIRTIO_NET_TRANSITIONAL_DEVICE_ID; matched byPciDevice::is_virtio_net). - Authoritative spec: Virtual I/O Device (VIRTIO) Version 1.2, OASIS Committee Specification 01 (2022-07-01). Source: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html. Relevant sections: 4.1 (virtio over PCI bus), 2.7 (split virtqueues), 5.1 (network device).
- Reference: cross-checked against the Linux
virtio_netandvirtio_pci_moderndrivers for the modern-transport handshake and split-ring layout.
2. Wire format (implemented subset)
Only the modern split-ring subset the driver uses is summarized here; feature bits and structures the spec defines but the driver does not specially handle are linked, not transcribed.
- PCI capabilities / BAR layout: virtio modern PCI vendor capabilities
(common / notify / ISR / device / PCI cfg) parsed from the capability list;
type constants
VIRTIO_PCI_CAP_COMMON_CFG/..._NOTIFY_CFG/..._ISR_CFG/..._DEVICE_CFG/..._PCI_CFGand length floorsVIRTIO_PCI_CAP_MIN_LEN/VIRTIO_COMMON_CFG_MIN_LENinkernel/src/virtio.rs; common-config register offsets are thetransport::COMMON_*constants (COMMON_DEVICE_FEATURE,COMMON_QUEUE_SELECT,COMMON_QUEUE_NOTIFY_OFF, …). The notify capability carriesnotify_off_multiplier(ModernTransport::notify_off_multiplier) used to compute per-queue notify addresses. - Split-ring layout: 16-byte descriptors (
transport::VIRTQ_DESC_SIZE), available and used ring offsets, and thetransport::VIRTQ_DESC_F_NEXT/transport::VIRTQ_DESC_F_WRITEflags. Descriptor lifecycle is generation-tracked through a boundedtransport::VIRTQ_DESCRIPTOR_TRACKING_SLOTSslot array (DescriptorTrackingSlot). - Queues: RX queue (
VIRTIO_NET_RX_QUEUE), TX queue (VIRTIO_NET_TX_QUEUE), negotiated to a bounded target size (VIRTIO_NET_QUEUE_TARGET_SIZE); the target size must not exceed the tracking slot count (compile-timeassert!againsttransport::VIRTQ_DESCRIPTOR_TRACKING_SLOTS). - Net header / framing: 12-byte
virtio_netheader prepended to frames (VIRTIO_NET_HDR_LEN); proof TX buffers carry the header plus a minimum Ethernet frame (TX_PROOF_BUFFER_LEN,TX_PROOF_ETHERNET_OFFSET). - Feature negotiation: device/driver feature select/read registers in the
common config; the driver negotiates
transport::VIRTIO_F_VERSION_1/transport::VIRTIO_F_ACCESS_PLATFORM(generic, intransport) plus the net-specificVIRTIO_NET_F_MAC(1 << 5) and acknowledgesVIRTIO_NET_F_MRG_RXBUF(1 << 15).
3. capOS mapping
-
Binding (transitional): virtio-net is currently driven in the kernel. PCI/MSI-X transport discovery, the split-ring transport, smoltcp, TCP listeners, the line discipline, and the Telnet IAC filter live in
kernel/src/virtio.rsandkernel/src/cap/network.rs. This is explicitly transitional: Phase C of the networking proposal (docs/proposals/networking-proposal.md) moves the NIC driver and stack into a userspace network-stack process once the userspace-driver authority gate applies to it. Until then it does not bind through theDeviceMmio/Interrupt/DMAPoolprovider grants the DDF cloud-NIC drivers use; the sections below describe its kernel-owned equivalents. -
MMIO: modern-transport common/notify/ISR/device config regions are mapped from the device BARs and accessed through the
transportMMIO helpers (kernel/src/virtio.rstransportmodule). Doorbell (queue-notify) writes are scoped to the per-queue notify address computed fromnotify_off_multiplier; the DDFDeviceMmiocap (kernel/src/cap/device_mmio.rs) is the userspace successor surface. -
Interrupt: MSI-X vectors are programmed for config and per-queue interrupts; route records and vector dispatch are tracked by the kernel-owned device-interrupt ledger (
kernel/src/device_interrupt.rs). Themake run-netsmoke asserts MSI-X metadata selection, vector-pool/exhaustion policy, masked route lifecycle, queue vector assignment, descriptor guards, ARP, and ICMP. Device-autonomous delivery proofs live in the dedicated userspace-provider MSI-X gates, not in the retired kernel L4 owner. -
DMA: ring pages and TX/RX buffers are allocated and accounted through the net-keyed kernel DMA ledger (
kernel/src/device_dma.rs).make run-netruns without an emulated IOMMU, so DMA uses the intended bounce-buffer fallback; no host physical address or IOVA is exposed beyond the kernel boundary. -
Production cloud build cfg surgery (DMA ledger + DDF caps):
kernel/src/device_dma.rsand the cap surfaceskernel/src/cap/dma_pool.rs(DmaPoolCap/DmaPoolCapInfo) andkernel/src/cap/dma_buffer.rs(DmaBufferCap/DmaBufferCapInfo) compile in the non-qemubuild. Thecloud-prod-dmapool-bounce-buffer-grant-proofwires the first production caller throughkernel/src/cap/dmapool_bounce_buffer_grant_proof.rs: it stages a parked manager-attachedDMAPoolrecord over one DMA-capable PCI function from the inventory (stage_bounce_buffer_dmapool_recordinkernel/src/device_manager/stub.rs), builds aDmaPoolCapover the parked handle, allocates one bounded bounce-bufferDMABufferthroughdevice_manager::issue_manager_attached_dmabuffer_handle_with_request(which routes todevice_dma::allocate_manager_attached_dmapool_bounce_buffer_page), asserts cap-info labels (userspace_dmapool=manager-issued-bounce-buffer,allocation=single-bounce-buffer-page,real_dma=not-attempted,direct_dma=blocked,host_physical_user_visible=0,iova_export=disabled-future-only), thedma_backend::select_and_reportbounce-bufferverdict, quiesce-before- release (release_dmapool_record_for_cap_releasereturnspending-buffer-releasewhile the buffer is live), scrub-before-reuse (the released bounce-buffer frame is zeroed in place before the frame returns to the allocator), and stale-handle-after-detach, then emitscloudboot-evidence: dma-pool-grant <token>for the cloudboot harness. The qemu-only surface that stays gated includes thecap::dmapool_grant_sourcebootstrap source (kernel/src/cap/dmapool_grant_source.rs), theKernelCapSource::DmaPoolgrant arms inkernel/src/cap/mod.rsandkernel/src/cap/process_spawner.rs, theDmaBufferCompleteDescriptorAdmission::provider_cq_eventfield that carriescap::interrupt_grant_source::ProviderCompletionCqEventIdentity, and the entirekernel/src/device_manager/qemu_full.rsDDF backend (includingdevice_dma::{begin_virtio_net_pool, allocate_virtio_net_page, ...}). The proof maps no userspace VMA, programs no real DMA, attaches no queue, programs no interrupt, and emits noprovider-nic-bound/storage-bound; descendants indocs/backlog/hardware-boot-storage.md#cloud-device-trackscover those. -
Fail-closed rules: requested ranges are validated against device-reported geometry and destination buffer length before any device access; descriptor reuse is generation-tracked; the bounded tracking-slot array (
transport::VIRTQ_DESCRIPTOR_TRACKING_SLOTS,DescriptorTrackingSlot) caps in-flight descriptors. Stale/over-range requests fail closed. -
QEMU-emulable vs hardware-only: fully QEMU-emulable. QEMU provides virtio-net-pci;
make run-netis the end-to-end proof. No hardware-only path – this is the local-binding reference the cloud NIC drivers (ENA, MANA, GCP virtio-net) mirror for their QEMU-provable halves. -
GCP cloud-shape classification: GCP 1st/2nd-gen x86 non-Confidential machine families (e.g.
n1-*,e2-*) present the virtual NIC as exactly this standard virtio-net device (vendor0x1af4) under a no-IOMMU / SWIOTLB bounce-buffer DMA backend, so the QEMUvirtio-net-pcibinding is the local precursor for the GCP NIC path. The enumeration path emits avirtio-net: cloud shape classificationproof line (kernel/src/pci.rsreport_cloud_virtio_net_shape) classifying the enumerated function against that documented GCP surface; bothmake run-netandmake run-ddf-provider-consumerassert it conjunctively with the GCP-mapped bounce-bufferdma: backend selectionline (kernel/src/dma_backend.rsselect_and_report). The GCP→bounce-buffer mapping itself is the support-policy expectation recorded indocs/research/cloud-dma-provider-evidence.md. The proof carries explicit scope flags (local_qemu_precursor=true,real_gcp_enumeration=not-claimed,gvnic=separate-driver-out-of-scope); live GCP enumeration and cloud used-ring ownership remaincloud-gcp-virtio-net-nic-driver. -
Production cloud-boot evidence marker (
dma-backend): the production boot path (the kernel built without theqemufeature, which is whatmake capos-cloudboot-imagepackages) emits the parseablecloudboot-evidence: dma-backend <token>serial marker thetools/cloudboot/harness reads (serial_marker_tokens; “Serial evidence-marker contract” intools/cloudboot/README.md). It is emitted bykernel/src/dma_backend.rsselect_and_report(always-compiled, so it fires on the production cloud image, not just theqemusmoke build) alongside the human-readabledma: backend selectionline. The marker uses the harness token namespace (direct_dma/trusted_domain/bounce_buffer), mapped from the resolvedDmaBackendbycloudboot_evidence_token– deliberately not theDmaBackendDisplay string (direct-remapping/bounce-buffer). The current two-variant resolved backend maps todirect_dma/bounce_buffer; on the probed GCE shapes (IOMMU disabled) the value isbounce_buffer. Thetrusted_domainslot has no current producer and is reserved. This marker is honest read-side evidence of the boot-time DMA-backend selection; it asserts no device bind and is independent of the bound-through-authorityprovider-nic-boundmarker, which remains thecloud-gcp-virtio-net-nic-driverclaim. -
Production cloud-boot evidence marker (
device-class): the production boot path also emits the companioncloudboot-evidence: device-class <token>serial marker (one per distinct enumerated PCI base class, harness-deduped viasort -u; “Serial evidence-marker contract” intools/cloudboot/README.md).- Spec basis: PCI base-class codes from the PCI Code and ID Assignment
Specification (PCI-SIG); the base class is the high byte of the
class-code/revision dword at config-space offset
0x08(kernel/src/pci.rsPCI_CLASS_REVISION). - Implemented wire-format subset: a genuinely read-only config-space scan
over the production source resolved from the boot-time MCFG probe: ECAM when
MCFG validates, otherwise legacy I/O.
report_cloudboot_device_class_evidence(kernel/src/pci.rs) walks each bus/device/function viafor_each_enumerated_functionand the read-onlyfunctions_to_scanhelper, reading only the vendor-id, header-type, and class-code (PCI_CLASS_REVISION) words. The base class is the high byte ofPCI_CLASS_REVISION. It deliberately does not callread_device/read_bars, which would perform transient BAR-sizing config writes. Each distinct base class is emitted once in ascending order, formatted{:#04x}(e.g.0x02). The marker is emitted fromkernel/src/main.rsrun_init, so it fires on every build configuration (not only theqemu/diagnosticsPCI-diagnostics path), including the non-qemuproduction cloud image. - capOS mapping: enumeration evidence only – it allocates no
DeviceMmio/Interrupt/DMAPool, claims no device ownership, performs no bus-master enable, BAR mapping, BAR-sizing write, or DMA, and never emitsprovider-nic-bound. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by a
QEMU boot of
target/disk.raw(themake capos-cloudboot-imageproduction image; README “Local boot test”), which shows base classes0x01(storage),0x02(network),0x03(display), and0x06(bridge). No GCE resources are created and nomake cloudboot-testrun is required.
- Spec basis: PCI base-class codes from the PCI Code and ID Assignment
Specification (PCI-SIG); the base class is the high byte of the
class-code/revision dword at config-space offset
-
Production cloud-boot evidence marker (
device-inventory): the production boot path also emits a per-function PCI claim-identity inventory so later bind children discover the real device identity instead of assuming the QEMU-fixed BDF layout the--features qemupath hard-codes. It emits a human-readablepci-inventory:detail line per enumerated function plus the parseablecloudboot-evidence: device-inventory <token>marker (one per function, harness-deduped viasort -u; “Serial evidence-marker contract” intools/cloudboot/README.md).- Spec basis: the PCI Local Bus Specification (PCI-SIG) Type 0 configuration
header — vendor/device ids at offsets
0x00/0x02, the class-code triple (base class / subclass / prog-if) in the high three bytes of the class-code/revision dword at offset0x08, header type at offset0x0e, and interrupt line / pin at offset0x3c(§6.1 “Configuration Space Organization”). BAR registers are not part of this production marker. - Implemented wire-format subset:
report_cloudboot_device_inventory_evidence(kernel/src/pci.rs) walks each bus/device/function viafor_each_enumerated_functionand the read-onlyfunctions_to_scanhelper. For each present function,read_cloudboot_inventory_recordreads only vendor/device, class/subclass/prog-if, revision, header type, interrupt pin, and interrupt line.report_cloudboot_inventory_recordformats one identity token:<seg>.<bus>.<dev>.<fn>-<vendor>.<device>-<class>.<subclass>.<progif>-rev.<rev>-hdr.<hdr>-irq.<pin>.<line>. It is emitted fromkernel/src/main.rsrun_initright after thedevice-classmarkers, on every build configuration including the non-qemuproduction cloud image. - capOS mapping: read-only enumeration evidence. The production marker
performs no BAR-size probe, config-space write, BAR mapping, bus-master /
memory-space / IO-space command-bit enable, doorbell write, DMA, or device
ownership claim, and never emits
provider-nic-bound. The later cloud-NIC bind children consume this inventory to resolve the real PCI function identity instead of the QEMU-fixed BDF fixtures; BAR/MMIO authority is proven by separateDeviceMmioevidence paths. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by a
QEMU boot of
target/disk.raw(themake capos-cloudboot-imageproduction image; README “Local boot test”), which shows the per-functionpci-inventory:lines anddevice-inventorymarkers for the emulated functions (virtio-net1af4, storage, display, bridge). No GCE resources are created and nomake cloudboot-testrun is required.
- Spec basis: the PCI Local Bus Specification (PCI-SIG) Type 0 configuration
header — vendor/device ids at offsets
-
Production-build
device_manager/DeviceMmiocompile surface:kernel/src/device_manager/mod.rsis now always compiled, but it is a thin orchestrator that re-exports a shared subset (error.rs,handles.rs,mmio.rs,types.rs—DeviceManagerError,DeviceMmioHandle/DeviceOwner/PciBdf/DeviceMmioRegion, the MMIO record / map / unmap / read32 / write32 admission types,DeviceMmioCapReleaseOutcome,ProviderNotifyDoorbellWrite) plus a feature-gated implementation: undercfg(feature = "qemu")it routes throughqemu_full.rs(the full DDF surface —dma_buffer.rs/dma_pool.rs/interrupt.rs/proofs.rs, NVMe brokered controller registers, IOMMU domain ledgers, virtio TX/RX ring publication); undercfg(not(feature = "qemu"))it routes throughstub.rs, which now carries a bounded one-slot parked-region path used by the production bar-readback proof (stage_bar_readback_region,validate_devicemmio_record,read_devicemmio_u32,detach_devicemmio_record_for_cap_release,trigger_*_for_devicemmio). The DMA/write/notify/map shims still reportDeviceMmioStaleHandlebecause no production caller exists yet for those; the descendant slices indocs/backlog/hardware-boot-storage.mdun-gate them through the reviewed grant path.kernel/src/cap/device_mmio.rsand itssuper::hardware_audit/super::hardware_release_logaudit hooks are likewise always compiled. TheKernelCapSource::DeviceMmiouser-facing grant arm inkernel/src/cap/mod.rsstayscfg(feature = "qemu")-gated; the production bar-readback proof builds itsDeviceMmioCapfrom boot (cap::devicemmio_bar_readback, see below) without going through that user-facing grant arm. Thecrate::iommumodule and the realkernel/src/virtio.rsstaycfg(feature = "qemu")-gated. Thecrate::device_dmamodule compiles in both builds for the dmapool-grant proof, and thecrate::device_interruptmodule compiles in both builds for the interrupt route/source allocation proof below; theirKernelCapSource::Interruptuser-facing grant arm andinterrupt_grant_sourcebootstrap-grant module inkernel/src/cap/mod.rsstaycfg(feature = "qemu")-gated. -
Production cloud-boot evidence marker (
device-mmio-bar-read): the production boot path also exercises one PCI function’s first memory BAR through the reviewedDeviceMmioCapread32surface and emits a parseablecloudboot-evidence: device-mmio-bar-read <token>marker.- Spec basis: PCI Local Bus Specification (PCI-SIG) Type 0 memory BAR
semantics. The marker carries the function’s BDF, the BAR index, the
32-bit value read at offset 0, and the kernel-mapped window length. The
kernel-side cache policy is device-uncacheable (UC) + NX + GLOBAL +
WRITABLE, matching the existing
mem::paging::map_kernel_mmio_rangecontract for MMIO windows. - Implemented wire-format subset:
cap::devicemmio_bar_readback::report(kernel/src/cap/devicemmio_bar_readback.rs) enumerates PCI functions viapci::enumerate(), picks the first with a memory BAR of at least 4 KiB at a non-zero base, maps the first 4 KiB of that BAR throughmem::paging::map_kernel_mmio_range, stages a parked region throughdevice_manager::stage_bar_readback_region(one slot, mapping generation monotonic), constructs aDeviceMmioCapover the resultingDeviceMmioHandle, and callscap.read32(0). The read goes through the samevalidate_devicemmio_record→ range/alignment check →read_volatilepath as the qemu DDF surface; on the production path the parked region’s recorded kernel virtual address backs the read. The marker token shape is<seg>.<bus>.<dev>.<fn>-b<bar>-<value>-len.<len>(value in 32-bit hex with0xprefix, length in hex bytes), inside the harness grammar[A-Za-z0-9._-]+. - Fail-closed assertions: the proof immediately retries
read32at exactlylengthand assertsrange_result != "ok"(out-of-range read is rejected with no MMIO side effect), then detaches the parked record throughdetach_devicemmio_record_for_cap_releaseand asserts the nextread32(0)fails closed at the device manager (DeviceMmioStaleHandle). Both outcomes are logged on adevicemmio-bar-readback: range_bounding .../stale_generation ...line so a regression trips the boot log alongside the missing marker. - capOS mapping: the mapping is boot-only kernel-half (no userspace VMA
is exposed by this proof); revocation drops the parked slot, which
invalidates the cap-side identity without removing the kernel mapping
itself (the boot-only mapping stays installed for the rest of the boot).
The descendant userspace-driver slices in
docs/backlog/hardware-boot-storage.md#cloud-device-tracksadd the userspace VMA path with TLB shootdown on revoke. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
a QEMU boot of
target/disk.raw(themake capos-cloudboot-imageproduction image; README “Local boot test”), which shows the marker for the emulated virtio function. No GCE resources are created and nomake cloudboot-testrun is required. The qemu build keeps the existingmake run-devicemmio-grantsmoke as the end-to-end DDF proof; the bar-readback caller incap::devicemmio_bar_readbackis gated to the production (non-qemu) build so it does not collide with the qemu DDF surface’s ownDeviceMmioclaim path.
- Spec basis: PCI Local Bus Specification (PCI-SIG) Type 0 memory BAR
semantics. The marker carries the function’s BDF, the BAR index, the
32-bit value read at offset 0, and the kernel-mapped window length. The
kernel-side cache policy is device-uncacheable (UC) + NX + GLOBAL +
WRITABLE, matching the existing
-
Production cloud-boot evidence marker (
interrupt-route-allocated): the production boot path also exercises one PCI function’s MSI-X capability through the revieweddevice_interruptvector pool and emits a parseablecloudboot-evidence: interrupt-route-allocated <token>marker.- Spec basis: PCI Local Bus Specification (PCI-SIG) 3.0 §6.8.2 / PCI Express Base Specification 4.0 §7.7.2.2 MSI-X capability structure. The capability header dword exposes Control (function-mask, table-size-1, enable), the Table BIR/Offset dword exposes the BAR index in the low 3 bits and the byte offset in the upper bits (each table entry is 16 bytes), and the PBA BIR/Offset dword exposes the Pending Bit Array location. The marker carries the function’s BDF, the selected MSI-X table entry, the kernel-pool MSI vector, and the route/source generation pair allocated for the entry. No live MSI-X table write or device interrupt is performed on this path.
- Implemented wire-format subset:
cap::interrupt_route_alloc::report(kernel/src/cap/interrupt_route_alloc.rs) enumerates PCI functions viapci::enumerate(), walks each function’s capability list throughpci::capabilities, parses MSI-X capability fields throughpci::interrupt_capabilities/parse_msix_capability(offset,control,table_size,table_bir,table_offset,pba_bir,pba_offset, both validated through the existing MSI-X region BAR checks), picks the first MSI-X capability withtable_size >= 1, and allocates a kernel-owned MSI vector + interrupt source/route record over its first table entry (SELECTED_TABLE_ENTRY = 0) through the productiondevice_interrupt::register_pci_msix_route_by_bdfvector pool (kernel/src/device_interrupt.rs,lapic::DEVICE_MSI_VECTOR_BASE = 0x50, 16 device-MSI vectors). It thendevice_interrupt::claim_routes the route underDeviceInterruptDriver::ManagerGrantSource. The marker token shape is<seg>.<bus>.<dev>.<fn>-entry.<n>-vector.<hex>-src.<id>.gen.<g>-route.gen.<g>(vector in 2-digit hex, source-id and generations decimal), inside the harness grammar[A-Za-z0-9._-]+. - Fail-closed assertions: the proof asserts three invariants inline
before emitting the marker. (1) Claimed-state visibility:
validate_claimed_routesucceeds for the correctManagerGrantSourceowner and fails closed withWrongOwnerfor a distinctKernelIoApicProofowner – the route is owner-scoped. (2) Duplicate-source rejection: a secondregister_pci_msix_route_by_bdfagainst the same(bdf, table_entry)while the original route is live is rejected withDuplicateSource– the source identity is unique. (3) Stale-after-release:release_claimed_routeclears the slot and a subsequentvalidate_claimed_routeon the same handle fails closed withStaleRoute– no stale handle can re-enter the route table. Each outcome is logged on aninterrupt-route-alloc: claimed_state .../duplicate_source .../stale_after_release ...line so a regression trips the boot log alongside the missing marker. - capOS mapping: route/source-allocation evidence only. The proof
parses the MSI-X capability and consumes one slot from the kernel-owned
device-MSI vector pool, then returns it on release; it does NOT map
the MSI-X table or PBA BAR window, write a table entry, program a
LAPIC dispatch slot for live delivery, raise/handle a device
interrupt, install a waiter, acknowledge an EOI, or exercise
mask/unmask/reset on the live vector. No
provider-nic-boundorstorage-boundmarker. The follow-on live-delivery proof (interrupt-route-delivered) extends this surface; see the next section. Thecap::interrupt_route_alloccaller is gated to the production (non-qemu) build so it does not collide with the qemu DDF surface’s ownInterruptclaim path. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
a QEMU boot of
target/disk.raw(themake capos-cloudboot-imageproduction image; README “Local boot test”), which shows the marker for the emulated virtio function (QEMU’s modern virtio-pci front-end exposes a per-function MSI-X capability). No GCE resources are created and nomake cloudboot-testrun is required. The qemu build keeps the existingmake run-interrupt-grantandmake run-netsmokes as the end-to-end DDF and virtio-net MSI-X proofs.
-
Production cloud-boot evidence marker (
interrupt-route-delivered): the production boot path then extends the route-allocation proof to live MSI-X delivery: it programs the table entry, attaches the route to device manager, arms the deferred-LAPIC-EOI gate, injects one grant- source dispatch, retires the deferred EOI, masks and re-injects to prove no stale wake, reassigns to bump the route generation, asserts the stale handle + stale pending token both fail closed, then releases. Emits onecloudboot-evidence: interrupt-route-delivered <token>marker.- Spec basis: PCI Local Bus Specification 3.0 §6.8.2 / PCI Express
Base Specification 4.0 §7.7.2.2 MSI-X table entry layout (16-byte
entries: 64-bit message address, 32-bit message data, 32-bit vector
control with bit 0 = entry mask) plus the per-spec mask-first write
ordering (the entry mask must be asserted before message address/data
are torn). Intel SDM Vol. 3A §10.8 LAPIC EOI semantics for the
deferred-EOI write retired by
acknowledge_deferred_lapic_eoi_for_routeagainst the LAPIC EOI register (arch::x86_64::lapic::eoi). - Implemented wire-format subset:
cap::interrupt_delivery_proof::report(kernel/src/cap/interrupt_delivery_proof.rs) reusespci::map_msix_tableto map the MSI-X table BAR window kernel-side (UC + NX + GLOBAL + WRITABLE throughmem::paging::map_kernel_mmio_range) andpci::write_msix_table_entryto program entry 0 with the route’smessage_address(fromarch::x86_64::lapic::current_device_msi_delivery) andmessage_data(the allocated kernel-pool vector) under per-spec mask-first ordering. It then attaches the route throughdevice_interrupt::attach_claimed_route_to_device_manager, enables the deferred-LAPIC-EOI gate viadevice_interrupt::enable_deferred_lapic_eoi_for_route, unmasks the route throughdevice_interrupt::unmask_device_manager_attached_routeand the table entry throughpci::set_msix_table_entry_mask, drives one injected dispatch throughdevice_interrupt::handle_lapic_delivery(the same dispatch slot the qemumake run-interrupt-grantproof andnvme-admin-interrupt-deliveryexercise), retires the deferred EOI viadevice_interrupt::acknowledge_deferred_lapic_eoi_for_route, masks both surfaces and re-injects throughdevice_interrupt::record_lapic_delivery, reassigns viadevice_interrupt::reassign_claimed_routeto bump the route generation, and asserts stale-handle / stale-pending-token rejection throughdevice_interrupt::validate_claimed_route/device_interrupt::check_pending_lapic_token. - Fail-closed assertions: five inline assertions gate the marker.
(1) Live delivery:
handle_lapic_deliveryreturns aDelivered { .. }outcome bound to the live route’s(source_id, source_generation, route_generation, owner),delivery_countadvances by 1,eoi_deferred=true, andpending_deferred_eoi_count >= 1. (2) Ordered acknowledge:acknowledge_deferred_lapic_eoi_for_routereportseoi_written=true,ack_delta=1, andpending_after=0– each pending unit retires exactly one LAPIC EOI through the counter- based exclusiondevice_interrupt.rsdocuments atacknowledge_deferred_lapic_eoi_for_route/close_deferred_eoi_gate_and_drain. (3) Masked no-wake: after mask,record_lapic_deliveryreturnsMasked { state: ClaimedMasked, .. }anddelivery_countdoes not advance. (4) Reassign generation bump + stale handle: the prior handle’svalidate_claimed_routereturnsStaleRoute; the stale pending token’scheck_pending_lapic_tokenreportswake_blocked=truewith eitherUnregistered(the live evidence case: reassign’sfirst_available_vectorruns beforeclear_dispatch_slotretires the old vector, so the next pool slot is chosen and the stale token names an unregistered vector) orSourceRouteGenerationMismatch(the single-slot-pool degenerate case where reassign reused the same vector); and a fresh injected dispatch under the reassigned route + vector lands on the new generation while leaving the stale token blocked. (5) Release:release_claimed_routeclears the slot andvalidate_claimed_routeon the reassigned handle now fails closed withStaleRoute. Each outcome is logged oninterrupt-delivery: live_delivery .../ordered_acknowledge .../masked_no_wake .../reassign_stale .../release ...lines so a regression trips the boot log alongside the missing marker. - capOS mapping: route/source allocation + live delivery + ordered
acknowledge + mask/unmask + reset/reassignment + stale-route-generation
rejection, all on the production cloud kernel. The MSI-X table entry
is programmed but the PCI function-level
MSIX_CONTROL_ENABLEbit is intentionally NOT toggled (the proof never enables MSI-X on the function, so no real device-autonomous interrupt can fire on the programmed entry); the proof exits with the table entry re-masked. There is no userspaceInterruptwaiter on the production cloud kernel yet, so the proof’s “waiter wake” boundary is the kernel-side dispatch slot a real provider waiter would consume — the marker reportswaiter_wake=kernel-side-proxyrather than overclaiming a provider-cap-side wake. Noprovider-nic-boundorstorage-boundmarker. Thecap::interrupt_delivery_proofcaller is gated to the production (non-qemu) build so it does not collide with the qemu DDF surface’s ownInterruptclaim path; the qemu build keepsmake run-interrupt-grantas the broader end-to-end exercise of the interrupt grant surface with the full DDF backend. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by a QEMU boot of
target/disk.raw(themake capos-cloudboot-imageproduction image; README “Local boot test”), which shows the marker for the emulated virtio function. No GCE resources are created and nomake cloudboot-testrun is required. PBA handling is recorded by including thepba_birandpba_offsetfromMsixCapabilityInfoin the proof’sokline; the kernel does not read or clear PBA bits (devices set them, and this proof never enables the function so no PBA bit can be set in practice).
- Spec basis: PCI Local Bus Specification 3.0 §6.8.2 / PCI Express
Base Specification 4.0 §7.7.2.2 MSI-X table entry layout (16-byte
entries: 64-bit message address, 32-bit message data, 32-bit vector
control with bit 0 = entry mask) plus the per-spec mask-first write
ordering (the entry mask must be asserted before message address/data
are torn). Intel SDM Vol. 3A §10.8 LAPIC EOI semantics for the
deferred-EOI write retired by
-
Production cloud-boot evidence marker (
provider-nic-bound): the gate the billable GCE run consumes throughtools/cloudboot/NIC_PROOF_MARKER/--require-provider-nic-proof. It is sourced from real userspace driver progress: the marker fires only after the always-built polled virtio-net providercap::virtio_net_polled_providerhas completed a TX+RX over the live function and observed the RX completion by polling the latched used ring (zero kernel-injected interrupts). The predecessor staged its ownDeviceMmioDMAPool/DMABuffer+ MSI-XInterruptgrant surfaces at boot and proved the “queue-completion handoff” by callingdevice_interrupt::handle_lapic_delivery— a kernel-side dispatch-slot proxy (theinject_real_lapic_int_for_proofprecedent). That proxy is removed as the source of the gate:cap::provider_nic_bind_proof::reportnow runs once at boot and emits no marker (it records the deferral to the real provider’s completion); the marker is emitted later fromcap::provider_nic_bind_proof::report_real_completion, called from the provider’s release-time completion path.
- Spec basis: virtio 1.2 §2.7 split-ring used-ring semantics (the device
writes a used element; the driver observes
used.idxadvance) — the completion the provider polls; virtio 1.2 §5.1.6 virtio-net receiveq frame layout (12-byte modern header + ethernet frame) for the EtherType read-back; inherited MSI-X table layout / mask-first ordering (PCI 3.0 §6.8.2) only for the release-time route assertion chain, which never delivers an interrupt on the completion path. - Implemented wire-format subset:
cap::virtio_net_polled_provider(staged when the booted manifest declares thecloud-provider-nic-bound-real-polled-driver-smokebinary) drives the modern virtio status sequence toDRIVER_OK, materializes the RX virtqueue (queue 0) + TX stimulus virtqueue (queue 1), holds the PCI function-level MSI-X enable mask-first, maps the notify region, and programs the RX MSI-X route over table entry 0 (used only by the release-time assertion chain). Itsattempt_rx_submit(admitted from the userspaceDMABuffer.submitDescriptor(queue=0)) publishes the RX descriptor (VIRTQ_DESC_F_WRITE), drives the ARP TX stimulus, polls the latched used ring for the one real device->host RX DMA, and resets the device; itsinvoke_waitreads the latchedPublishedRxwithdelivery_countunchanged.report_real_completionthen sources theprovider-nic-boundtoken from thatPublishedRx(used.idx,used[0].id,used[0].len, observed EtherType) plus the picked function identity. - Fail-closed assertions:
report_real_completionre-asserts the real RX completion facts independently of the provider’s own gate before the marker is emitted. (1) Real device->host RX DMA: the latched used ring advanced exactly once (used.idx == 1), the completion is the posted descriptor (used[0].id == 0), the device wrote a non-empty frame (used[0].len > 0), and the provider read back a non-zero EtherType. (2) Polled, not injected: the provider’sInterrupt.waitadvanced no kernel dispatch (provider_observed_dispatch == 0) and retired no deferred LAPIC EOI (provider_observed_ack == 0). On any regression aprovider-nic-bind: real-completion regression (no marker): ...line trips the boot log and no marker is emitted, soprovider.json’sprovider_nic_proofstaysnulland--require-provider-nic-prooffails closed. - capOS mapping: the marker is now backed by real userspace driver progress
on the production cloud kernel. It carries the real-provider labels
waiter_wake=polled-used-ring,rx_completion=polled-used-ring,int_injected=0,userspace_driver_authority=present-real-polled-provider,virtio_common_config_write=performed,provider_tx_rx=completed,device_autonomous_raise=not-claimed,host_physical_user_visible=0,direct_dma=blocked,iova_export=disabled-future-only, andlive_cloud=not-attempted— never the predecessor’swaiter_wake=kernel-side-proxy/userspace_driver_authority=absent-on-non-qemu. The RXqueue_msix_vectorstaysVIRTIO_MSI_NO_VECTORand the PCI function mask stays held, so the device cannot autonomously raise the MSI either; the completion stays polled. The literalsystem.cuefold (so a plain default cloudboot also emitsprovider-nic-boundfrom real progress without a focused manifest) is not yet implemented, to avoid perturbing themake runinteractive shell/login boot; device-autonomous MSI-X is the parallel future workcloud-prod-virtio-net-rx-device-autonomous-msix-raise-local-proof. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-provider-nic-bound-real-polled-driveron the default non-qemukernel with nocloud_*_prooffeature, on themake run-netdevice shape. No GCE resources are created;live_cloud=not-attempted.
-
Production cloud-boot evidence marker (
virtio-net-device-bringup): the production boot path, under the focused-proof Cargo featurecloud_virtio_net_device_bringup_proof, drives a bounded virtio-net device bringup sequence kernel-side over the same virtio function theprovider-nic-boundproof maps – but writes the virtio common-configuration status register (whichprovider-nic-boundnever does). It is the first device-activation step toward the still-blockedcloud-gcp-virtio-net-nic-drivertrack. Emits onecloudboot-evidence: virtio-net-device-bringup <token>marker on thetools/cloudboot/harness’s serial-port-1 path throughmake run-cloud-provider-virtio-net-bringup.- Spec basis: virtio 1.2 §3.1.1 device initialization (reset, ACKNOWLEDGE, DRIVER, feature discovery + driver-feature select, FEATURES_OK re-read, DRIVER_OK), §4.1 (modern virtio over PCI: common / notify / ISR / device / PCI-cfg capabilities, common-config register layout from Table 4.1).
- Implemented wire-format subset:
cap::virtio_net_device_bringup_proof::report(kernel/src/cap/virtio_net_device_bringup_proof.rs) picks the virtio-net PCI function (vendorVIRTIO_VENDOR_ID = 0x1af4, deviceVIRTIO_NET_TRANSITIONAL_DEVICE_ID = 0x1000/VIRTIO_NET_MODERN_DEVICE_ID = 0x1041) frompci::enumerate, walks the modern virtio PCI vendor-capability chain throughvirtio_transport::parse_modern_pci_transport_capabilities, maps the resolved common-configuration region throughpci::map_bar_region(UC + NX + GLOBAL + WRITABLE – same flags as the BAR-readback path), and drives the bringup using the sharedMmioRegionaccessors plusvirtio_transport::{read_device_features, write_driver_features, STATUS_ACKNOWLEDGE, STATUS_DRIVER, STATUS_FEATURES_OK, STATUS_DRIVER_OK, STATUS_FAILED, VIRTIO_F_VERSION_1, COMMON_NUM_QUEUES}. The selected driver feature word is exactlyVIRTIO_F_VERSION_1; no other device- or net-specific bit is accepted, so the proof never crosses into the queue-setup or descriptor surface the userspace virtio-net provider will own. - Fail-closed assertions: four inline assertions gate the marker.
(1) Negotiated feature set: the device’s offered 64-bit feature word
advertises
VIRTIO_F_VERSION_1, the written driver feature word equals exactlydevice_features & VIRTIO_F_VERSION_1. (2) Queue count visibility: the liveCOMMON_NUM_QUEUESread returns>= 2(virtio-net always exposes RX + TX virtqueues, which this proof does not publish). (3) DRIVER_OK observation: the post-DRIVER_OKstatus read carriesSTATUS_ACKNOWLEDGE | STATUS_DRIVER | STATUS_FEATURES_OK | STATUS_DRIVER_OKset withSTATUS_FAILEDclear. (4) Final reset: a write of0todevice_statusreads back as0. The proof wraps the status sequence so every exit path (success or any intermediate failure) writes0todevice_statusbefore returning, leaving the device in its post-reset state regardless of outcome. Per-stage outcomes log onvirtio-net-device-bringup: ok .../virtio-net-device-bringup: ... failed closed: ...lines so a regression trips the boot log alongside the missing marker. - capOS mapping: focused-proof child of
provider-nic-boundthat extends the proven bind composition with virtio’s status sequence, kernel-side, over the same mapped BAR. The PCI function-levelMSIX_CONTROL_ENABLEbit stays untoggled, no queue is published, no descriptor is written, no doorbell is rung, and no userspace virtio-net provider cap is issued. The marker’s trailing labels (queue_setup=not-attempted,tx_descriptor=not-published,userspace_cap=not-issued,msix_function_enable=not-toggled,device_autonomous_raise=not-attempted,live_cloud=not-attempted) re-anchor those bounds. Queue setup, descriptor publication, doorbell writes, and a userspace virtio-net provider on the production cloud boot manifest stay deferred to the still-blockedcloud-gcp-virtio-net-nic-drivertrack. Thecap::virtio_net_device_bringup_proofcaller is gated tocfg(all(not(feature = "qemu"), not(feature = "cloud_provider_cap_waiter_proof"), feature = "cloud_virtio_net_device_bringup_proof")); the qemu build keepsmake run-net/make run-ddf-provider-consumeras the end-to-end exercise of the same surface with the full driver. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by
make run-cloud-provider-virtio-net-bringup, which boots the focused-proof cloudboot kernel + manifest under QEMU and asserts the marker on serial. No GCE resources are created.
-
Production cloud-boot evidence marker (
nic-driver-userspace-features-ok): the userspace virtio handshake step of the Phase C NIC-driver relocation track. Under the focused-proof Cargo featurecloud_virtio_net_userspace_features_ok_proof, thecap::devicemmio_grant_source_prodsource stages the picked virtio-net function’s modern virtio common-configuration window (resolved throughvirtio_transport::parse_modern_pci_transport_capabilities, mapped at the region’s first byte) as a writable selected-writeDeviceMmiogrant (stage_virtio_net_common_config). The userspacecloud-prod-nic-driver-userspace-features-ok-smokeservice then drives the virtio device handshake from userspace – the authority delta from the kernel-sidevirtio-net-device-bringupproof, which drives the same sequence in the kernel.- Authority delta: the handshake registers move from kernel-internal MMIO
to a userspace driver over the existing
DeviceMmio.read32/write32path. The write admission (device_manager::stub::write_devicemmio_u32under the feature) admits exactly four common-config registers –device_feature_select(0x00),driver_feature_select(0x08),driver_feature(0x0C), anddevice_status(0x14, written as a single byte) – each range-checked against the decoded BAR and kernel read-back-asserted (feature-register read-backs must echo the written value;device_statusis left to the driver’s own re-read since the device may legitimately diverge). This is the same selected-write + range-check + read-back discipline the notify doorbell (notifyDoorbell @5) and the NVMeCCreset write (cloud_nvme_controller_reset_proof) already enforce – not a new write primitive. - Fail-closed assertions: the shim drives reset -> ACKNOWLEDGE -> DRIVER ->
read device features -> write the negotiated driver features
(VIRTIO_F_VERSION_1 only) -> FEATURES_OK, re-reading
device_statusto confirm FEATURES_OK stuck, then proves aqueue_desc(0x20) write fails closed (result=write-blocked register_write=blocked). The released cap fails closed on the next call. The headlinecloudboot-evidence: nic-driver-userspace-features-ok <token>marker lands only after every assertion passes. - capOS mapping: the handshake step of the
Phase C userspace NIC driver relocation.
Queue/vring and IRQ ownership stay kernel-owned: queue-address registers
fail closed, so no buffer address is ever programmed (the userspace-ownable
vring over the DMA-isolation track is the next capability below). The marker’s
trailing labels (
handshake=features-ok,queue_setup=not-attempted,queue_address_write=blocked,vring=not-owned,irq=not-owned,driver_ok=not-attempted,live_cloud=not-attempted) re-anchor those bounds. The feature is mutually exclusive withqemu,cloud_provider_cap_waiter_proof, andcloud_virtio_net_device_bringup_proof. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-prod-nic-driver-userspace-features-ok. No GCE resources are created.
- Authority delta: the handshake registers move from kernel-internal MMIO
to a userspace driver over the existing
-
Production cloud-boot evidence marker (
nic-driver-userspace-ownable-vring): the userspace-owned vring step of the Phase C NIC-driver relocation track. Under the focused-proof Cargo featurecloud_virtio_net_userspace_ownable_vring_proof(which implies the handshake featurecloud_virtio_net_userspace_features_ok_proof), thecap::devicemmio_grant_source_prodsource stages the writable common-config window andcap::dmapool_grant_source_prodstages a bounce-bufferDMAPoolon the same virtio-net function. The userspacecloud-prod-nic-driver-userspace-ownable-vring-smokeservice drives the handshake to FEATURES_OK, then allocates and owns its own virtqueue rings.- Authority delta: the queue-address-class registers move from
kernel-internal MMIO (the
virtio-net-tx-queue-materializationproof programs them in the kernel) to a userspace driver over the sameDeviceMmio.write32path. The write admission (device_manager::stub::write_devicemmio_u32under the feature,admit_virtio_queue_address_write) admitsqueue_select(0x16) andqueue_size(0x18) as range-checked pass-through selected writes, and the 64-bitqueue_desc(0x20) /queue_driver(0x28) /queue_device(0x30) base registers via a token-resolve selected write: the driver writes the opaque per-buffer device-usable handle it learned fromDMABuffer.info(deviceIova, scopebounce-handle), and the kernel resolves it against the liveDMAPoolgrant ledger (resolve_virtio_vring_device_address) to the real bounce host-physical address, programs that address (never the handle, never an address the driver authored), and read-back-asserts. Reads of the queue-address base registers (0x20..0x38) are refused inread_devicemmio_u32, so the resolved host-physical address is never exposed to userspace (host_physical_user_visiblestays 0).queue_enablestays fail-closed (it is armed by the queue-enable/DRIVER_OK capability below). - Reuses landed DMA isolation: the ring pages are manager-owned
DMAPoolbounce buffers under the landed scrub-before-free / owner+slot generation / quiesce-before-release discipline (kernel/src/device_dma.rs); the no-host-physical-exposure posture (host_physical_user_visible=0,iova_export=disabled-future-only) is unchanged. This capability is wiring, not a new isolation backend. The opaque device-usable handle is a deterministic, non-address encoding of the buffer’s manager-owned identity under a fixed tag, so it can never collide with a page-aligned host-physical address and carries no host-physical information. - Fail-closed assertions: the shim allocates its descriptor / available /
used ring pages, programs each handle, then proves a queue-address read is
refused and that an out-of-grant handle, a raw host-physical-looking value
(
0x40000000), and a stale (freed-buffer) handle each fail closed (result=write-blocked register_write=blocked) before any MMIO write. The releasedDeviceMmiocap fails closed on the next call. The headlinecloudboot-evidence: nic-driver-userspace-ownable-vring <token>marker lands only after every assertion passes. - capOS mapping: the userspace-owned vring step of the
Phase C userspace NIC driver relocation.
The marker’s trailing labels (
vring=userspace-owned,queue_address_programming=token-resolved,host_physical_user_visible=0,provider_visible_queue_address=hidden,iova_export=disabled-future-only,out_of_grant=blocked,host_physical=blocked,stale_generation=blocked,queue_enable=not-attempted,driver_ok=not-attempted,irq=not-owned,live_cloud=not-attempted) re-anchor those bounds. - QEMU-emulable vs hardware-only: fully QEMU-emulable (the bounce backend is
the probe-selected default without a guest IOMMU). Proved locally by
make run-cloud-prod-nic-driver-userspace-ownable-vring. No GCE resources are created.
- Authority delta: the queue-address-class registers move from
kernel-internal MMIO (the
-
Production cloud-boot evidence marker (
nic-driver-userspace-queue-enable-driver-ok): the userspace queue-enable / DRIVER_OK step of the Phase C NIC-driver relocation track. Under the focused-proof Cargo featurecloud_virtio_net_userspace_queue_enable_driver_ok_proof(which implies the ownable-vring featurecloud_virtio_net_userspace_ownable_vring_proof), the userspacecloud-prod-nic-driver-userspace-queue-enable-driver-ok-smokeservice drives the handshake to FEATURES_OK and programs its owned vring exactly as the ownable-vring capability does, then completes device bring-up from userspace: it arms its programmed TX queue and writesDRIVER_OK.- Authority delta: two more writes join the handshake/ownable-vring
selected-write admission, both under the same range-check + read-back
discipline. (1)
queue_enable(0x1c, u16): a range-checked pass-through selected write, admitted bydevice_manager::stub::write_devicemmio_u32(admit_virtio_queue_address_write) only when the active queue’s vring memory is live and page-fitting (selected_queue_ready_to_enable): the kernel reads the activequeue_desc/queue_driver/queue_deviceback kernel-side and requires each to currently hold the host-physical address of a live grantedDMABufferon this device (a freed buffer’s stale address no longer matches a live buffer, so it cannot arm a use-after-free DMA target), and requires the activequeue_sizeto fit every split-ring structure (16*sizedesc table,6+8*sizeused ring,6+2*sizeavail ring) inside one granted bounce page. An enable of an unprogrammed, freed, or oversized queue fails closed before any MMIO side effect; the enable is read-back-asserted. Once a queue is enabled its vring base registers are immutable – a queue-address repoint (even with an otherwise-valid live token) is refused (devicemmio-queue-address-immutable-after-enable) so the driver cannot mutate the vring under a running device. (2)DRIVER_OK(a bit in device-status 0x14): the device-status register is already writable (from the handshake capability), but settingDRIVER_OKis kernel-asserted – the kernel re-reads device-status and fails closed (devicemmio-driver-ok-not-observed) unless the device latched the fullACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OKbyte exactly (rejectingFAILED0x80,DEVICE_NEEDS_RESET0x40, and any reserved bit), so a userspace driver cannot claim a brought-up device the hardware did not accept. - Reuses landed DMA isolation: this capability adds no new register write
primitive, no new isolation backend, and no host-physical exposure. It reuses
the ownable-vring bounce /
DMAPool/DeviceMmiogrants and writable window unchanged; queue-address reads (0x20..0x38) stay refused (host_physical_user_visible=0). The enable binds to live, page-fitting, post-enable-immutable queue memory, so the device is never armed at a freed, oversized, or mutated vring. - Bounded residual (handled by the RX bring-up capability below): the
enable’s live + page-fit check
is point-in-time and matches by host-physical address, not buffer identity;
it does not pin the ring buffers against
freeBuffer/ process-teardown release while the queue is enabled. Both are use-after-free-DMA hazards only once a descriptor is posted and the doorbell rung – which this capability never does (frame_tx=not-attempted; the RX queue is never enabled; the TX queue is kick-driven), so no device DMA is reachable here and DMA stays confined to the granted bounce pool. Buffer-identity binding and pinning are the data path’s responsibility (vring_buffer_pinning=deferred-slice-4); tracked by the userspace RX/DMA task records. - Fail-closed assertions: the shim proves the ownable-vring out-of-grant /
host-physical / stale (freed-throwaway-buffer) queue-address writes fail
closed and an enable of the unprogrammed RX queue (index 0) fails closed,
then arms the programmed TX queue (
queue_enable=1,register_write=performed), setsDRIVER_OKand re-reads device-status to confirm the full brought-up byte, and proves a post-enable queue-address repoint (with an otherwise-valid live token) fails closed. The releasedDeviceMmiocap fails closed on the next call. The headlinecloudboot-evidence: nic-driver-userspace-queue-enable-driver-ok <token>marker lands only after every assertion passes. - capOS mapping: the userspace queue-enable / DRIVER_OK step of the
Phase C userspace NIC driver relocation.
The marker’s trailing labels (
vring=userspace-owned,queue_enable=performed,unprogrammed_queue_enable=blocked,device_brought_up=driver-ok,status_full=0f,driver_ok=observed,vring_live_bound=enforced,queue_size_fits_grant=enforced,post_enable_immutable=blocked,host_physical_user_visible=0,provider_visible_queue_address=hidden,frame_tx=not-attempted,nic_cap=not-implemented,irq=not-owned,live_cloud=not-attempted) re-anchor those bounds. TheNic-cap TX/RX round-trip (no frame crosses the wire here) and userspace IRQ ownership are later capabilities below. - QEMU-emulable vs hardware-only: fully QEMU-emulable (the bounce backend is
the probe-selected default without a guest IOMMU). Proved locally by
make run-cloud-prod-nic-driver-userspace-queue-enable-driver-ok. No GCE resources are created.
- Authority delta: two more writes join the handshake/ownable-vring
selected-write admission, both under the same range-check + read-back
discipline. (1)
-
Production cloud-boot evidence marker (
nic-driver-userspace-rx-bringup): the userspace RX-queue bring-up step of the Phase C NIC-driver relocation track. Undercloud_virtio_net_userspace_rx_bringup_proof(implies the queue-enable feature) thecloud-prod-nic-driver-userspace-rx-bringup-smokeservice brings up the RX virtqueue (index 0) over its own vring – the handshake/vring/enable capabilities above brought up only the TX queue (index 1); thequeue_enableadmission is queue-agnostic, so RX bring-up reuses it.- capOS mapping: the kernel
(
device_manager::stub) retains each programmed queue’s vring physes + originatingDMABufferhandle identity onProductionDeviceRecord(admit_virtio_queue_address_write), bindsqueue_enableto that identity (selected_queue_identity_bound; a freed/realloc’d handle fails closed withdevicemmio-queue-enable-identity-mismatch), and pins the ring buffers againstfreeBuffer/ process-teardown release while the queue is enabled (blocked_pinned_enabled_vring→dmabuffer-pinned-enabled-vring), released only on queue disable/reset with device quiesce. This closes the queue-enable capability’s pre-migration buffer-lifetime/identity residual at the bring-up boundary. Marker labels:rx_queue_brought_up=driver-ok,buffer_identity_bound=enforced,vring_buffer_pinning=enforced,pinning_free_while_enabled=blocked,int_injected=0,nic_cap=not-implemented,irq=not-owned,live_cloud=not-attempted. - First real RX DMA: the same feature also
drives the first real RX DMA from the shim-owned vring. The shim also
brings up TX queue 1 over its own vring, posts one device-writable RX
receive buffer on queue 0 (
DMABuffer.submitDescriptor), and rings the productionDeviceMmio.notifyDoorbell @5. capOS mapping: the kernel maps the notify region kernel-side and captures the per-queue notify slot offsets (cap::devicemmio_grant_source_prod), andprovider_notify_doorbell_write_for_cap(wasErr(stale_handle)) is now a live drive; the RX-DMA flow (cap::virtio_net_userspace_rx_dma_proof, byte-level vring helpers duplicated fromcap::virtio_net_polled_providerto protectrun-net) writes the RX descriptor + avail over the shim’s retained RX vring physes, rings the RX doorbell, submits a kernel-half ARP request over the shim’s retained TX vring physes, polls one real device->host completion (used_len > 0, observed EtherType0x0806), resets the device (quiescing both queues and releasing the ring-buffer pins viamark_retained_vring_queue_disabled), and latches the used-ring index. Completion stays kernel-latched used-ring polled (int_injected=0, noInterruptcap). Marker labels addtx_queue_brought_up=driver-ok,frame_rx=performed,rx_used_ring=kernel-latched. The kernel emits onevirtio-net-userspace-rx-dma: rx_dma=performed ... used_len=<n> ethertype=0x0806 device_reset=ok queues_cleared=ok int_injected=0evidence line. - Not yet implemented: the deterministic freed-then-reallocated-frame identity
negative (
identity_realloc_negative=deferred-needs-allocator-reuse-seam). Thecapos-libFrameBitmapis next-fit andfree_framedoes not rewindnext_hint, so the allocation after a free never returns the just-freed frame; a deterministic same-phys realloc (needed to reach the buffer-identity gate rather than the host-physical gate) requires an allocator reuse seam. Tracked bycloud-prod-nic-driver-userspace-rx-dma-identity-realloc-negative-local-proof. - Future work: the
Nic-cap round-trip (the next capability below, unblocked by this data path) and userspace IRQ ownership. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-prod-nic-driver-userspace-rx-bringup. No GCE resources are created.
- capOS mapping: the kernel
(
-
Production cloud-boot evidence marker (
nic-driver-userspace-nic-cap-roundtrip): theNic-cap round-trip step of the Phase C NIC-driver relocation track. It implements the handshake-stepNicinterface stub as a liveCapObject. Under thecloud_virtio_net_userspace_nic_cap_roundtrip_prooffeature (implies the RX-bring-up feature) thecloud-prod-nic-driver-userspace-nic-cap-roundtrip-smokeservice brings the device fully up from userspace (RX queue 0 + TX queue 1 enabled, DRIVER_OK), then holds a typedNiccap and round-trips two sequential frames. capOS mapping:- The new
nicKernelCapSource(registered incapos-configmanifest.rslib.rs::NIC_INTERFACE_ID+ thenic @49schema/capos.capnpKernelCapSourceenum value; clientNicClientincapos-rt) is granted fromcap::nic_grant_source_prod, which maps the picked virtio-net function’s device-config window kernel-side formacAddress/linkStatusand binds theNiccap to that BDF.
transmit/receivedrive the shim’s retained vring physes throughcap::virtio_net_userspace_rx_dma_proof::{nic_transmit, nic_receive}(reusing the RX-bring-up byte-level vring helpers) with manager-owned kernel bounce payloads – not a shim-submittedDMABuffer– so a frame crosses the cap boundary as inlineDatawithhost_physical_user_visible = 0and no device-usable handle exported.receivedrives the coupled ARP-TX-stimulus + RX-poll and returns the frame inline + observed EtherType;transmitstages a frame into a manager-owned TX page and ringsnotify_doorbell @5.- The device is left live for the cap’s lifetime (a monotonic per-queue avail
cursor lets
transmitandreceivecompose without re-enabling) and quiesced once on cap release (nic_quiesce: device reset + queues-cleared assertion +mark_retained_vring_queue_disabledto release the enabled-vring pins). Completion stays kernel-latched used-ring polled (int_injected = 0, noInterruptcap); no new selected-write register beyond the landed handshake / ownable-vring / queue-enable set. The kernel emits twovirtio-net-userspace-nic-cap: receive ... used_len=<n> ethertype=0x0806 int_injected=0 host_physical_user_visible=0evidence lines and avirtio-net-userspace-nic-cap: quiesce ... device_reset=ok queues_cleared=okline on release. The proof also covers lifecycle ordering: aDMAPoolcap release while ring buffers are still live recordspending-buffer-release, an early release of one pinned ringDMABufferrecordsdmabuffer-pinned-enabled-vring,Nicquiesce replays that buffer detach after the queues are reset, and the pending parent pool release completes only after the remaining ring buffers are freed. - Future work: the clean independent TX/RX split and userspace IRQ ownership (both later capabilities below).
- QEMU-emulable vs hardware-only: fully QEMU-emulable (the RX reply is QEMU
SLIRP’s ARP answer to the kernel-half stimulus). Proved locally by
make run-cloud-prod-nic-driver-userspace-nic-cap-roundtrip. No GCE resources are created.
- The new
-
Production cloud-boot evidence marker (
nic-driver-userspace-irq-ownership): the userspace RX-interrupt-lifecycle ownership step of the Phase C NIC-driver relocation track. It gives the userspace NIC driver real RX-interrupt-lifecycle ownership. TheNic-cap round-trip capability above hasint_injected = 0and noInterruptcap on the data path; this capability adds a realInterruptcap whosewait/acknowledge/mask/unmaskdrive the route’s MSI-X vector-control + deferred LAPIC EOI (the frame bytes still arrive viaNic.receive’s used-ring read). Under thecloud_virtio_net_userspace_irq_ownership_prooffeature (implies thenic-cap-roundtripfeature) thecloud-prod-nic-driver-userspace-irq-ownership-smokeservice holds aDeviceMmio+DMAPool+Nic+Interruptcap on the same virtio-net function. capOS mapping:- A new
Interruptgrant source (cap::virtio_net_userspace_irq_ownership_proof) replaces the admission-onlyinterrupt_grant_source_prodsource via theKernelCapSource::Interruptarm under this feature. At boot it programs the staged virtio-net function’s RX MSI-X route (table entry 0) mask-first through the landed always-builtcap::interrupt_programmed::program_attach_arm_unmask(route register / claim / MSI-X table map+write / device-manager attach / deferred-LAPIC-EOI arm / unmask) and tears it down (teardown) on cap release. - The
Interruptcap’s methods are real for this device RX route:waitblocks on a real interrupt dispatch through the route’s MSI-X / LAPIC dispatch slot (device_interrupt::wait_kernel_injected_dispatch;delivery_countadvances, soint_injectedflips from 0 – theNic-cap round-trip capability had noInterruptcap on the data path at all). The wake is a bounded kernel-injected dispatch (not yet a device-autonomous raise causally tied to a frame), andNic.receivestill reads the frame bytes from the used ring, so the delta is IRQ-lifecycle ownership (realwait/acknowledge/mask/unmask), not interrupt-coalesced RX completion.acknowledgeretires exactly one deferred LAPIC EOI throughdevice_interrupt::acknowledge_deferred_lapic_eoi_for_route(hardwareDispatchAckDelta = 1, the one-ack-per-delivery /hardware_eoi_deltainvariant);mask/unmasktoggle the route’s own MSI-X vector-control bit (mask-first per PCI 3.0 §6.8.2: table-mask then route-state on mask; route-state then table-unmask on unmask) throughpci::set_msix_table_entry_mask+device_interrupt::{mask,unmask}_device_manager_attached_route(driver-unmasked<->claimed-masked). - The driver brings the device up from userspace (the
nic-cap-roundtripbring-up verbatim), drives the owned RX-interrupt lifecycle (info/wait/acknowledge/mask/unmask/release), and reads the completed frame back throughNic.receive(inlineData,host_physical_user_visible = 0). The PCI function-level MSI-X enable bit is not toggled and no device-autonomous raise is attempted (device_autonomous_raise=not-attempted,waiter_wake=kernel-injected-dispatch); the landed DMA isolation, the owned-vring grants, and buffer-identity / ring-buffer pinning are reused unchanged (queue-address reads still refused). No newInterruptinterface or method. - Future work: the clean independent TX/RX split (the next capability below);
the device-autonomous MSI-X raise (program the device RX
queue_msix_vector+ clear the PCI function mask) and the smoltcp network-stack relocation. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-prod-nic-driver-userspace-irq-ownership(onecloudboot-evidence: nic-driver-userspace-irq-ownership <token>marker). No GCE resources are created.
- A new
-
Production cloud-boot evidence marker (
nic-driver-userspace-clean-tx-rx-split): the independent-TX/RX step of the Phase C NIC-driver relocation track. It decouples the last data-path coupling – the userspace NIC driver’sNic.transmitandNic.receivebecome truly independent. In thenic-cap-roundtrip/ IRQ-ownership capabilities,Nic.receive(virtio_net_userspace_rx_dma_proof::nic_receive) self-stimulated by submitting a kernel-half ARP TX over the retained TX vring inside the same call. Under thecloud_virtio_net_userspace_clean_tx_rx_split_prooffeature (implies theirq-ownershipfeature) theNiccap’sreceive @1dispatches instead tonic_receive_independent. capOS mapping:nic_receive_independentposts a manager-owned device-writable RX buffer on the retained RX vring, rings the RX doorbell, waits on the driver’s OWNED RX interrupt route (the IRQ-ownershipdevice_interrupt::wait_kernel_injected_dispatchdispatch slot, resolved throughvirtio_net_userspace_irq_ownership_proof::owned_rx_route;int_injectedflips from 0), retires the deferred LAPIC EOI (acknowledge_deferred_lapic_eoi_for_route), then polls the RX used ring and reads the completed frame – with no internal ARP-TX self-stimulus (it never submits to the TX vring; the kernel diagnostic reportstx_submissions=0 self_stimulus=removed).- The RX frame is driven by an external stimulus: the consumer’s preceding
independent
Nic.transmitof a real broadcast ARP request (who-has the QEMU SLIRP gateway 10.0.2.2). SLIRP answers; the inbound reply is held in the host net queue untilreceiveposts the RX buffer + kicks the RX queue. Nic.transmitstays independent: it submits the caller’s frame to the TX vring and rings the TX doorbell with no RX involvement (the kernel diagnostic reportsrx_polls=0 rx_submissions=0). Neither call performs the other’s submission.- The wake stays the bounded kernel-injected dispatch the IRQ-ownership
capability owns
(
waiter_wake=kernel-injected-dispatch,device_autonomous_raise=not-attempted). The landed owned-vring / owned-IRQ / DMA-isolation, the writable common-config window, and the buffer-identity / ring-buffer pinning are reused unchanged: no new selected-write register, no new MSI-X surface, no newNic/Interruptmethod, no host-physical / handle exposure (host_physical_user_visible = 0, queue-address reads refused). - Follow-up work: DHCP/IPv4 configuration, legacy kernel socket-path
retirement, kernel
smoltcp/ virtio-net hot-path removal, and the device-autonomous MSI-X raise. The 7c-ii(b) serve-from-userspace local proof is now landed. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-prod-nic-driver-userspace-clean-tx-rx-split(onecloudboot-evidence: nic-driver-userspace-clean-tx-rx-split <token>marker). No GCE resources are created.
-
Production cloud-boot evidence marker (
nic-driver-userspace-sustained-receive-pool): Phase C slice 7d (DONE 2026-06-04) adds the sustained-receiveNicABI the multi-frame TCP path (7c-iii) needs. The landedreceive @1is single-frame + reset-on-empty-poll; this adds a non-resetting poll over a kernel-owned bounce RX pool. Under thecloud_virtio_net_userspace_sustained_receive_pool_prooffeature (implies the clean-split feature) theNiccap servesreceivePoll @4(cap::nic_grant_source_prod->virtio_net_userspace_rx_dma_proof::nic_receive_poll). capOS mapping:- Arm. On first
receivePollthe kernel allocatesNIC_RX_POOL_SIZEmanager-owned bounce RX frames (frame::alloc_frame_zeroed), posts one device-writable descriptor + avail entry per frame on the retained RX vring, publishesavail.idx, and rings the RX doorbell. The device masters only into these kernel-private pages; no host-physical or device-usable address is exported (host_physical_user_visible = 0). - Drain one per poll. Each
receivePollre-kicks the RX doorbell (so QEMU flushes a queued inbound frame into an armed buffer during the MMIO VM exit) and reads the RX used ring. If it advanced, the kernel copies the frame out into the inlineDatareply (bounded by the posted buffer length) and recycles that bounce slot. - The per-buffer invariant replacing reset-before-reclaim. A bounce slot is
re-exposed to the device only after its copy-out completes and its slot
generation is bumped, with the slot scrubbed before the re-post – the
production handle-epoch slot identity (
docs/dma-isolation-design.md) applied at recycle granularity instead of device-reset granularity. The device is not reset per frame (device_reset=none); teardown (on_releasevianic_quiesce, or an unprovable in-flight-DMA error) still quiesces (reset + queues cleared) and scrubs + frees the pool. - No frame yet. If the used ring did not advance,
receivePollreturnsframePresent = falsewith no reset and the device stays armed (device_armed=true) – the cheap speculative poll asmoltcpphy::DeviceRX token needs.receive @1semantics are unchanged. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-prod-nic-driver-userspace-sustained-receive-pool(onecloudboot-evidence: nic-driver-userspace-sustained-receive-pool <token>marker after draining more than one frame with at least one non-resetting empty poll). No GCE resources are created. - Follow-up work: DHCP/IPv4 configuration consumes the served socket path;
later cleanup removes or fixture-gates the legacy kernel socket path and
kernel
smoltcp/ virtio-net hot path. The 7c-ii(b) production manifest proof now consumes the userspace-servedTcpListenAuthority,TcpListener, andTcpSocketsubstrate locally.
- Arm. On first
-
Production cloud-boot evidence marker (
network-stack-process-smoltcp-skeleton): Phase C slice 7a (first increment, DONE 2026-06-03) is the first time a real TCP/IP stack runs outside the kernel over the relocated NIC authority. A userspace network-stack process builds ansmoltcpInterface(Ethernet medium, MAC fromNic.macAddress, static IPv410.0.2.15/24) over aphy::Deviceadapter whose RX/TX is the slice-6 independentNic.receive/Nic.transmit, clocked by theTimercap monotonic source (monotonic_ns). capOS mapping:- The
phy::Deviceadapter is buffered: outbound framessmoltcpproduces queue in a process-localVecthat the poll loop drains and submits viaNic.transmit; one inbound frame fetched viaNic.receiveis handed back forsmoltcpto consume. The adapter holds no vring, DMA handle, or host-physical address – every frame is a process-local byte buffer crossing the cap boundary as inlineDatathrough the manager-owned bounce page (host_physical_user_visible = 0). - The proof is that
smoltcp– not hand-rolled frame code – drives the exchange: a UDP datagram queued to the on-link gateway makessmoltcpemit an ARP request (out throughNic.transmit), the SLIRP ARP reply is consumed (in throughNic.receive, EtherType 0x0806), and – with the neighbour now resolved –smoltcpemits the queued IPv4/UDP datagram, so the neighbour cache observably advances (smoltcp_tx_arp>=1,smoltcp_rx_consumed>=1,smoltcp_tx_ipv4>=1). The internalsmoltcpUDP socket is only an egress stimulus; no socket capability is exposed. - Implementation note: the landed
Niccap rides on the userspace driver shim’s retained vring (the kernel does not own the vring), so the skeleton process performs the slice-1-6 bring-up itself before runningsmoltcp. Relocating the bring-up to a separate long-lived NIC-driver service is folded into the slice-7c contract relocation. - Out of scope: the socket caps (slice 7b), the
cap/network.rscontract relocation (slice 7c,virtio_stub.rsstays fail-closed), and the kernelsmoltcp/ virtio-net removal (slice 8). - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-prod-network-stack-process-smoltcp-skeleton(onecloudboot-evidence: network-stack-process-smoltcp-skeleton <token>marker). No GCE resources are created.
- The
-
Production cloud-boot evidence marker (
network-stack-smoltcp-socket-caps): Phase C slice 7b (DONE 2026-06-03) adds a userspaceUdpSocketcap layer on top of the slice-7a substrate: the userspace network-stack process now implements theUdpSocketschema’ssendTo/recvFromsemantics (UdpSocketCapLayer) over the samesmoltcpInterfaceand proves one bounded UDP request/response through it. capOS mapping:- The socket layer drives the slice-7a
phy::Device/Nicpump:sendToresolves the destination’s on-link ARP (oneNic.receivefor the guaranteed ARP reply, EtherType 0x0806) and transmits the datagram throughNic.transmit;recvFromfetches the single solicited reply datagram throughNic.receive(EtherType 0x0800) and returns it. Frames stay process-local byte buffers (host_physical_user_visible = 0); a queue-address read stays refused. - The request/response is a DNS A query for
example.comto SLIRP’s built-in resolver at10.0.2.3:53(the same resolver the Cposix-dns-resolversmoke uses); the decoded response is returned throughrecvFromand the proof asserts source10.0.2.3:53, the transaction-id/QR/RCODEcorrelation, and a decoded A record. The landedNic.receiveresets the device on an empty poll, so the proof only receives when a reply is guaranteed pending and spaces aTimerpre-delay before the datagram receive. - Honest boundary: the socket layer is in-process – it implements the
socket interface semantics over the userspace stack but does not yet serve
them as inter-process transferable capabilities, and it does not touch
the production
kernel/src/cap/network.rscontract (virtio_stub.rsstays fail-closed). Preserving that contract behind a userspace network-stack service is slice 7c. - QEMU-emulable vs hardware-only: fully QEMU-emulable (relies on SLIRP’s DNS
forwarder). Proved locally by
make run-cloud-prod-network-stack-smoltcp-socket-caps(onecloudboot-evidence: network-stack-smoltcp-socket-caps <token>marker). No GCE resources are created.
- The socket layer drives the slice-7a
-
Production cloud-boot evidence marker (
userspace-network-stack-smoltcp): Phase C slice 7c, first increment (DONE 2026-06-03) serves the slice-7bUdpSocketCapLayeras a real inter-process transferable capability. capOS mapping:- A network-stack server process holds the bring-up caps plus an exported
Endpoint; after bring-up it serves theUdpSocketschema (sendTo/recvFrom/close) over that endpoint, driving the sameUdpSocketCapLayeron its own ring (decoding/encoding the capnp params and results). A separate client process holds onlyConsoleand the served cap; it re-interprets the importedEndpointas aUdpSocketand drives one bounded DNS A query/response through the productionUdpSocketClient. smoltcpstill moves every frame through theNiccap (ARP reply EtherType 0x0806 + DNS reply 0x0800 throughNic.receive);host_physical_user_visible = 0is preserved and a queue-address read stays refused. Onclosethe server releases its owned RXInterrupt(route_torn_down=ok).- Honest boundary: the
UdpSocketcontract lives behind a userspace network-stack service. Later Phase C increments added theTcpListener/TcpSocketsubstrate, inter-process serving, and the local 7c-ii(b) serve-from-userspace manifest proof forTcpListenAuthority. DHCP/IPv4, Web UI L4, private GCE reachability, public ingress, and legacy kernel-socket cleanup remain separate work. - QEMU-emulable vs hardware-only: fully QEMU-emulable (relies on SLIRP’s DNS
forwarder). Proved locally by
make run-cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc(onecloudboot-evidence: userspace-network-stack-smoltcp <token>marker). No GCE resources are created.
- A network-stack server process holds the bring-up caps plus an exported
-
Production cloud-boot evidence marker (
virtio-net-tx-authority-bundle): under the focused-proof Cargo featurecloud_virtio_net_tx_authority_bundle_proof, the cloudboot kernel layers a bundle observer (cap::virtio_net_tx_authority_bundle_proof) on top of the three existing production grant sources (devicemmio_grant_source_prod,dmapool_grant_source_prod,interrupt_grant_source_prod). Under the feature, the DeviceMmio source filters its PCI candidate to the same virtio/NVMe-class function the DMAPool and Interrupt sources already match, so all three grants land on the same virtio-net function. Exposed throughmake run-cloud-provider-virtio-net-tx-authority-bundle.- Implemented wire-format subset: no new MMIO/DMA/IRQ writes. The
bundle reuses the existing prod sources’ grant + per-cap on-release
surfaces and asserts the bundle identity over their issue and release
notifications via the
record_devicemmio_grant/record_dmapool_grant/record_interrupt_grant/record_devicemmio_release/record_dmapool_release/record_interrupt_releasehooks called from the existing build_cap_for_grant / on_release / release_cap paths. - Fail-closed assertions: the userspace
cloud-provider-virtio-net-tx-authority-bundle-smokeservice callsinfoon each of the three caps and asserts they all report the same BDF. The kernel-side bundle observer records each grant’s(bdf, generation)identity at issue and at release; the headlinecloudboot-evidence: virtio-net-tx-authority-bundle <token>marker is emitted only after all three caps have been issued and released andsame_dm/same_dp/same_ir/same_bdfall evaluate true. A BDF mismatch logsvirtio-net-tx-authority-bundle: assertion regression: ...and leaves the marker unprinted. Per-cap stale-handle fail-closed is inherited from the existing prod sources’validate_*_recordpaths; the smoke re-tests it explicitly after each release. - capOS mapping: bundle authority composition over the
DeviceMmio+DMAPool+Interruptgrant arms; first child of the blockedcloud-prod-virtio-net-userspace-provider-tx-local-proofparent. No virtio queue setup, no descriptor publication, no notify doorbell, no PCI function-level MSI-X enable, noInterrupt.wait, no TX completion claim, no live cloud traffic. The marker’s trailing labels (same_bdf=true,queue_setup=not-attempted,tx_descriptor=not-published,notify=not-rung,msix_function_enable=not-toggled,tx_completion=not-claimed,live_cloud=not-attempted) re-anchor those bounds. The bundle feature is mutually exclusive withqemu,cloud_provider_cap_waiter_proof, andcloud_virtio_net_device_bringup_proofat thecap::mod.rsactivation site. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally
by
make run-cloud-provider-virtio-net-tx-authority-bundle. No GCE resources are created.
- Implemented wire-format subset: no new MMIO/DMA/IRQ writes. The
bundle reuses the existing prod sources’ grant + per-cap on-release
surfaces and asserts the bundle identity over their issue and release
notifications via the
-
Production cloud-boot evidence marker (
virtio-net-tx-queue-materialization): under the focused-proof Cargo featurecloud_virtio_net_tx_queue_materialization_proof, the cloudboot kernel runscap::virtio_net_tx_queue_materialization_proof(kernel/src/cap/virtio_net_tx_queue_materialization_proof.rs) over the same virtio-net function the authority bundle picks. The proof materializes one manager-owned TX virtqueue: it allocates three zeroed physical frames from the kernel frame allocator, programs the TX queue’s common-configurationQUEUE_DESC/QUEUE_DRIVER/QUEUE_DEVICE+QUEUE_ENABLE = 1, asserts the device read-backs match the manager-authored host-physical addresses, then writes 0 todevice_statusand asserts every TX queue-state register has cleared to 0. Exposed throughmake run-cloud-provider-virtio-net-tx-queue-materialization.- Spec basis: virtio 1.2 §2.1.2 (reset clears all virtqueue state), §2.7 (split-ring queue layout), §4.1.4.3 (common configuration queue registers), §5.1.2 (virtio-net advertises receiveq1=0, transmitq1=1).
- Implemented wire-format subset: the proof drives the modern virtio
status sequence through reset / ACK / DRIVER / feature select
(
VIRTIO_F_VERSION_1only) / FEATURES_OK, assertsCOMMON_NUM_QUEUES >= 2, writesCOMMON_QUEUE_SELECT = 1(TX), readsCOMMON_QUEUE_SIZE, clamps to a power-of-two bound (MAX_QUEUE_SIZE = 256, so each region fits in one 4 KiB frame), allocates desc/avail/used frames throughmem::frame::alloc_frame_zeroed, programsCOMMON_QUEUE_DESC/COMMON_QUEUE_DRIVER/COMMON_QUEUE_DEVICEwith the resulting host-physical addresses +COMMON_QUEUE_ENABLE = 1, reads every queue register back through theMmioRegionaccessors (the proof grew aread_u64companion to the existingwrite_u64) and asserts the values match, setsDRIVER_OK, then writes 0 todevice_statusand asserts post-resetCOMMON_QUEUE_ENABLE/..._DESC/..._DRIVER/..._DEVICEare all 0 after re-selecting queue 1. Token grammar:<seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-desc.<hex>-drv.<hex>-dev.<hex>. - Fail-closed assertions: five inline assertions gate the marker.
(1) Initial reset reads back as 0. (2) Negotiated feature set
matches exactly
VIRTIO_F_VERSION_1. (3) Post-DRIVER_OK status reads back withACK|DRIVER|FEATURES_OK|DRIVER_OKset andFAILEDclear. (4) Programmed queue addresses + enable read back exactly as written. (5) Post-reset re-read of the TX queue state reports every queue-state register cleared to 0. The proof wraps the materialization so every exit path (success or any intermediate failure) writes 0 todevice_statusand frees every allocated frame back to the bitmap before returning. Per- stage outcomes log on thevirtio-net-tx-queue-materialization: ok .../... failed closed: ...lines so a regression trips the boot log alongside the missing marker. - capOS mapping: focused-proof child of the TX authority bundle
that extends the proven bundle composition with one round of
real common-configuration queue setup + reset cleanup. The same
boot still spawns the
cloud-provider-virtio-net-tx-authority-bundle-smokeuserspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; the bundle observer compiles in (the picker filteris_bundle_candidate_classfires under either feature) so every grant + release identity still pairs up for the debug trail, but the bundle’s headline marker is intentionally suppressed under this feature because itsqueue_setup=not-attemptedclaim would be inaccurate now. The queue-materialization marker’s trailing labels (tx_descriptor=not-published,notify=not-rung,msix_function_enable=not-toggled,tx_completion=not-claimed,provider_visible_queue_address=hidden,iova_export=disabled-future-only,live_cloud=not-attempted) re-anchor the bounds the descendant slices (descriptor publication, notify doorbell, MSI-X function enable, userspace submit, used-ring polling, live cloud) carry. Thecap::virtio_net_tx_queue_materialization_proofcaller is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof, andcloud_virtio_net_tx_authority_bundle_proofat thecap::mod.rsactivation site. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally
by
make run-cloud-provider-virtio-net-tx-queue-materialization. No GCE resources are created.
-
Production cloud-boot evidence marker (
virtio-net-rx-queue-materialization): under the focused-proof Cargo featurecloud_virtio_net_rx_queue_materialization_proof, the cloudboot kernel runscap::virtio_net_rx_queue_materialization_proof(kernel/src/cap/virtio_net_rx_queue_materialization_proof.rs) over the same virtio-net function the authority bundle picks. It is the structural mirror of the TX queue-materialization proof, one virtqueue index over: it materializes one manager-owned RX virtqueue (queue index 0) instead of the TX virtqueue (queue index 1). The proof allocates three zeroed physical frames from the kernel frame allocator, programs the RX queue’s common-configurationQUEUE_DESC/QUEUE_DRIVER/QUEUE_DEVICE+QUEUE_ENABLE = 1, asserts the device read-backs match the manager-authored host-physical addresses, then writes 0 todevice_statusand asserts every RX queue-state register has cleared to 0. Exposed throughmake run-cloud-provider-virtio-net-rx-queue-materialization.- Spec basis: virtio 1.2 §2.1.2 (reset clears all virtqueue state), §2.7 (split-ring queue layout), §4.1.4.3 (common configuration queue registers), §5.1.2 (virtio-net advertises receiveq1=0, transmitq1=1).
- Implemented wire-format subset: identical to the TX
queue-materialization proof except it writes
COMMON_QUEUE_SELECT = 0(RX) instead of1(TX) and re-selects queue 0 for the post-reset read-back. The proof drives the modern virtio status sequence through reset / ACK / DRIVER / feature select (VIRTIO_F_VERSION_1only) / FEATURES_OK, assertsCOMMON_NUM_QUEUES >= 2, readsCOMMON_QUEUE_SIZE, clamps to a power-of-two bound (MAX_QUEUE_SIZE = 256, so each region fits in one 4 KiB frame), allocates desc/avail/used frames throughmem::frame::alloc_frame_zeroed, programsCOMMON_QUEUE_DESC/COMMON_QUEUE_DRIVER/COMMON_QUEUE_DEVICE+COMMON_QUEUE_ENABLE = 1, reads every queue register back through theMmioRegionaccessors and asserts the values match, setsDRIVER_OK, then writes 0 todevice_statusand asserts post-resetCOMMON_QUEUE_ENABLE/..._DESC/..._DRIVER/..._DEVICEare all 0 after re-selecting queue 0. Token grammar:<seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-desc.<hex>-drv.<hex>-dev.<hex>(withq.0for RX). - Fail-closed assertions: the same five inline assertions gate the
marker as in the TX proof — initial reset reads back 0; negotiated
feature set is exactly
VIRTIO_F_VERSION_1; post-DRIVER_OK status hasACK|DRIVER|FEATURES_OK|DRIVER_OKset andFAILEDclear; programmed queue addresses + enable read back exactly as written; post-reset re-read of the RX queue state reports every queue-state register cleared to 0. The proof wraps the materialization so every exit path (success or any intermediate failure) writes 0 todevice_statusand frees every allocated frame back to the bitmap before returning. Per-stage outcomes log on thevirtio-net-rx-queue-materialization: ok .../... failed closed: ...lines. - capOS mapping: focused-proof sibling of the TX
queue-materialization proof that drives the same kernel-side queue
setup + reset cleanup against the receive virtqueue. The same boot
still spawns the
cloud-provider-virtio-net-tx-authority-bundle-smokeuserspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; the bundle observer compiles in through its shared cfg gate so every grant + release identity still pairs up for the debug trail, but the bundle’s headline marker is intentionally suppressed under this feature (it is gated oncloud_virtio_net_tx_authority_bundle_proof, which this feature does not enable) because itsqueue_setup=not-attemptedclaim would be inaccurate now. The RX-queue-materialization marker’s trailing labels (rx_buffer=not-posted,avail=not-published,notify=not-rung,rx_completion=not-claimed,msix_function_enable=not-toggled,provider_visible_queue_address=hidden,iova_export=disabled-future-only,device_autonomous_raise=not-claimed,live_cloud=not-attempted) re-anchor the bounds the descendant slices (receive-buffer post, avail publication, notify doorbell, used-ring consumption, MSI-X function enable, live cloud) carry. Thecap::virtio_net_rx_queue_materialization_proofcaller is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof,cloud_virtio_net_tx_authority_bundle_proof, and every TX proof feature at thecap::mod.rsactivation site. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally
by
make run-cloud-provider-virtio-net-rx-queue-materialization. No GCE resources are created. Live-GCE RX stays undercloud-gcp-virtio-net-nic-driver.
-
Production cloud-boot evidence marker (
virtio-net-rx-buffer-post): under the focused-proof Cargo featurecloud_virtio_net_rx_buffer_post_polled_completion_proof(which impliescloud_virtio_net_rx_queue_materialization_proofso the bundle observer- production grant-source pickers + userspace bundle smoke keep their
plumbing), the cloudboot kernel runs
cap::virtio_net_rx_buffer_post_polled_completion_proof(kernel/src/cap/virtio_net_rx_buffer_post_polled_completion_proof.rs) over the same virtio-net function the authority bundle picks. It is the RX analogue of the TX submit-doorbell -> polled-completion progression: it materializes the RX virtqueue (queue index 0) AND the TX virtqueue (queue index 1, the SLIRP stimulus path), setsDRIVER_OK, posts ONE manager-owned device-writable receive buffer to the RX avail ring, rings the RX notify doorbell once, fills and TX-submits ONE broadcast ARP request as the SLIRP stimulus, rings the TX notify doorbell once, then polls the manager-owned RX used ring with a bounded spin budget untilused.idx == 1and assertsused[0].id == 0andused[0].len > 0— ONE real device->host RX DMA landed in the manager-owned bounce page. Exposed throughmake run-cloud-provider-virtio-net-rx-buffer-post.
- Spec basis: virtio 1.2 §2.1.2 (reset clears virtqueue state), §2.7
(split-ring queue layout), §2.7.6 (available ring:
flags @0,idx @2,ring @4), §2.7.7 (VIRTQ_AVAIL_F_NO_INTERRUPT), §2.7.8 (used ring layout), §4.1.4.3 (common configuration queue registers), §4.1.5.2 (notify doorbell), §5.1.2 (virtio-net advertises receiveq1=0, transmitq1=1), §5.1.6 (12-byte modern virtio-net header). - Implemented wire-format subset: stages 1-9 materialize the RX queue
(index 0) and the TX queue (index 1) identically to the queue-
materialization proof (modern status sequence to FEATURES_OK,
VIRTIO_F_VERSION_1only,COMMON_NUM_QUEUES >= 2, clampCOMMON_QUEUE_SIZEto a power of two<= MAX_QUEUE_SIZE = 256, allocate desc/avail/used frames per queue, program the per-queue registers +QUEUE_ENABLE = 1, read-back assert), thenDRIVER_OK. The RX-DMA delta authors RX descriptor slot 0 over the HHDM (addr = rx_payload_phys,len = 2048,flags = VIRTQ_DESC_F_WRITE,next = 0), sets the RX avail ring (flags = VIRTQ_AVAIL_F_NO_INTERRUPT,ring[0] = 0, release fence,idx = 1), maps the modern notify region bounded to the smallest page covering both per-queue notify slots, and rings the RX-queue notify doorbell. The stimulus mirrorsvirtio.rs::write_arp_request_frame: it fills one TX payload frame with a broadcast ARP request for the SLIRP gateway IP (10.0.2.2), authors TX descriptor slot 0 (flags = 0, device-readable), sets the TX avail ring (alsoVIRTQ_AVAIL_F_NO_INTERRUPT), and rings the TX-queue notify doorbell. The proof then polls the RXused.idxwith a bounded spin budget (POLL_USED_RING_BUDGET = 50_000_000, an order of magnitude above the in-kernelARP_RX_POLL_LIMIT = 500_000) and reads the device-authoredused[0].(id, len)plus the observed EtherType. Token grammar:<seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-rxnotify.bar.<b>.off.<hex>.mult.<u>.addr.<hex>-rxdesc.<hex>-rxdrv.<hex>-rxdev.<hex>-rxpay.<hex>-rxlen.<u>-availidx.<u>-usedidx.<u>-usedid.<u>-usedlen.<u>-ethertype.<hex>. - Fail-closed assertions: the queue-materialization assertions gate
both queue setups (initial reset reads 0; negotiated features exactly
VIRTIO_F_VERSION_1; post-DRIVER_OKstatus hasACK|DRIVER|FEATURES_OK|DRIVER_OKset andFAILEDclear; programmed queue addresses + enable read back exactly as written). The RX-DMA delta adds: the polledused.idxreaches1within the spin budget (else fail closed, no marker); the post-completion RXavail.idxreads back1;used[0].id == 0(the published descriptor head);used[0].len > 0(a real device->host frame); and the post-reset re-read of both the RX and TX queue-state registers reports every register cleared to 0. The proof resets the device on every exit path (success or any intermediate failure) and frees the eight manager-owned frames (RX desc/avail/used, TX desc/avail/used, RX payload, TX payload) only after a confirmed reset read-back of 0; if reset cannot be confirmed the frames stay retained so the device cannot DMA into a freed page. Per-stage outcomes log on thevirtio-net-rx-buffer-post: ok .../... failed closed: ...lines. - capOS mapping: focused-proof child of the RX queue-materialization
proof that drives the first real RX DMA on the production cloud
kernel. Completion is polled only: MSI-X stays disabled, no MSI-X
table entry is programmed, no
Interruptwaiter is installed, no dispatch slot is claimed, and both avail rings carryVIRTQ_AVAIL_F_NO_INTERRUPTso the device does not raise a queue-completion interrupt. The same boot still spawns thecloud-provider-virtio-net-tx-authority-bundle-smokeuserspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; both companion headline markers (virtio-net-tx-authority-bundleandvirtio-net-rx-queue-materialization) are intentionally suppressed under this feature because this proof is the new headline owner. The marker’s trailing labels (rx_buffer=posted,avail=published,notify=rung-once,rx_completion=polled-used-ring,msix_rx_function_enable=not-toggled,msix_table_write=not-performed,device_autonomous_raise=not-claimed,provider_visible_queue_address=hidden,provider_rx_submit=kernel-proxy-bounded,iova_export=disabled-future-only,live_cloud=not-attempted) re-anchor the bounds the descendant slices (RX MSI-X wait/ack, provider-driven RX submit, live cloud) carry. Thecap::virtio_net_rx_buffer_post_polled_completion_proofcaller is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof,cloud_virtio_net_tx_authority_bundle_proof, and every TX proof feature at thecap::mod.rsactivation site (inherited through the implied RX-materialization feature’scompile_error!s). - QEMU-emulable vs hardware-only: fully QEMU-emulable; the SLIRP
-netdev userbackend delivers the ARP reply that drives the RX DMA. Proved locally bymake run-cloud-provider-virtio-net-rx-buffer-post. No GCE resources are created. Live-GCE RX stays undercloud-gcp-virtio-net-nic-driver.
- production grant-source pickers + userspace bundle smoke keep their
plumbing), the cloudboot kernel runs
-
Production cloud-boot evidence marker (
virtio-net-msix-function-enable): under the focused-proof Cargo featurecloud_virtio_net_msix_function_enable_proof(which impliescloud_virtio_net_tx_queue_materialization_proofso the bundle observer- production grant-source pickers + userspace bundle smoke keep their
plumbing), the cloudboot kernel runs
cap::virtio_net_msix_function_enable_proof(kernel/src/cap/virtio_net_msix_function_enable_proof.rs) over the same virtio-net function the authority bundle and queue-materialization proofs pick. The proof re-drives the modern virtio status sequence toDRIVER_OK, materializes one manager-owned TX virtqueue (identical to the queue-materialization proof), then walks the PCI MSI-X capability mask-first: it reads the Message Control register, writesFUNCTION_MASK = 1first, reads back, writesENABLE = 1while keeping the function mask set, reads back, then cleans up by clearing both bits and reads back to assert PCI config-space MSI-X state is restored. Exposed throughmake run-cloud-provider-virtio-net-msix-function-enable.
- Spec basis: PCI SIG MSI-X §6.8.2 Message Control Register (bits 14 = Function Mask, 15 = MSI-X Enable); virtio 1.2 §2.1.2 (reset clears virtqueue state), §4.1.4.3 (common configuration queue registers), §5.1.2 (virtio-net advertises receiveq1=0, transmitq1=1).
- Implemented wire-format subset: stages 1-11 mirror the queue-
materialization proof. Stages 12-15 are this proof’s delta:
pci::interrupt_capabilities+MsixCapabilityInfo.offsetresolve the capability header;pci::try_read_config_u16reads the Message Control register atcapability_offset + 0x02; the proof asserts the pre-state hasMSIX_CONTROL_ENABLEclear, performs the mask-first write throughpci::try_write_config_u16, reads back and assertsFUNCTION_MASK = 1, ENABLE = 0, performs the enable write keeping the mask, reads back and asserts both bits are set, then performs the cleanup write that clears both bits, reads back and asserts both are clear. The proof never programs an MSI-X table entry, never claims an interrupt-dispatch slot, and never raises a device-autonomous interrupt. Token grammar:<seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-msix.cap.<hex>-msix.tsize.<u>-pre.<hex>-mask.<hex>-en.<hex>-cleanup.<hex>. - Fail-closed assertions: stages 1-11 inherit the queue-
materialization proof’s five inline assertions. The MSI-X delta adds
four more. (6) Pre-state read-back has
MSIX_CONTROL_ENABLEclear. (7) Post-mask-write read-back hasFUNCTION_MASK = 1andENABLE = 0. (8) Post-enable-write read-back has both bits set. (9) Post-cleanup-write read-back has both bits clear. Every exit path (success or any intermediate failure) runs a best-effortpci::try_write_config_u16that clearsMSIX_CONTROL_ENABLEandMSIX_CONTROL_FUNCTION_MASKregardless of the result chain, then writes 0 todevice_statusand frees every allocated queue frame. Per-stage outcomes log on thevirtio-net-msix-function-enable: ok .../... failed closed: ...lines so a regression trips the boot log alongside the missing marker. - capOS mapping: focused-proof child of the TX queue
materialization proof that extends the same kernel-side
activation surface with one round of canonical mask-first MSI-X
function-level enable + cleanup. The same boot still spawns the
cloud-provider-virtio-net-tx-authority-bundle-smokeuserspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; both companion headline markers (virtio-net-tx-authority-bundleandvirtio-net-tx-queue-materialization) are intentionally suppressed under this feature because theirqueue_setup=not-attempted/msix_function_enable=not-toggledclaims would be inaccurate now, so the MSI-X function-enable marker is the sole headline. The marker’s trailing labels (tx_descriptor=not-published,notify=not-rung,msix_function_enable=toggled-mask-first,msix_function_enable_cleanup=cleared,msix_table_write=not-performed,device_autonomous_raise=not-claimed,tx_completion=not-claimed,provider_visible_queue_address=hidden,iova_export=disabled-future-only,live_cloud=not-attempted) re-anchor the bounds the descendant slices (interrupt-dispatch slot, descriptor publication, used-ring polling, live cloud) carry. Thecap::virtio_net_msix_function_enable_proofcaller is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof, andcloud_virtio_net_tx_authority_bundle_proofat thecap::mod.rsactivation site. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by
make run-cloud-provider-virtio-net-msix-function-enable. No GCE resources are created.
- production grant-source pickers + userspace bundle smoke keep their
plumbing), the cloudboot kernel runs
-
Production cloud-boot evidence marker (
virtio-net-tx-submit-doorbell): under the focused-proof Cargo featurecloud_virtio_net_tx_submit_doorbell_proof(which impliescloud_virtio_net_msix_function_enable_proofand transitivelycloud_virtio_net_tx_queue_materialization_proofso the bundle observer + production grant-source pickers + userspace bundle smoke + mask-first MSI-X plumbing all keep firing), the cloudboot kernel runscap::virtio_net_tx_submit_doorbell_proof(kernel/src/cap/virtio_net_tx_submit_doorbell_proof.rs) over the same virtio-net function the authority bundle, queue-materialization, and MSI-X function-enable proofs pick. The proof re-drives the modern virtio status sequence toDRIVER_OK, materializes one manager-owned TX virtqueue, enables MSI-X function-level control mask-first, then allocates one brokered TX payload frame and fills it kernel-half as a proxy for the userspace provider’s brokered fill, writes one TX descriptor at slot 0 of the descriptor ring, publishes one avail-ring entry and advancesavail.idxto 1, maps the modern virtio notify region, rings the notify doorbell exactly once for the selected TX queue, reads the post-doorbellavail.idxand device-used.idxfor visibility, then cleans up MSI-X mask-first, resets the device, and frees all four manager-owned frames. Exposed throughmake run-cloud-provider-virtio-net-tx-submit-doorbell.- Spec basis: virtio 1.2 §2.7.6 (driver-area / available ring layout
including
idxat +2 and ring slots at +4), §2.7.8 (device-area / used ring layout includingidxat +2), §4.1.4.4 (notify-cfg capability and per-queue notify address resolution asnotify_bar_base + cap.bar_offset + queue_notify_off * notify_off_multiplier), §4.1.5.2 (modern virtio doorbell: u16 write of the queue index to the per-queue notify address), §5.1.6.2 (virtio-net TX descriptor layout). The submit ordering follows virtio 1.2 §2.7.13 (driver writes the descriptor head index toavail.ring[avail.idx % size], then bumpsavail.idxafter a suitable memory barrier). - Implemented wire-format subset: stages 1-14 mirror the MSI-X
function-enable proof (status sequence, queue materialization,
mask-first MSI-X enable). Stages 15-21 are this proof’s submit
/doorbell delta:
frame::alloc_frame_zeroedallocates one payload frame,frame::hhdm_offsettranslates the manager-owned host- physical to a kernel virtual address for the kernel-proxy fill (a minimal 12-byte modern virtio-net header followed by an 8-byteb"CAPOSTX1"body, total payload length 20 bytes); slot 0 of the descriptor ring receivesaddr = payload_phys, len = 20, flags = 0, next = 0over the HHDM write;avail.ring[0] = 0andavail.idx = 1over the HHDM with a compiler fence between them; the notify region is mapped throughpci::map_bar_regionand the kernel writes the queue index1as a u16 tonotify_vaddr + queue_notify_off * notify_off_multiplier;avail.idxis read back and asserted as1, and the device-writtenused.idxis read for visibility only. The proof never polls the used ring beyond the single visibility read, never claims a TX completion, never programs an MSI-X table entry, never raises a device-autonomous interrupt, never registers anInterruptwaiter, never performs direct DMA, never programs the IOMMU, and never exports a host-physical address or IOVA. Token grammar:<seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-msix.cap.<hex>-msix.tsize.<u>-pre.<hex>-mask.<hex>-en.<hex>-cleanup.<hex>-notify.bar.<b>.off.<hex>.mult.<u>.addr.<hex>-desc.<hex>-payload.<hex>-paylen.<u>-availidx.<u>-usedidx.<u>. - Fail-closed assertions: stages 1-14 inherit the MSI-X function-
enable proof’s nine inline assertions. The submit/doorbell delta
adds three more. (10) Notify region length must be large enough to
contain
queue_notify_off * notify_off_multiplier + 2. (11) Notify-region map length must cover that minimum. (12) Post- doorbellavail.idxround-trip must read back as1. Every exit path (success or any intermediate failure) runs the best-effort MSI-X cleanup, writes 0 todevice_status, asserts every TX queue- state register cleared to 0, and frees all four manager-owned frames (descriptor, avail, used, payload) regardless of the result chain. The device-used.idxread is deliberately NOT asserted: QEMU may or may not have drained the descriptor by the time the kernel reads it, and the proof’s discipline saystx_completion=not- claimedregardless of the observed value. Per-stage outcomes log on thevirtio-net-tx-submit-doorbell: ok .../... failed closed: ...lines so a regression trips the boot log alongside the missing marker. - capOS mapping: focused-proof child of the MSI-X function-enable
proof that extends the same kernel-side activation surface with one
round of single-slot descriptor publish + single avail-ring entry +
single notify doorbell ring, with no used-ring polling or
completion claim. The same boot still spawns the
cloud-provider-virtio-net-tx-authority-bundle-smokeuserspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; all three companion headline markers (virtio-net-tx-authority-bundle,virtio-net-tx-queue-materialization, andvirtio-net-msix-function-enable) are intentionally suppressed under this feature because theirtx_descriptor=not-published/notify=not-rung/queue_setup=not-attempted/msix_function_enable=not-toggledclaims would all be inaccurate now, so the submit/doorbell marker is the sole headline. The marker’s trailing labels (tx_descriptor=published,notify=rung-once,msix_function_enable=toggled-mask-first,msix_function_enable_cleanup=cleared,msix_table_write=not-performed,device_autonomous_raise=not-claimed,tx_completion=not-claimed,provider_visible_queue_address=hidden,provider_fill=kernel-proxy-bounded,iova_export=disabled-future-only,live_cloud=not-attempted) re-anchor the bounds the descendant slices (used-ring polling, provider waiter/ack, interrupt-dispatch slot claim, live cloud) carry. Thecap::virtio_net_tx_submit_doorbell_proofcaller is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof, andcloud_virtio_net_tx_authority_bundle_proofat thecap::mod.rsactivation site. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by
make run-cloud-provider-virtio-net-tx-submit-doorbell. No GCE resources are created.
- Spec basis: virtio 1.2 §2.7.6 (driver-area / available ring layout
including
-
Kernel-half TX polled-completion proof (predecessor of
virtio-net-userspace-provider): under the focused-proof Cargo featurecloud_virtio_net_tx_polled_completion_proof(which impliescloud_virtio_net_tx_submit_doorbell_proofand transitivelycloud_virtio_net_msix_function_enable_proof/cloud_virtio_net_tx_queue_materialization_proofso every shared plumbing gate keeps firing), the cloudboot kernel runscap::virtio_net_tx_polled_completion_proof(kernel/src/cap/virtio_net_tx_polled_completion_proof.rs) over the same virtio-net function the authority bundle, queue-materialization, MSI-X function-enable, and submit/doorbell proofs pick. The proof re-drives the modern virtio status sequence toDRIVER_OK, materializes one manager-owned TX virtqueue, enables MSI-X function-level control mask-first, allocates one brokered TX payload frame and fills it kernel-half as a proxy for the userspace provider’s brokered fill, publishes one TX descriptor + one avail-ring entry over the manager-owned ring frames, rings the notify doorbell exactly once for the selected TX queue, polls the device-authoredused.idxfrom the manager-owned used-ring frame with a bounded retry budget until it reaches1(one consumed TX descriptor), reads the post-completionavail.idxand the device-authoredused[0].(id, len), then cleans up MSI-X mask-first, resets the device, and frees all four manager-owned frames. The module is the predecessor of the userspace-submit polled-completion proof and is dropped from the compile set when the live-publish feature is on (the live-publish proof is the new headline owner ofvirtio-net-userspace-providerand exercises the same polled completion path through the userspace cap method instead of the kernel-half proxy).- Spec basis: inherits the submit/doorbell proof’s basis (virtio
1.2 §2.7.6 / §2.7.8 / §4.1.4.4 / §4.1.5.2 / §5.1.6.2 / §2.7.13).
The polled-completion delta uses §2.7.8 (used-ring layout:
idxat +2 and 8-byte(id, len)slots at +4) for both the boundedused.idxpoll and the device-authoredused[0]slot read. - Implemented wire-format subset: stages 1-20 mirror the
submit/doorbell proof (status sequence, queue materialization,
mask-first MSI-X enable, payload kernel-proxy fill, descriptor
publish, avail bump, notify doorbell ring). Stages 21-23 are this
proof’s polled-completion delta: the manager-owned used-ring
used.idxHHDM read is wrapped in a bounded retry loop (withcore::hint::spin_loop()between iterations) that converges on the target completion count1, the post-completionavail.idxHHDM round-trip is asserted as1, and the device-authoredused[0].id/used[0].lenare read with anAcquirecompiler fence on the success path so the slot data is observed consistently with theused.idxbump. Token grammar:<seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-msix.cap.<hex>-msix.tsize.<u>-pre.<hex>-mask.<hex>-en.<hex>-cleanup.<hex>-notify.bar.<b>.off.<hex>.mult.<u>.addr.<hex>-desc.<hex>-payload.<hex>-paylen.<u>-availidx.<u>-usedidx.<u>-polled.iter.<u>-usedid.<u>-usedlen.<u>. - Fail-closed assertions: stages 1-20 inherit the submit/doorbell
proof’s twelve inline assertions. The polled-completion delta adds
three more. (13) The bounded
used.idxpoll must converge on the target1within the retry budget; budget exhaustion fails closed and reports the last observed value. (14) The post-completionavail.idxHHDM round-trip must still read back as1. (15) The device-authoredused[0].idmust equal the published descriptor head index0;used[0].lenis recorded for visibility but is deliberately NOT asserted (virtio-net leaveslenat0for TX device-readable chains, but the kernel does not gate the proof on that). Every exit path (success or any intermediate failure) runs the best-effort MSI-X cleanup, writes 0 todevice_status, asserts every TX queue-state register cleared to 0, and frees all four manager-owned frames (descriptor, avail, used, payload) only after the final reset read-back is confirmed; if reset cannot be confirmed the frames stay retained rather than being returned while the device may still DMA them. Per-stage outcomes log on thevirtio-net-userspace-provider: ok .../... failed closed: ...lines so a regression trips the boot log alongside the missing marker. - capOS mapping: focused-proof child of the submit/doorbell proof
that extends the same kernel-side activation surface with one
round of bounded
used.idxpolling + one accounted completion + one device-authoredused[0]slot read, paired with the userspace bundle smoke’sInterruptcap handle-lifecycle discipline on the same MSI-X BDF (Interrupt.inforound-trip identity assertion +release+ post-releaseInterrupt.infofail-closed). That cap-side pairing covers cap-handle identity and post-release stale-handle rejection on the productionInterruptcap; it deliberately does NOT exerciseInterrupt.wait/acknowledge, because the productionInterruptCapProd::waitandInterruptCapProd::acknowledgepaths are unimplemented in the non-qemucloud kernel and fail closed (kernel/src/cap/interrupt_prod.rs). Real waiter/ack pairing on the virtio-net TX MSI-X route is deferred to a future child that either ports thecap::provider_cap_waiter_proofkernel-injected-dispatch + deferred-EOI discipline onto this route or programs an actual MSI-X table entry + dispatch slot. All four companion headline markers (virtio-net-tx-authority-bundle,virtio-net-tx-queue-materialization,virtio-net-msix-function-enable, andvirtio-net-tx-submit-doorbell) are intentionally suppressed under this feature because theirtx_descriptor=not-published/notify=not-rung/queue_setup=not-attempted/msix_function_enable=not-toggled/tx_completion=not-claimedclaims would all be inaccurate now, so the polled-completion marker is the sole headline. The marker’s trailing labels (tx_descriptor=published,notify=rung-once,msix_function_enable=toggled-mask-first,msix_function_enable_cleanup=cleared,msix_table_write=not-performed,device_autonomous_raise=not-claimed,tx_completion=polled-used-ring,provider_visible_queue_address=hidden,provider_fill=kernel-proxy-bounded,iova_export=disabled-future-only,live_cloud=not-attempted) re-anchor the bounds the descendant slices (interrupt-dispatch slot claim, realInterrupt.waitwaiter, live cloud) carry. Thecap::virtio_net_tx_polled_completion_proofcaller is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof, andcloud_virtio_net_tx_authority_bundle_proofat thecap::mod.rsactivation site. The marker keepsdevice_autonomous_raise=not-claimedbecause the proof never enables the per-vector MSI-X table entry, never registers anInterrupt.waitwaiter, and observes the completion strictly through the device-authored used-ring update. - QEMU-emulable vs hardware-only: predecessor-only. The active
make run-cloud-provider-virtio-netheadline target switched to the userspace-submit polled-completion proof below, which exercises the same polled-completion path through the userspace cap method and supersedes this kernel-half proxy.
- Spec basis: inherits the submit/doorbell proof’s basis (virtio
1.2 §2.7.6 / §2.7.8 / §4.1.4.4 / §4.1.5.2 / §5.1.6.2 / §2.7.13).
The polled-completion delta uses §2.7.8 (used-ring layout:
-
Production cloud-boot evidence marker (
virtio-net-userspace-provider): under the focused-proof Cargo featurecloud_virtio_net_tx_dmabuffer_live_publish_proof(which impliescloud_virtio_net_tx_polled_completion_proofand transitively the submit/doorbell, MSI-X function-enable, queue-materialization, and authority-bundle proofs so every shared plumbing gate stays compiled in), the cloudboot kernel runscap::virtio_net_tx_dmabuffer_live_publish_proof(kernel/src/cap/virtio_net_tx_dmabuffer_live_publish_proof.rs) over the same virtio-net function the predecessor picks. Unlike the kernel-half polled-completion predecessor, the kernel-side proof here splits the work into two phases: at-bootinit()stages the modern virtio status sequence + TX queue materialization + MSI-X mask-first enable + notify mapping, leaving the device inDRIVER_OKwith MSI-X enabled-but-globally-masked; the per-callattempt_live_publishruns from the non-qemudevice-manager stub’svalidate_dmabuffer_submit_descriptor_admissionwhen the userspacecloud-provider-virtio-net-tx-dmabuffer-live-publish-smokeservice’sDMABuffer.submitDescriptoris admitted (queue == 1, descriptor_id == 0, length <= PAGE_SIZE, no user mapping live, no in-flight submit, kernel-knownDmaBufferHandle). The cap method resolves the buffer’s host-physical bounce-buffer page, authors one TX descriptor + avail-ring entry over the manager-owned ring frames, rings the notify doorbell exactly once, polls the device-authoredused.idxwith the same bounded budget the polled-completion predecessor uses, readsused[0].(id, len), tears the device down (MSI-X mask-first cleanup + device reset + queue-state register read-back asserted to zero, three manager-owned queue frames freed), and emits onecloudboot-evidence: virtio-net-userspace-provider <token>headline marker withprovider_fill=userspace-brokered-bufferanchoring the userspace-driven submit boundary. Exposed throughmake run-cloud-provider-virtio-net– the terminal local harness for the virtio-net userspace-provider chain.- Spec basis: inherits the polled-completion proof’s basis (virtio
1.2 §2.7.6 / §2.7.8 / §4.1.4.4 / §4.1.5.2 / §5.1.6.2 / §2.7.13).
The live-publish delta drives the same descriptor/avail/notify
write sequence and the same bounded
used.idxpoll, but the descriptor’saddrfield is the userspace-allocatedDMABuffer’s host-physical bounce-buffer page resolved through the kernel DMA ledger, not a manager-allocated payload frame. - Implemented wire-format subset: at-boot
init()covers stages 1-14 of the polled-completion sequence (status sequence + queue materialization + mask-first MSI-X enable + notify mapping) and stashes the staged state. Per-callattempt_live_publishcovers stages 15-24: descriptor publish (withdesc[0].addr = payload_physfrom the userspaceDMABuffer), avail-ring entry +avail.idxbump with a release compiler fence, notify doorbell ring, used-ringused.idxbounded poll,used[0]slot read with an acquire compiler fence, MSI-X cleanup, device reset, queue-state register read-back, and queue-frame release. Token grammar addspool.<u>.gen.<u>-buf.<u>.gen.<u>-payload.<hex>-paylen.<u>so the manager-issued single-slot bounce-buffer pool’s slot/generation pair, the buffer’s slot/generation pair, and the resolved payload host-physical address are observable from the marker; the polled-completion marker’sdesc.<hex>field is intentionally not in the live-publish marker because the per-call descriptor write happens after the marker emission window’s boundaries. - Fail-closed assertions: stages 1-14 inherit the polled-completion
proof’s assertions for status/queue/MSI-X bring-up. Stages 15-24
inherit its assertions for descriptor publish, doorbell, polled
completion, MSI-X cleanup, and reset. The per-call admission gate
adds five more, surfaced through the cap-side
DmaBufferSubmitDescriptorAdmissionshape: (1)queue != 1fails closed withdmabuffer-tx-queue-required/non-tx-queue-rejected(RX is rejected explicitly; queue >= 2 trips the standardqueue-out-of-rangerequest gate). (2)descriptor_id != 0fails closed withdescriptor-id-out-of-range. (3)length > PAGE_SIZEfails closed withlength-exceeds-buffer. (4) A live userspace VMA fails closed withdmabuffer-mapping-live(the cap-sideblock_submit_for_live_mappingshort-circuit handles this before the device-manager runs; the stub defends in depth). (5) A secondsubmitDescriptoron the same buffer without an interveningfreeBufferfails closed withdmabuffer-descriptor-already-inflight; the in-flight slot is dropped only when the parked-buffer record drops onfreeBuffer. A post-freeBuffersubmitDescriptorfails closed with the standard stale-handle error. - capOS mapping: terminal local headline that flips the
descriptor
addrsource from a manager-allocated kernel-proxy payload frame to the userspace-allocatedDMABuffer’s host-physical bounce-buffer page resolved through the kernel DMA ledger. The marker’s trailing labelprovider_fill=userspace-brokered-bufferreplaces the kernel-half polled-completion predecessor’sprovider_fill=kernel-proxy-boundedto reflect the change. All four companion headline markers (virtio-net-tx-authority-bundle,virtio-net-tx-queue-materialization,virtio-net-msix-function-enable, andvirtio-net-tx-submit-doorbell) are suppressed because the userspace-submit polled-completion proof is the new headline owner; the predecessorcap::virtio_net_tx_polled_completion_proofmodule is dropped from the compile set under this feature so its competing emission of the samevirtio-net-userspace-providermarker cannot fire. Thecap::virtio_net_tx_dmabuffer_live_publish_proofcaller is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof, andcloud_virtio_net_tx_authority_bundle_proofat thecap::mod.rsactivation site. The marker keepsdevice_autonomous_raise=not-claimed,msix_table_write=not-performed, andlive_cloud=not-attemptedbecause the proof never enables the per-vector MSI-X table entry, never registers anInterrupt.waitwaiter, and observes the completion strictly through the device-authored used-ring update. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by
make run-cloud-provider-virtio-net. No GCE resources are created.
- Spec basis: inherits the polled-completion proof’s basis (virtio
1.2 §2.7.6 / §2.7.8 / §4.1.4.4 / §4.1.5.2 / §5.1.6.2 / §2.7.13).
The live-publish delta drives the same descriptor/avail/notify
write sequence and the same bounded
-
MSI-X wait/ack (
cap::virtio_net_tx_msix_wait_ack_proof): Carries the userspace-submit polled-completion delta one authority step further: same brokered TX submit boundary, but the userspace-observed completion event is the provider-cap-side wake from a kernel-injected dispatch on the bound virtio-net TX MSI-X route. Active under the non-qemucloud kernel built with the Cargo featurecloud_virtio_net_tx_msix_wait_ack_proof(which impliescloud_virtio_net_tx_dmabuffer_live_publish_proofand its predecessors). The wait/ack proof’s at-bootinitruns after the live-publish proof’sinit: it registers + claims an MSI-X route on the same virtio-net BDF under theManagerGrantSourceowner, maps the MSI-X table BAR kernel-side, writes table entryPROOF_TABLE_ENTRY = 1mask-first per PCI 3.0 §6.8.2, attaches the route to the device manager, arms the deferred-LAPIC-EOI gate, and unmasks the route + entry. The PCI function-level MSI-X enable bit stays set with the function mask still asserted (held by the live-publish proof’s mask-first toggle), so the virtio-net device cannot autonomously raise an interrupt on the bound route. The cloudboot manifest spawns thecloud-provider-virtio-net-tx-msix-wait-ack-smokeuserspace service, which receives oneConsole+DeviceMmio+DMAPool+Interruptbundle (theInterruptsource resolves through the wait/ack proof’s grant source, replacinginterrupt_grant_source_produnder this feature), asserts theInterrupt.infoidentity + labels (bootstrap_grant=virtio-net-tx-msix-wait-ack-proof,wait=kernel-injected-dispatch-wait,acknowledge=kernel-injected-deferred-eoi-acknowledge), drives the same brokeredDMABuffer.submitDescriptorchain the predecessor exercises, then callsInterrupt.wait(the cap’sinvoke_waitrunsdevice_interrupt::handle_lapic_deliveryand returns one delivery withdelivery_count_after == delivery_count_before + 1plus one armed deferred LAPIC EOI), callsInterrupt.acknowledge(the cap retires the deferred LAPIC EOI throughacknowledge_deferred_lapic_eoi_for_route,ack_delta == 1,pending_after == 0), frees theDMABuffer, and releases theInterruptcap. The kernel-sideon_releasethen runs the masked-no-wake + reassign + stale-handle assertion chain on the bound route (mirroringcap::provider_cap_waiter_proof’s discipline) and emits exactly onecloudboot-evidence: virtio-net-userspace-provider <token>headline marker combining the publish outcome (recorded inPUBLISH_OUTCOMEby the predecessor’sattempt_live_publishwhen the feature is on) with the wait/ack delivery counts, the reassigned route generation, and the stale-handle / stale-token assertion booleans. The marker’s trailing labels differ from the polled-completion predecessor in two places:tx_completion=msix-wait-ack-injectedreplacestx_completion=polled-used-ring(the userspace-observed completion event is the cap-waiter dispatch; the polled used-ring still runs kernel-side as defence-in-depth), andmsix_table_write=performed-masked-firstreplacesmsix_table_write=not-performed(the wait/ack proof’sinitprogrammed one MSI-X table entry). All other discipline labels are preserved:device_autonomous_raise=not-claimed,provider_visible_queue_address=hidden,provider_fill=userspace-brokered-buffer,iova_export=disabled-future-only,live_cloud=not-attempted. The predecessor live-publish proof’s standalone marker emission is suppressed under this feature so the headline marker name cannot fire twice. Thecap::virtio_net_tx_msix_wait_ack_proofactivation site is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof, andcloud_virtio_net_tx_authority_bundle_proof. Device-autonomous MSI-X delivery (programming the virtio queue’squeue_msix_vectorfor a hardware-raised TX completion interrupt and the broader production dispatch-slot proof), RX path, multi-queue operation, full NIC readiness, and any live-GCE evidence stay out of scope.- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by
make run-cloud-provider-virtio-net-tx-msix-wait-ack. No GCE resources are created.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by
-
RX MSI-X wait/ack (
cap::virtio_net_rx_msix_wait_ack_proof): the RX analogue of the TX MSI-X wait/ack proof above. Active under the non-qemucloud kernel built with the Cargo featurecloud_virtio_net_rx_msix_wait_ack_proof(which impliescloud_virtio_net_rx_buffer_post_polled_completion_proofand its predecessors). The RX completion is staged entirely kernel-side at boot: the RX buffer-post proof’sreport()stages the device throughDRIVER_OK, posts one manager-owned device-writable RX buffer, drives the ARP TX SLIRP stimulus, polls one real device->host RX DMA (used.idx == 1,used[0].len > 0), and – under this feature – additionally holds the PCI function-level MSI-X enable mask-first (hold_msix_function_enable_mask_first:FUNCTION_MASK = 1thenENABLE = 1, held, not cleaned up) and records the publish outcome into the wait/ack proof’sPUBLISH_OUTCOMEslot instead of emitting its standalonevirtio-net-rx-buffer-postheadline. The wait/ack proof’s at-bootinitthen drives the graduated always-builtcap::interrupt_programmed::program_attach_arm_unmaskover MSI-X table entry 0 (the RX queue’s per-queue config vector, virtio-pci §4.1.5.1.2) on the same virtio-net BDF under theManagerGrantSourceowner: register + claim + write table entry 0 mask-first per PCI 3.0 §6.8.2 + manager attach + deferred-LAPIC-EOI arm + route + entry unmask, tearing the route back down viateardownon any error or lost-init race. The device’s RXqueue_msix_vectorstaysVIRTIO_MSI_NO_VECTORand the function mask stays asserted, so the device cannot autonomously raise an interrupt on the bound route. The cloudboot manifest spawns thecloud-provider-virtio-net-rx-msix-wait-ack-smokeuserspace service, which receives oneConsole+Interruptbundle (the RX completion is staged kernel-side, so – unlike the TX wait/ack provider – it needs noDMAPool/DeviceMmiocap; theInterruptsource resolves through the wait/ack proof’s grant source, replacinginterrupt_grant_source_produnder this feature), asserts theInterrupt.infoidentity + labels (bootstrap_grant=virtio-net-rx-msix-wait-ack-proof,wait=kernel-injected-dispatch-wait,acknowledge=kernel-injected-deferred-eoi-acknowledge), callsInterrupt.wait(the cap’sinvoke_waitruns the graduateddevice_interrupt::wait_kernel_injected_dispatchand returns one delivery withdelivery_count_after == delivery_count_before + 1plus one armed deferred LAPIC EOI), callsInterrupt.acknowledge(the cap retires the deferred LAPIC EOI throughacknowledge_deferred_lapic_eoi_for_route,ack_delta == 1,pending_after == 0), and releases theInterruptcap. The kernel-sideon_releasethen runs the masked-no-wake + reassign + stale-handle assertion chain on the bound RX route and emits exactly onecloudboot-evidence: virtio-net-userspace-provider <token>headline marker combining the RX publish outcome with the wait/ack delivery counts, the reassigned route generation, and the stale-handle / stale-token assertion booleans. The marker’s trailing labels differ from the RX buffer-post predecessor in three places:rx_completion=msix-wait-ack-injectedreplacesrx_completion=polled-used-ring(the userspace-observed completion event is the cap-waiter dispatch; the polled used-ring still runs kernel-side as defence-in-depth),msix_rx_function_enable=toggled-mask-firstreplacesmsix_rx_function_enable=not-toggled(the staging now holds the function-level MSI-X enable mask-first), andmsix_table_write=performed-masked-firstreplacesmsix_table_write=not-performed(the wait/ack proof’sinitprogrammed one MSI-X table entry). All other discipline labels are preserved:device_autonomous_raise=not-claimed,provider_visible_queue_address=hidden,provider_rx_submit=kernel-proxy-bounded,iova_export=disabled-future-only,live_cloud=not-attempted. The predecessor RX buffer-post proof’s standalone marker emission is suppressed under this feature so the headline marker name cannot fire twice. Thecap::virtio_net_rx_msix_wait_ack_proofactivation site is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof,cloud_virtio_net_tx_authority_bundle_proof, and every TX/NVMe Interrupt-source proof feature. Device-autonomous RX MSI-X delivery (programming the virtio queue’s RXqueue_msix_vector), provider-driven RX submit, multi-queue operation, full NIC readiness, and any live-GCE evidence stay out of scope.- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by
make run-cloud-provider-virtio-net-rx-msix-wait-ack. No GCE resources are created.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by
-
RX provider-driven buffer submit (
cap::virtio_net_rx_userspace_submit_proof): the RX analogue of the TX DMABuffer live-publish proof, carried one authority step past the RX MSI-X wait/ack proof above. Active under the non-qemucloud kernel built with the Cargo featurecloud_virtio_net_rx_userspace_submit_proof(which impliescloud_virtio_net_rx_buffer_post_polled_completion_proofand its predecessors). Unlike the RX MSI-X wait/ack proof, the RX receive buffer is no longer a manager-owned bounce page filled kernel-side: it is the userspace provider’s brokeredDMABuffer, posted to the RX avail ring throughDMABuffer.submitDescriptor(queue=0). The feature drops the RX buffer-post module’s at-boot kernel-proxyreport(); instead this proof’s self-containedinitstages the device (status sequence + RX queue 0 + TX queue 1 materialization + held mask-first MSI-X function enable + notify map), allocates NO RX payload frame, and programs the kernel-injected RX MSI-X route over table entry 0 through the graduatedcap::interrupt_programmed::program_attach_arm_unmasksurface (same as the wait/ack proof). The cloudboot manifest spawns thecloud-provider-virtio-net-rx-userspace-submit-smokeuserspace service, which receives oneConsole+DeviceMmio+DMAPool+Interruptbundle (the RX provider, unlike the kernel-proxy RX wait/ack provider, needs theDMAPool/DeviceMmiocaps to allocate and submit its brokeredDMABuffer). The provider assertsInterrupt.infoidentity + labels (bootstrap_grant=virtio-net-rx-userspace-submit-proof), allocates one brokered bounce-bufferDMABuffer(NOT mapped or written before submit – the device is the RX writer), and callsDMABuffer.submitDescriptor(queue=0, descriptor_id=0, length=2048). The non-qemudevice-manager admission gate matches the parked bounce-buffer handle, validates the request shape (queue == 0, descriptor_id == 0, length <= PAGE_SIZE, no live user mapping, no in-flight submit), resolves the buffer’s kernel-known host-physical bounce-buffer page, and drivesattempt_rx_submit: it authors the RXdesc[0] = (provider_buffer_phys, length, flags=VIRTQ_DESC_F_WRITE, next=0)+ avail-ring entry over the manager-owned RX ring frames, rings the RX notify doorbell once, drives the ARP TX SLIRP stimulus (kernel-half), polls one real device->host RX DMA (used.idx == 1,used[0].len > 0), reads the observed EtherType, resets the device, and frees the manager-owned RX/TX ring + TX payload frames. The provider then observes the completion throughInterrupt.wait(kernel-injected dispatch,delivery_count_after == delivery_count_before + 1) andInterrupt.acknowledge(deferred LAPIC EOI retired,ack_delta == 1), re-maps itsDMABufferR/O and reads a non-zero received EtherType through its own mapping, unmaps, frees the buffer, and releases theInterruptcap. The kernel-sideon_releaseruns the masked-no-wake + reassign + stale-handle assertion chain on the bound RX route and emits exactly onecloudboot-evidence: virtio-net-userspace-provider <token>headline marker combining the RX publish outcome with the wait/ack delivery counts.- Spec basis: inherits the RX buffer-post / MSI-X wait/ack basis (virtio
1.2 §2.7.6 / §2.7.8 / §4.1.5.2 / §5.1.6, virtio-pci §4.1.5.1.2,
PCI 3.0 §6.8.2). The userspace-submit delta drives the same RX
descriptor/avail/notify write sequence and the same bounded
used.idxpoll, but the descriptor’saddrfield is the userspace-allocatedDMABuffer’s host-physical bounce-buffer page resolved through the kernel DMA ledger, not a manager-allocated payload frame. - Implemented wire-format subset: at-boot
init()covers the status sequence + RX/TX queue materialization + mask-first MSI-X function enable + notify mapping + RX MSI-X route program. Per-callattempt_rx_submitcovers the RX descriptor publish (desc[0].addr = payload_physfrom the userspaceDMABuffer,flags = VIRTQ_DESC_F_WRITE), avail-ring entry +avail.idxbump with a release compiler fence, RX notify doorbell ring, the ARP TX stimulus, the used-ringused.idxbounded poll,used[0]slot read with an acquire compiler fence, the observed EtherType read, device reset, queue-state register read-back, and queue-frame release. Token grammar replaces the wait/ack marker’srxpay.<hex>field withpool.<u>.gen.<u>-buf.<u>.gen.<u>-payload.<hex>-rxlen.<u>so the manager-issued single-slot bounce-buffer pool’s slot/generation pair, the buffer’s slot/generation pair, the resolved payload host-physical address, and the requested receive length are observable from the marker. - Fail-closed assertions: inherits the RX buffer-post proof’s assertions for
status/queue/MSI-X bring-up and the RX descriptor publish + ARP stimulus +
polled completion + reset, and the wait/ack proof’s masked-no-wake +
reassign + stale-handle / stale-token assertion chain. The per-call
admission gate adds the
DMABuffer.submitDescriptorrequest-shape checks surfaced through the cap-sideDmaBufferSubmitDescriptorAdmissionshape:queue != 0fails closed withdmabuffer-rx-queue-required/non-rx-queue-rejected(TX is rejected explicitly; queue >= 2 trips the standardqueue-out-of-rangerequest gate),descriptor_id != 0fails closed withdescriptor-id-out-of-range,length > PAGE_SIZEfails closed withlength-exceeds-buffer, a live userspace VMA fails closed withdmabuffer-mapping-live, and a secondsubmitDescriptorwithout an interveningfreeBufferfails closed withdmabuffer-descriptor-already-inflight. The marker’s trailing labels flipprovider_rx_submitfromkernel-proxy-boundedtouserspace-brokered-bufferand addhost_physical_user_visible=0/direct_dma=blocked; the RX device-write DMA discipline (device_autonomous_raise=not-claimed,provider_visible_queue_address=hidden,iova_export=disabled-future-only,live_cloud=not-attempted) is preserved. Teardown confirms a device reset BEFORE the provider’sfreeBufferscrubs/frees the brokered buffer page. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-provider-virtio-net-rx-userspace-submit. No GCE resources are created. Device-autonomous RX MSI-X delivery (programming the virtio queue’s RXqueue_msix_vector), multi-queue operation, full NIC readiness, and any live-GCE evidence stay out of scope.
- Spec basis: inherits the RX buffer-post / MSI-X wait/ack basis (virtio
1.2 §2.7.6 / §2.7.8 / §4.1.5.2 / §5.1.6, virtio-pci §4.1.5.1.2,
PCI 3.0 §6.8.2). The userspace-submit delta drives the same RX
descriptor/avail/notify write sequence and the same bounded
-
RX production-IDT-dispatch waiter wake (
cap::virtio_net_rx_production_idt_dispatch_proof): carries the RX userspace-submit proof one authority step further. Active under the non-qemucloud kernel built with the Cargo featurecloud_virtio_net_rx_production_idt_dispatch_proof(which impliescloud_virtio_net_rx_userspace_submit_proofand its predecessors). The RX publish half – device staging, provider-submitted brokered receive buffer, SLIRP stimulus, one real device->host RX DMA, polledused.idx– is reused unchanged from the userspace-submit predecessor (its module is dropped and this proof becomes the new headline owner; the device-manager admission routesattempt_rx_submithere). The load-bearing change is the production IDT dispatch wiring: the non-qemuarch::x86_64::lapic::handle_device_interruptarm previously discarded real device-MSI vectors with a bareeoi(), so a real interrupt-gate entry could never reach a deferred-EOI dispatch slot or wake anInterrupt.wait. This proof wires that arm to record an IDT handler entry and route the vector throughdevice_interrupt::handle_lapic_delivery, honoringeoi_deferred(the deferred-EOI path owns the EOI write, retired byacknowledge) and keeping the bareeoi()fallback for unregistered/out-of-pool vectors. TheInterrupt.waitcap method then fires ONE realINT $vectoron the bound RX route’s vector (IF cleared – the syscall context runs IF-cleared by SFMASK design andINT nignores IF; see the Fail-closed assertions bullet below) – graduating the qemu-onlyarch::lapic::inject_real_lapic_int_for_proofmechanic to this proof feature – so the waiter wakes through a real CPU interrupt-gate entry, not the synchronousdevice_interrupt::wait_kernel_injected_dispatchcall every prior RX/cap-waiter proof used.- Spec basis: Intel SDM Vol. 3 interrupt-gate semantics (an interrupt gate
clears
EFLAGS.IFon entry) and Vol. 2INT ndescription (which ignoresEFLAGS.IF); inherits the RX userspace-submit basis (virtio 1.2 §2.7.6 / §2.7.8 / §4.1.5.2 / §5.1.6, virtio-pci §4.1.5.1.2, PCI 3.0 §6.8.2) for the unchanged publish half. - Implemented wire-format subset: identical to the userspace-submit proof
for the publish half. The new surface is kernel interrupt-path wiring (no
new device wire-format): the production
handle_device_interruptnon-qemuarm (kernel/src/arch/x86_64/lapic.rs), a per-vector IDT handler-entry counter (device_interrupt::record_idt_handler_entry/idt_handler_entry_count), and the graduatedinject_real_lapic_int_for_proof. The device’s RXqueue_msix_vectorstaysVIRTIO_MSI_NO_VECTORand the PCI function mask stays held; theINTis fired by this proof, NOT by the device. - Fail-closed assertions: inherits the userspace-submit proof’s publish +
masked-no-wake + reassign + stale-handle / stale-token chain, and adds:
waitassertsdelivery_count_after == delivery_count_before + 1, the per-vector IDT handler-entry count advanced by exactly one (idt_handler_observed), the real-delivery delta equals the IDT-entry delta (direct_dispatch_call_count_unchanged, i.e. no fallback synchronous dispatch was used), and one deferred LAPIC EOI is pending; the masked-route assertion now fires a realINTthrough the masked route and asserts NOdelivery_countadvance and NO deferred-EOI pending/ack change. The cap-dispatch syscall context runs withEFLAGS.IFcleared by SFMASK design (arch::x86_64::syscall) andINT nignores IF, soint_fired_with_ifis recorded as observed evidence only (false in this build) and is NOT a gating condition. The headline marker flipsrx_completiontoreal-idt-interrupt-gate-wakeand addswaiter_wake=real-idt-interrupt-gate,idt_dispatch=production-wired, plus the trailing-idthandler.1-directcall.1-iffired.<0|1>-maskedint.1token booleans;device_autonomous_raise=not-claimedandlive_cloud=not-attemptedare preserved. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-provider-virtio-net-rx-production-idt-dispatch. No GCE resources are created. Flipping the device’s RXqueue_msix_vector+ clearing the function mask so the DEVICE raises the MSI – reusing this proof’s now-proven production dispatch path – is now covered by the device-autonomous MSI-X proof below. Live-GCE RX evidence remains future work.
- Spec basis: Intel SDM Vol. 3 interrupt-gate semantics (an interrupt gate
clears
-
RX device-autonomous MSI-X delivery proof (
cap::virtio_net_rx_device_autonomous_msix_proof): carries the production-IDT-dispatch proof one authority step further and proves a device-raised virtio-net RX MSI-X reaches the production IDT path under local QEMU/KVM. Active under the non-qemucloud kernel built withcloud_virtio_net_rx_device_autonomous_msix_proof(which impliescloud_virtio_net_rx_userspace_submit_proofand reuses the same brokered RX publish path). The module enables PCI memory-space decoding and bus mastering, stages RX queue 0 and TX queue 1, enables MSI-X function-level control mask-first, programs RX queue 0COMMON_QUEUE_MSIX_VECTOR = 0, programs MSI-X table entry 0 throughcap::interrupt_programmed::program_attach_arm_unmask, clears the PCI function mask, and then submits one userspace-owned RX bounce buffer plus the ARP TX stimulus. The RX DMA succeeds (used[0].len > 0, observed EtherType0x0806), proving the data path remains the same brokered provider path.- Spec basis: virtio 1.2 §4.1.5.1.2 (modern per-queue MSI-X vector is the MSI-X table entry index), virtio 1.2 §2.7.6 / §2.7.8 / §5.1.6 for the RX descriptor/avail/used path, and PCI 3.0 §6.8.2 for MSI-X table entry and Message Control masking semantics.
- Implemented wire-format subset: the proof writes only the PCI COMMAND memory-space/bus-master bits, the RX queue config-vector selector, the same split-ring RX/TX descriptors and avail entries as the userspace-submit proof, one RX and one TX notify, one MSI-X table entry, and the PCI MSI-X function mask bit. It does not expose host-physical or IOVA addresses to userspace, does not program an IOMMU, and does not add multi-queue or full-NIC readiness.
- Proof assertions:
make run-cloud-provider-virtio-net-rx-device-autonomous-msixnow assertspci_command=0x0107, one device-raisedInterrupt.waitdelivery on vector0x50withint_injected=0,delivery_count_before=0,delivery_count_after=1,idt_handler_observed=true,eoi_deferred=true, and one deferred-EOIInterrupt.acknowledge(ack_delta=1). The finalcloudboot-evidence: virtio-net-userspace-providermarker includespcicmd.0107,idthandler.1,directcall.1,devraise.1,intinjected.0, andrx_completion=device-autonomous-msix. Closeout validation also keeps the RX production-IDT dispatch, RX userspace-submit, provider cap-waiter,run-net, and default boot-smoke gates green under local QEMU. - QEMU/KVM diagnosis: earlier bpftrace evidence showed QEMU reached
msix_notify(vector=0)with an unmasked MSI-X entry and prepared0xfee00000/0x50, but KVM did not accept vector0x50. The missing precondition was explicit PCI COMMAND bus-master enablement in this proof path; after the proof enables memory-space decoding + bus mastering, local QEMU/KVM delivers the MSI-X to the guest IDT path.
-
RX polled used-ring completion (no injected dispatch) (
cap::virtio_net_rx_polled_completion_proof): the first virtio-net proof whose RX completion signal is real driver progress, not an injected proxy. Active under the non-qemucloud kernel built with the Cargo featurecloud_virtio_net_rx_polled_completion_proof(which impliescloud_virtio_net_rx_userspace_submit_proofand its predecessors). The RX publish half – device staging, provider-submitted brokered receive buffer, SLIRP stimulus, one real device->host RX DMA, polledused.idx– is reused unchanged from the userspace-submit predecessor (its module is dropped and this proof becomes the new headline owner; the device-manager admission routesattempt_rx_submithere). The load-bearing change is on the completion path: every prior virtio-net/cap-waiter proof signalled theInterrupt.waitcompletion throughdevice_interrupt::wait_kernel_injected_dispatch(a kernel-side dispatch-slot proxy) or, in the IDT-dispatch proof, a firedINT $vector– neither produced by real driver progress. Herevirtio_net_rx_polled_completion_proof::invoke_waitinstead reports the completion from the already-latched polled used-ring state captured duringattempt_rx_submit(thePublishedRxused_id == 0/used_len > 0/polled_used_idx >= POLL_TARGET_USED_IDX, latched from the predecessor’s reusedpoll_used_idxunder itsAcquirefence): there is NOwait_kernel_injected_dispatchcall and NOinject_real_lapic_int_for_proofanywhere in the wait/ack path, and zero kernel-injected interrupts.invoke_acknowledgeis a poll-confirmation no-op (no deferred LAPIC EOI to retire, since no interrupt was taken). The bound RX MSI-X route is still programmed at boot but is used ONLY by the release-time masked-no-wake/stale-handle assertion chain.- Spec basis: virtio 1.2 §2.7.8 (used ring is a memory-visible structure the
device advances) and §2.7.10 (the
VIRTQ_AVAIL_F_NO_INTERRUPTdriver flag the predecessor already sets, so the device performs no MSI either); inherits the RX userspace-submit basis (virtio 1.2 §2.7.6 / §4.1.5.2 / §5.1.6, virtio-pci §4.1.5.1.2, PCI 3.0 §6.8.2) for the unchanged publish half. - Implemented wire-format subset: identical to the userspace-submit proof
for the publish half; no new device wire-format and no new kernel
interrupt-path wiring. The completion is a pure memory read of the latched
used-ring state plus a
device_interrupt::snapshot_dispatch_slotbefore/afterdelivery_countcomparison. - Fail-closed assertions: inherits the userspace-submit proof’s publish +
masked-no-wake + reassign + stale-handle / stale-token chain, and replaces
the wait/ack injection assertions with their polled inverse:
waitasserts the latchedused_id == 0/used_len > 0/polled_used_idx >= POLL_TARGET_USED_IDX(completion_observed) ANDdelivery_count_after == delivery_count_before(int_injected=0, no kernel dispatch advanced);acknowledgeasserts no deferred LAPIC EOI was pending and none was retired (hardware_dispatch_ack_delta == 0,eoi_written=false); andon_releaserequiresprovider_observed_dispatch == 0andprovider_observed_ack == 0(the inverse of the injected predecessor’s>= 1). The headline marker flipsrx_completiontopolled-used-ringand addswaiter_wake=polled-used-ring,int_injected=0, with the trailing-deliv.0-ack.0token booleans;device_autonomous_raise=not-claimedandlive_cloud=not-attemptedare preserved. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-provider-virtio-net-rx-polled-completion. No GCE resources are created. Graduating this polled provider off the per-proof feature onto the defaultsystem.cuecloudboot manifest, programming the device’s RXqueue_msix_vectorfor device-autonomous delivery, and any live-GCE RX evidence are future work.
- Spec basis: virtio 1.2 §2.7.8 (used ring is a memory-visible structure the
device advances) and §2.7.10 (the
-
Polled RX+TX provider, always-built off the per-proof feature (
cap::virtio_net_polled_provider): graduates the polled provider above into the production compile set. The module is always-built in the default non-qemucloud kernel (cfg(not(feature = "qemu")), nocloud_*_prooffeature), derived fromcap::virtio_net_rx_polled_completion_proofwith the proof gate removed and the feature-gatedvirtio_net_tx_authority_bundle_proofbundle-observer calls dropped (the per-grant identity is still recorded through the always-builthardware_auditcap-audit). The polled completion behaviour (read the latchedpoll_used_idxused-ring state ininvoke_wait, nowait_kernel_injected_dispatch, noinject_real_lapic_int_for_proof, no-opinvoke_acknowledge) is identical; only the activation switch changes from a Cargo feature to a manifest-observable condition.kernel::run_initcallsvirtio_net_polled_provider::initonly when the booted manifest declares thecloud-provider-virtio-net-polled-provider-default-smokebinary, so on the literalsystem.cue,run-cloud-interrupt-grant, and every other default cloudboot manifest the provider is never staged (is_staged()==false) and is inert. Theinterruptcap is granted through the unchanged productioninterrupt_grant_source_prod(no newKernelCapSourcearm, no proof-only grant-source replacement); that source delegates its cap tovirtio_net_polled_provider::build_cap_for_grantwhile the provider is staged, otherwise it keeps its admission-check-only skeleton. The always-builtdevice_manager::stubsubmit-admission preview and accepted path admit RX queue 0 and routeDMABuffer.submitDescriptortovirtio_net_polled_provider::attempt_rx_submitonly while staged.- Marker: emits a DISTINCT
cloudboot-evidence: virtio-net-polled-provider <token>headline (vs the proof’svirtio-net-userspace-provider) so the two manifests are distinguishable, adding the labelsprovider_build=always-built-default-kernel,provider_feature_gate=none, andgrant_source=production-despecializedto the polled-completion label set. Thecap::provider_nic_bind_proof::reportre-point landed (theprovider-nic-boundmarker above now fires from this provider’s real polled TX+RX completion viareport_real_completionon thecloud-provider-nic-bound-real-polled-driver-smokemanifest); the literalsystem.cuefold is not yet implemented. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-provider-virtio-net-polled-provider-defaulton the default non-qemukernel with nocloud_*_prooffeature. No GCE resources are created;live_cloud=not-attempted.
- Marker: emits a DISTINCT
-
Polled provider teardown / stale-authority (clean cap-op-release) – the always-built polled provider now carries an asserted S.11.2 teardown/stale-authority chain over its DMA + MMIO + IRQ authority on the clean cap-op-release path, not only the IRQ route. When the dedicated teardown manifest is booted (
run_initcallsvirtio_net_polled_provider::arm_teardown_reportbecause thecloud-provider-virtio-net-polled-teardown-smokebinary is declared), the provider’scomplete_after_release(kernel/src/cap/virtio_net_polled_provider.rs,run_teardown_assertions+emit_teardown_evidence) re-validates the brokered DMA + DeviceMmio authority the smoke released before theInterruptcap and emits one combinedcloudboot-evidence: virtio-net-polled-teardown <token>headline.- Mechanisms reused:
device_manager::validate_dmabuffer_record/validate_dmapool_record(stale DMA handle / stale pool-allocate rejected fail-closed),device_manager::last_bounce_page_release_evidence(the scrub-before-free / ledger-removed ordering stamped bydetach_dmabuffer_record_for_cap_release),device_manager::validate_devicemmio_recordoverdevicemmio_grant_source_prod::last_issued_handle_and_owner(the grantedDeviceMmiocap’s record is detached on release, so a stale access fails closed), and the inheriteddevice_interruptmasked-no-wake / reassign / stale-handle chain folded into the same marker. - Marker labels:
stale_dma_buffer_blocked=true,dma_page_scrubbed_before_free=true,dma_ledger_removed_after_scrub=true,stale_dma_pool_alloc_blocked=true,stale_mmio_blocked=true,mmio_handle_invalidated=true,masked_no_wake=true,reassign_generation_bumped=true,stale_token_wake_blocked=true,stale_route_handle_blocked=true,int_injected=0,host_physical_user_visible=0,direct_dma=blocked,iova_export=disabled-future-only. The marker is suppressed fail-closed if any leg regresses. The MMIO leg rides the grantedDeviceMmiocap; the provider’s ownpci::map_bar_regionBAR mapping stays boot-only with no kernel invalidation API by design. - Scope boundary: the clean cap-op-release marker carries
driver_death_teardown=not-attempted-this-slice. The process-exit-under-active-authority teardown trigger is its own proof (see the next entry); it crosses the process-lifecycle authority boundary, and the single-shot provider (oneInterruptcap, single-shotattempt_rx_submit) cannot drive both teardown paths in one boot, so it has its own manifest/boot. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-provider-virtio-net-polled-teardownon the default non-qemukernel with nocloud_*_prooffeature.live_cloud=not-attempted.
- Mechanisms reused:
-
Polled provider DRIVER-DEATH / process-exit teardown – the always-built polled provider’s release-time teardown chain now also covers the process-exit-under-active-authority trigger. When the dedicated driver-death manifest is booted (
run_initcallsvirtio_net_polled_provider::arm_driver_death_reportbecause thecloud-provider-virtio-net-polled-driver-death-smokebinary is declared), the smoke drives the same real polled RX submit/wait/ack + RX read-back, scrubs+frees itsDMABuffer, and then exits while still holding itsDMAPool,DeviceMmio, andInterruptcaps. The kernel’sCapReleaseReason::ProcessExitcap-teardown reclaims all three in cap-table slot order (device_mmiothendmapoolbefore theinterruptcap), so the provider’scomplete_after_releaseprocess-exit arm (run_teardown_assertions+emit_driver_death_evidence) re-validates the now-stale DMA + DeviceMmio authority and runs the IRQ masked-no-wake / reassign / stale-handle chain over the route, emitting onecloudboot-evidence: virtio-net-polled-driver-death <token>headline.- Mechanisms reused: identical to the clean cap-op-release entry above, but
the DMAPool / DeviceMmio / Interrupt records are detached by the kernel’s
ProcessExitcap-teardown rather than explicitInterrupt.release/DMAPool.release/DeviceMmio.releasecalls. The runtime-allocatedDMABuffercap is torn down AFTER the manifest-grantedInterruptcap in slot order, so its page is scrubbed+freed by the smoke’sfreeBufferbefore exit (the normal DMABuffer lifecycle); the buffer is then re-validated stale with its scrub-before-free ordering intact. - Marker labels: same DMA / MMIO / IRQ / no-export discipline labels as the
clean-release marker, plus
driver_death_teardown=no-live-authorityandrelease_path=process-exit. The marker is suppressed fail-closed if any leg regresses (polled_completion_clean, the DMA/MMIO stale re-validation, or the IRQ chain). The clean-releasevirtio-net-polled-teardownheadline and the cap-op-release-gatedvirtio-net-polled-providerheadline both stay absent on this manifest (theInterruptcap is reclaimed viaProcessExit, not an explicit release). - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-provider-virtio-net-polled-driver-deathon the default non-qemukernel with nocloud_*_prooffeature.live_cloud=not-attempted.
- Mechanisms reused: identical to the clean cap-op-release entry above, but
the DMAPool / DeviceMmio / Interrupt records are detached by the kernel’s
-
Provider-chain closeout: the parent
cloud-prod-virtio-net-userspace-provider-local-proofis closed by the decomposed child chain above and the legacy/transitional bind below. The local non-qemucloudboot/QEMU evidence now includes modern TX/RX userspace-provider proofs, the always-built polled provider, real-polled-driverprovider-nic-bound, clean-release/process-exit stale-authority proofs, and the legacy-polled path that later passed the real-GCEprovider-nic-boundgate. This closeout does not claim L4 socket/smoltcp relocation, literalsystem.cueprovider fold, reusable full-NIC/multiqueue readiness, or device-autonomous MSI-X delivery; those remain separate lanes.
4. Legacy / transitional virtio 0.9 PIO transport (cloud bind)
Everything above is the modern (virtio 1.x) transport: vendor capability
windows in MMIO BARs + MSI-X. Real GCE presents the NIC as a legacy /
transitional virtio 0.9 device instead (run 1780377997-281b, 2026-06-02):
PCI 1af4:1000, no modern vendor capability windows, no usable MMIO
memory BAR, legacy INTx, no MSI-X. The whole legacy virtio config block
lives in a PIO (I/O space) BAR0 register window, which the modern transport
discovery cannot represent, so the modern polled provider selects no candidate.
This section maps the legacy PIO transport subset the kernel implements to
bind that device. It has two parts: selection + brokered PIO config
(kernel/src/cap/virtio_net_legacy_select_proof.rs) and the
legacy single-PFN contiguous-queue polled TX/RX data path + provider-nic-bound
(kernel/src/cap/virtio_net_legacy_datapath_proof.rs); both are implemented and
proved locally.
- Spec basis: Virtual I/O Device (VIRTIO) legacy interface — the
pre-1.0 “Virtio PCI” I/O-BAR register layout (virtio 0.9.5 / the legacy
appendix of the OASIS 1.x spec, §4.1.4.8 “Legacy Interfaces”). Cross-checked
against QEMU
hw/virtio/virtio-pci.c(virtio_pci_config_*legacy I/O ops) and the Linuxvirtio_pci_legacydriver. - Implemented wire-format subset (no-MSI-X legacy I/O register block,
kernel/src/cap/virtio_net_legacy_select_proof.rsLEGACY_*offset constants): device features (0x00, u32 RO), guest/driver features (0x04, u32 RW), queue PFN (0x08, used by the data path), queue size (0x0c, u16 RO), queue select (0x0e, u16 RW), queue notify (0x10, u16 RW), device status (0x12, u8 RW), ISR status (0x13, u8 RO). The device-specific config (MAC, …) follows at0x14in the no-MSI-X layout (offsets shift by 4 only when MSI-X is enabled, which this polled path never does). Feature negotiation is the 32-bit legacy feature word:VIRTIO_F_VERSION_1(a high feature bit) is unrepresentable and absent, andVIRTIO_NET_F_MAC(1 << 5) is required and acknowledged. The legacy ring uses the single-PFN contiguous virtqueue layout (descriptor table + avail ring + padding toVIRTIO_PCI_VRING_ALIGN(4096) + used ring, addressed by one page-frame number written toLEGACY_QUEUE_PFNasphysical_address >> 12) — materialized by the data path (virtio_net_legacy_datapath_proof::materialize_queue). - Legacy single-PFN data path
(
kernel/src/cap/virtio_net_legacy_datapath_proof.rs): after the status + feature handshake, both virtqueues are materialized as single physically contiguous, page-aligned regions (frame::alloc_contiguous) whose descriptor/avail/used sub-addresses are computed from the contiguous base, so the in-ring desc/avail/used manipulation reuses the modern provider’s helpers (write_desc_slot_0,write_avail_*,poll_used_idx,read_used_ring_slot_0) unchanged — only the transport differs. The doorbell is a PIO write of the queue index toLEGACY_QUEUE_NOTIFY(no modern MMIO notify region). The virtio-net header is the 10-byte legacy header (noVIRTIO_NET_F_MRG_RXBUF; the modern path uses 12). The reset is polled with a bounded settle (real legacy hardware acknowledges reset asynchronously; QEMU clears synchronously) rather than a single-shot== 0. A real polled TX (queue 1) + RX (queue 0) completes by reading the used rings (no MSI-X route programmed, no interrupt taken or injected); the device is then reset clean and all DMA frames are freed. - Device-fixed queue size + contiguous-allocation bound
(
materialize_queue,MAX_LEGACY_QUEUE_SIZE,vring_layout): legacy virtio queue size is device-fixed and read-only (LEGACY_QUEUE_NUM,0x0c) — the driver cannot shrink it, so it must materialize whatever single-PFN vring the device advertises.materialize_queuereads that size and rejects a zero, non-power-of-two, or over-bound value cleanly (the vring layout requires a power-of-two size; the bound caps the contiguous allocation), then sizes the contiguous region viavring_layout: 256 → 3 pages, 1024 → 8 pages, and the live GCE Andromeda virtio-net’s 4096-entry queue → ~28 pages per queue.MAX_LEGACY_QUEUE_SIZEis the virtio spec maximum (32768, a power of two), so the bound admits any spec-legal device-fixed size — including GCE’s 4096 — while still failing closed above it; analloc_contiguousthat cannot satisfy the request fails closed (no panic) on the existingalloc … contiguous frames … failedarm. QEMU’s legacy virtio-net advertises 256 by default and caps queue size at 1024 (VIRTQUEUE_MAX_SIZE), and lockstx_queue_sizeat 256 for the non-vhost SLIRP device, so the largest local shape isrx_queue_size=1024(an 8-page RX vring, exercised bymake run-cloud-provider-nic-bound-legacy-large-queue); the exact 4096-entry materialization is only verifiable on real GCE (the billable live-GCE run). - GCE-viable RX stimulus / completion (
fill_dhcp_discover_legacy,read_device_mac,poll_rx_used_wall_clock): the TX stimulus is a broadcast DHCP DISCOVER sourced from the device’s real MAC (read from legacy device-config space at0x14), not the modern path’s ARP “who-has 10.0.2.2” from a hardcoded spoofed source. This is required for a real cloud NIC: GCE’s Andromeda SDN enforces MAC/IP anti-spoofing (egress from a non-assigned MAC/IP is dropped),10.0.2.2does not exist on the VPC, and no responder answers an ARP-for-the-gateway. A legitimately-sourced DHCP DISCOVER is answered by both QEMU SLIRP’s built-in DHCP server and the GCE SDN DHCP responder, giving a real device->host RX frame. The completion model is accept-any inbound frame (any non-empty frame with a readable EtherType satisfies RX, so an ambient gateway ARP/RA on GCE counts too), polled against a wall-clock budget (monotonic_nsdeadline, 5 s) rather than a fixed spin count sized for SLIRP’s instantaneous reply. Interrupts are masked during this boot-time proof, so the wall-clock budget relies on the TSC-calibrated clocksource (the QEMU and GCE case); a tick-derived clock is frozen here and a fixed iteration ceiling is the fail-closed backstop. The egress MAC is re-asserted non-zero / non-broadcast before the marker is emitted, and the marker token carries it (-srcmac.<12hex>). - Persistent legacy
Nic-cap runtime (virtio_net_legacy_datapath_proof::legacy_nic_runtime, kernel featurecloud_gce_legacy_virtio_webui_serving_proof): unlike the one-shot proofs, this runtime brings the legacy device up once at boot and keeps itDRIVER_OKfor the whole boot, backing the same typedNiccap methods the modern shim path serves (transmit @0,macAddress @2,linkStatus @3,receivePoll @4;receive @1fails closed). RX keeps a small posted buffer pool (RX_POOL_SIZEdescriptors, recycled in place after each copy-out);receivePollis non-blocking and compares the device-writtenused.idxagainst a consumed cursor (read_used_idx), so a frame burst that advances the index pastcursor + 1is drained one completion per call instead of being missed by an equality poll. TX publishes one frame at descriptor slot 0 and drains its completion with the same bounded advanced-past-cursor check; an unresolved or divergent completion (and any other ring-integrity violation on either queue) is a fatal error that tears the runtime down through a reset-confirmed fail-stop, after which every cap call fails closed. Frames cross the cap boundary as inlineDatawith the 10-byte legacy header added/stripped kernel-side; PIO, vring, and DMA-frame ownership stay kernel-side, and release quiesces the device (reset-confirmed before frames are freed).VIRTIO_NET_F_STATUSis not negotiated, solinkStatusreports assumed-up while the runtime is staged. This is the serving bridge the Phase C userspace network stack uses on the GCE NIC shape; proofmake run-cloud-gce-legacy-virtio-webui-serving(host HTTP peer fetches the remote-session Web UI bundle through QEMUhostfwdover this datapath). - capOS mapping (brokered PIO config access): capOS device authority
(
DeviceMmio/DDF) is MMIO/memory-BAR based and there is no I/O-port capability. The legacy config window stays kernel-owned: the only sanctioned path to a device’s legacy I/O BAR is the bounds-checkedpci::LegacyIoBaraccessor (kernel/src/pci.rs), reached throughpci::io_bar(device, bar). Every access is range-checked against the device’s claimed I/O BAR window and the 16-bit x86 port space, so a caller cannot reach a port outside the BAR; there is no ambientin/outauthority and no port-I/O surface exposed to userspace. PCI I/O decoding is enabled per device viapci::enable_io_space_and_bus_masterbefore any access. - Candidate gate (MSI-X not required): the legacy candidate selection
(
virtio_net_legacy_select_proof::pick_legacy_candidate) accepts a transitional virtio-net function (1af4:1000, network class) whose modern common-config window does not resolve and which exposes a usable I/O BAR0 — without requiring an MSI-X capability, because the polled data path does not depend on interrupt delivery. This is the deliberate relaxation of the modern gate (virtio_net_polled_provider::candidate_from_device, which requires both the modern transport and MSI-X). - Fail-closed rules: the brokered status handshake fails closed on any
out-of-window access, any device-status regression, a missing
VIRTIO_NET_F_MAC, a guest-feature write-back mismatch, or a zero queue-0 size. The data path additionally fails closed on a device-MAC read failure (out-of-window, all-zero, or broadcast), an out-of-window queue PFN/notify access, a PFN read-back mismatch, an advertised queue size that is zero or exceeds the materialization bound, a TX used-ring poll-budget exhaustion, an RX wall-clock-budget exhaustion, a used-ring id/len regression, a zero EtherType, or a final reset that does not settle to0x00. On any such failure noprovider-nic-boundmarker is emitted (the gate staysnull/fail-closed), and the device is reset and its DMA frames freed regardless of outcome. The completion is observed by polling the legacy used ring with no MSI-X route programmed and no interrupt taken or injected; theprovider-nic-boundmarker (provider_nic_bind_proof::report_real_completion_legacy) carries honesttransport=legacy-pio-virtio-0.9,interrupt_model=polled-no-msix, anduserspace_driver_authority=kernel-brokered-legacy-polledlabels, and nothing is exported to userspace (host_physical_user_visible=0,direct_dma=blocked). - QEMU-emulable vs hardware-only: the legacy shape is QEMU-emulable via
qemu-system-x86_64 -device virtio-net-pci,disable-modern=on,vectors=0(legacy I/O BAR0, INTx, no MSI-X — the faithful GCE shape). Proved locally bymake run-cloud-provider-virtio-net-legacy-selecton the default non-qemukernel: the kernel selects the legacy NIC over its I/O BAR0, runs the brokered device-status handshake + 32-bit feature read (observeddevice_features=0x79bf8064,VIRTIO_NET_F_MACset) + queue-0 size read (256), and emitscloudboot-evidence: virtio-net-legacy-candidate-selected <token>. The data path is proved bymake run-cloud-provider-nic-bound-legacyon the same device shape: the kernel reads the device’s real MAC (52:54:00:12:34:56under QEMU), materializes the legacy single-PFN virtqueues, TX-submits a broadcast DHCP DISCOVER from that MAC, and completes a real polled RX against the wall-clock budget (observedsrc_mac=52:54:00:12:34:56,rx_used_len=600,ethertype=0x0800— the SLIRP DHCP OFFER, an IPv4 frame —tx_used_idx=1,rx_used_idx=1,rx_clock_usable=true,final_status=0x00), then emits exactly onecloudboot-evidence: provider-nic-bound <token>sourced from that completion (token carries-ethertype.0800and-srcmac.525400123456). The DHCP-discover-from-real-MAC stimulus is the GCE-viable path: GCE’s Andromeda SDN drops egress from a spoofed source MAC/IP and has no10.0.2.2ARP responder, so the modern path’s spoofed ARP-to-SLIRP-gateway stimulus would time out on a real NIC; a real-MAC DHCP DISCOVER is answered by both SLIRP and the GCE SDN. The follow-up billable real-GCE runcloud-prod-gce-billable-boot-real-polled-nic-boundpassed on 2026-06-02 15:03 UTC through this legacy path: the live1af4:1000NIC bound at00:04.0, materialized the 4096-entry RX/TX vrings, transmitted DHCP DISCOVER from the device MAC, received a real IPv4 frame (rx_used_len=532,ethertype=0x0800), and emittedprovider-nic-boundfromreport_real_completion_legacy. This remains a bounded raw-frame bind proof, not L4 networking or a reusable userspace provider service.
Related
kernel/src/virtio.rs– PCI transport discovery, split-ring transport, feature negotiation, framing.kernel/src/cap/network.rs– accepted-socket cap state and the network capability surface.docs/proposals/networking-proposal.md– the userspace network-stack move (Phase C) and the transitional-kernel status.docs/dma-isolation-design.md– the DMA backend and isolation model the userspace successor binds into.