virtio-net (modern PCI NIC)

This is a provenance map for the in-tree virtio-net driver: it cites the spec, summarizes only the wire-format subset the code actually implements, and points into the implementation. It is not a re-spec – where the spec is implemented unchanged, it links rather than transcribes. The driver is mature and transitional (in-kernel today, slated to move to a userspace network-stack process), so the treatment is a concise map rather than exhaustive register tables.

1. Spec basis

Device: virtio network device, modern (virtio 1.x) PCI transport. PCI vendor 0x1af4; device 0x1041 (modern) / 0x1000 (transitional). IDs at kernel/src/pci.rs (VIRTIO_VENDOR_ID, VIRTIO_NET_MODERN_DEVICE_ID, VIRTIO_NET_TRANSITIONAL_DEVICE_ID; matched by PciDevice::is_virtio_net).
Authoritative spec: Virtual I/O Device (VIRTIO) Version 1.2, OASIS Committee Specification 01 (2022-07-01). Source: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html. Relevant sections: 4.1 (virtio over PCI bus), 2.7 (split virtqueues), 5.1 (network device).
Reference: cross-checked against the Linux virtio_net and virtio_pci_modern drivers for the modern-transport handshake and split-ring layout.

2. Wire format (implemented subset)

Only the modern split-ring subset the driver uses is summarized here; feature bits and structures the spec defines but the driver does not specially handle are linked, not transcribed.

PCI capabilities / BAR layout: virtio modern PCI vendor capabilities (common / notify / ISR / device / PCI cfg) parsed from the capability list; type constants VIRTIO_PCI_CAP_COMMON_CFG / ..._NOTIFY_CFG / ..._ISR_CFG / ..._DEVICE_CFG / ..._PCI_CFG and length floors VIRTIO_PCI_CAP_MIN_LEN / VIRTIO_COMMON_CFG_MIN_LEN in kernel/src/virtio.rs; common-config register offsets are the transport::COMMON_* constants (COMMON_DEVICE_FEATURE, COMMON_QUEUE_SELECT, COMMON_QUEUE_NOTIFY_OFF, …). The notify capability carries notify_off_multiplier (ModernTransport::notify_off_multiplier) used to compute per-queue notify addresses.
Split-ring layout: 16-byte descriptors (transport::VIRTQ_DESC_SIZE), available and used ring offsets, and the transport::VIRTQ_DESC_F_NEXT / transport::VIRTQ_DESC_F_WRITE flags. Descriptor lifecycle is generation-tracked through a bounded transport::VIRTQ_DESCRIPTOR_TRACKING_SLOTS slot array (DescriptorTrackingSlot).
Queues: RX queue (VIRTIO_NET_RX_QUEUE), TX queue (VIRTIO_NET_TX_QUEUE), negotiated to a bounded target size (VIRTIO_NET_QUEUE_TARGET_SIZE); the target size must not exceed the tracking slot count (compile-time assert! against transport::VIRTQ_DESCRIPTOR_TRACKING_SLOTS).
Net header / framing: 12-byte virtio_net header prepended to frames (VIRTIO_NET_HDR_LEN); proof TX buffers carry the header plus a minimum Ethernet frame (TX_PROOF_BUFFER_LEN, TX_PROOF_ETHERNET_OFFSET).
Feature negotiation: device/driver feature select/read registers in the common config; the driver negotiates transport::VIRTIO_F_VERSION_1 / transport::VIRTIO_F_ACCESS_PLATFORM (generic, in transport) plus the net-specific VIRTIO_NET_F_MAC (1 << 5) and acknowledges VIRTIO_NET_F_MRG_RXBUF (1 << 15).

3. capOS mapping

Binding (transitional): virtio-net is currently driven in the kernel. PCI/MSI-X transport discovery, the split-ring transport, smoltcp, TCP listeners, the line discipline, and the Telnet IAC filter live in kernel/src/virtio.rs and kernel/src/cap/network.rs. This is explicitly transitional: Phase C of the networking proposal (docs/proposals/networking-proposal.md) moves the NIC driver and stack into a userspace network-stack process once the userspace-driver authority gate applies to it. Until then it does not bind through the DeviceMmio/Interrupt/DMAPool provider grants the DDF cloud-NIC drivers use; the sections below describe its kernel-owned equivalents.
MMIO: modern-transport common/notify/ISR/device config regions are mapped from the device BARs and accessed through the transport MMIO helpers (kernel/src/virtio.rs transport module). Doorbell (queue-notify) writes are scoped to the per-queue notify address computed from notify_off_multiplier; the DDF DeviceMmio cap (kernel/src/cap/device_mmio.rs) is the userspace successor surface. The production (not(feature = "qemu")) DeviceMmio grant source (kernel/src/cap/devicemmio_grant_source_prod.rs) binds to a virtio-net function only when the manifest authors that function’s stable PCI identity in SystemConfig.deviceMmioTarget (not a first-usable-BAR scan), and it excludes the function’s MSI-X table / PBA windows from the brokered read32 surface: the MSI-X BAR (BAR1 under QEMU, table at 0x0 / PBA at 0x800) is bound, but a read into either window fails closed (devicemmio-read32-msix-window-excluded) while non-MSI-X offsets read normally. Proof: make run-cloud-devicemmio-grant. The userspace-handshake proof path stages a narrower window instead – the modern common-configuration structure, whose BAR/offset/length the device authors through its own virtio vendor capabilities. Because a device could point common_cfg at its own MSI-X structures, that window is classified against the function’s MSI-X layout before it is mapped, and staging fails closed on an overlap (result=common-config-overlaps-msix) or an undeterminable layout (result=common-config-msix-layout-unknown). A read-exclusion is not sufficient there: the selected-write claim admits writes at fixed window-relative offsets, which an exclusion does not bound. Proof: make run-cloud-prod-nic-driver-userspace-features-ok. The notify window the RX-DMA bring-up path maps kernel-side is device-authored the same way and is written to (doorbells), so it is held to the same disjointness rule (result=notify-capture-failed). Proof: make run-cloud-prod-nic-driver-userspace-rx-bringup.
Interrupt: MSI-X vectors are programmed for config and per-queue interrupts; route records and vector dispatch are tracked by the kernel-owned device-interrupt ledger (kernel/src/device_interrupt.rs). The make run-net smoke asserts MSI-X metadata selection, vector-pool/exhaustion policy, masked route lifecycle, queue vector assignment, descriptor guards, ARP, and ICMP. Device-autonomous delivery proofs live in the dedicated userspace-provider MSI-X gates, not in the retired kernel L4 owner.
DMA: ring pages and TX/RX buffers are allocated and accounted through the net-keyed kernel DMA ledger (kernel/src/device_dma.rs). make run-net runs without an emulated IOMMU, so DMA uses the intended bounce-buffer fallback; no host physical address or IOVA is exposed beyond the kernel boundary.
DMA teardown contract (reset before free): a DRIVER_OK/bus-master virtio-net device must be reset+quiesced before any of its queue or buffer frames are returned to the allocator, so a still-live device cannot DMA into a frame a later owner now holds. impl Drop for VirtioNetDriver (kernel/src/virtio.rs) writes COMMON_DEVICE_STATUS = 0 and waits for the reset before its rx/tx/rx_buffers/tx_*_buffer DmaPage fields drop and free; if the reset cannot be confirmed it quarantines those frames (DmaPage::quarantine, leaked rather than freed). Every initialize error path after DRIVER_OK (including the DMA-accounting proof failures) unwinds through this Drop. The run-virtio-net-teardown-reset-before-free proof (the virtio_net_teardown_reset_before_free_proof feature) forces the DMA-accounting pool proof to fail after DRIVER_OK and asserts the reset-before-free evidence marker. The queue-materialization proofs (kernel/src/cap/virtio_net_{tx,rx}_queue_materialization_proof.rs) reset the device before releasing their frames for the same reason; they additionally publish no avail entry and ring no doorbell, so the device never references those frames.
Production cloud build cfg surgery (DMA ledger + DDF caps): kernel/src/device_dma.rs and the cap surfaces kernel/src/cap/dma_pool.rs (DmaPoolCap/DmaPoolCapInfo) and kernel/src/cap/dma_buffer.rs (DmaBufferCap/DmaBufferCapInfo) compile in the non-qemu build. The cloud-prod-dmapool-bounce-buffer-grant-proof wires the first production caller through kernel/src/cap/dmapool_bounce_buffer_grant_proof.rs: it stages a parked manager-attached DMAPool record over one DMA-capable PCI function from the inventory (stage_bounce_buffer_dmapool_record in kernel/src/device_manager/stub.rs), builds a DmaPoolCap over the parked handle, allocates one bounded bounce-buffer DMABuffer through device_manager::issue_manager_attached_dmabuffer_handle_with_request (which routes to device_dma::allocate_manager_attached_dmapool_bounce_buffer_page), asserts cap-info labels (userspace_dmapool=manager-issued-bounce-buffer, allocation=single-bounce-buffer-page, real_dma=not-attempted, direct_dma=blocked, host_physical_user_visible=0, iova_export=disabled-future-only), the dma_backend::select_and_report bounce-buffer verdict, quiesce-before- release (release_dmapool_record_for_cap_release returns pending-buffer-release while the buffer is live), scrub-before-reuse (the released bounce-buffer frame is zeroed in place before the frame returns to the allocator), and stale-handle-after-detach, then emits cloudboot-evidence: dma-pool-grant <token> for the cloudboot harness. The qemu-only surface that stays gated includes the cap::dmapool_grant_source bootstrap source (kernel/src/cap/dmapool_grant_source.rs), the KernelCapSource::DmaPool grant arms in kernel/src/cap/mod.rs and kernel/src/cap/process_spawner.rs, the DmaBufferCompleteDescriptorAdmission::provider_cq_event field that carries cap::interrupt_grant_source::ProviderCompletionCqEventIdentity, and the entire kernel/src/device_manager/qemu_full.rs DDF backend (including device_dma::{begin_virtio_net_pool, allocate_virtio_net_page, ...}). The proof maps no userspace VMA, programs no real DMA, attaches no queue, programs no interrupt, and emits no provider-nic-bound / storage-bound; descendants in docs/backlog/hardware-boot-storage.md#cloud-device-tracks cover those.
Fail-closed rules: requested ranges are validated against device-reported geometry and destination buffer length before any device access; descriptor reuse is generation-tracked; the bounded tracking-slot array (transport::VIRTQ_DESCRIPTOR_TRACKING_SLOTS, DescriptorTrackingSlot) caps in-flight descriptors. Stale/over-range requests fail closed.
QEMU-emulable vs hardware-only: fully QEMU-emulable. QEMU provides virtio-net-pci; make run-net is the end-to-end proof. No hardware-only path – this is the local-binding reference the cloud NIC drivers (ENA, MANA, GCP virtio-net) mirror for their QEMU-provable halves.
GCP cloud-shape classification: GCP 1st/2nd-gen x86 non-Confidential machine families (e.g. n1-*, e2-*) present the virtual NIC as exactly this standard virtio-net device (vendor 0x1af4) under a no-IOMMU / SWIOTLB bounce-buffer DMA backend, so the QEMU virtio-net-pci binding is the local precursor for the GCP NIC path. The enumeration path emits a virtio-net: cloud shape classification proof line (kernel/src/pci.rs report_cloud_virtio_net_shape) classifying the enumerated function against that documented GCP surface; both make run-net and make run-ddf-provider-consumer assert it conjunctively with the GCP-mapped bounce-buffer dma: backend selection line (kernel/src/dma_backend.rs select_and_report). The GCP→bounce-buffer mapping itself is the support-policy expectation recorded in docs/research/cloud-dma-provider-evidence.md. The proof carries explicit scope flags (local_qemu_precursor=true, real_gcp_enumeration=not-claimed, gvnic=separate-driver-out-of-scope); live GCP enumeration and cloud used-ring ownership remain cloud-gcp-virtio-net-nic-driver.
Production cloud-boot evidence marker (dma-backend): the production boot path (the kernel built without the qemu feature, which is what make capos-cloudboot-image packages) emits the parseable cloudboot-evidence: dma-backend <token> serial marker the tools/cloudboot/ harness reads (serial_marker_tokens; “Serial evidence-marker contract” in tools/cloudboot/README.md). It is emitted by kernel/src/dma_backend.rs select_and_report (always-compiled, so it fires on the production cloud image, not just the qemu smoke build) alongside the human-readable dma: backend selection line. The marker uses the harness token namespace (direct_dma / trusted_domain / bounce_buffer), mapped from the resolved DmaBackend by cloudboot_evidence_token – deliberately not the DmaBackend Display string (direct-remapping / bounce-buffer). The current two-variant resolved backend maps to direct_dma / bounce_buffer; on the probed GCE shapes (IOMMU disabled) the value is bounce_buffer. The trusted_domain slot has no current producer and is reserved. This marker is honest read-side evidence of the boot-time DMA-backend selection; it asserts no device bind and is independent of the bound-through-authority provider-nic-bound marker, which remains the cloud-gcp-virtio-net-nic-driver claim.
Production cloud-boot evidence marker (device-class): the production boot path also emits the companion cloudboot-evidence: device-class <token> serial marker (one per distinct enumerated PCI base class, harness-deduped via sort -u; “Serial evidence-marker contract” in tools/cloudboot/README.md).
- Spec basis: PCI base-class codes from the PCI Code and ID Assignment Specification (PCI-SIG); the base class is the high byte of the class-code/revision dword at config-space offset 0x08 (kernel/src/pci.rs PCI_CLASS_REVISION).
- Implemented wire-format subset: a genuinely read-only config-space scan over the production source resolved from the boot-time MCFG probe: ECAM when MCFG validates, otherwise legacy I/O. report_cloudboot_device_class_evidence (kernel/src/pci.rs) walks each bus/device/function via for_each_enumerated_function and the read-only functions_to_scan helper, reading only the vendor-id, header-type, and class-code (PCI_CLASS_REVISION) words. The base class is the high byte of PCI_CLASS_REVISION. It deliberately does not call read_device/read_bars, which would perform transient BAR-sizing config writes. Each distinct base class is emitted once in ascending order, formatted {:#04x} (e.g. 0x02). The marker is emitted from kernel/src/main.rs run_init, so it fires on every build configuration (not only the qemu/diagnostics PCI-diagnostics path), including the non-qemu production cloud image.
- capOS mapping: enumeration evidence only – it allocates no DeviceMmio/Interrupt/DMAPool, claims no device ownership, performs no bus-master enable, BAR mapping, BAR-sizing write, or DMA, and never emits provider-nic-bound.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by a QEMU boot of target/disk.raw (the make capos-cloudboot-image production image; README “Local boot test”), which shows base classes 0x01 (storage), 0x02 (network), 0x03 (display), and 0x06 (bridge). No GCE resources are created and no make cloudboot-test run is required.
Production cloud-boot evidence marker (device-inventory): the production boot path also emits a per-function PCI claim-identity inventory so later bind children discover the real device identity instead of assuming the QEMU-fixed BDF layout the --features qemu path hard-codes. It emits a human-readable pci-inventory: detail line per enumerated function plus the parseable cloudboot-evidence: device-inventory <token> marker (one per function, harness-deduped via sort -u; “Serial evidence-marker contract” in tools/cloudboot/README.md).
- Spec basis: the PCI Local Bus Specification (PCI-SIG) Type 0 configuration header — vendor/device ids at offsets 0x00/0x02, the class-code triple (base class / subclass / prog-if) in the high three bytes of the class-code/revision dword at offset 0x08, header type at offset 0x0e, and interrupt line / pin at offset 0x3c (§6.1 “Configuration Space Organization”). BAR registers are not part of this production marker.
- Implemented wire-format subset: report_cloudboot_device_inventory_evidence (kernel/src/pci.rs) walks each bus/device/function via for_each_enumerated_function and the read-only functions_to_scan helper. For each present function, read_cloudboot_inventory_record reads only vendor/device, class/subclass/prog-if, revision, header type, interrupt pin, and interrupt line. report_cloudboot_inventory_record formats one identity token: <seg>.<bus>.<dev>.<fn>-<vendor>.<device>-<class>.<subclass>.<progif>-rev.<rev>-hdr.<hdr>-irq.<pin>.<line>. It is emitted from kernel/src/main.rs run_init right after the device-class markers, on every build configuration including the non-qemu production cloud image.
- capOS mapping: read-only enumeration evidence. The production marker performs no BAR-size probe, config-space write, BAR mapping, bus-master / memory-space / IO-space command-bit enable, doorbell write, DMA, or device ownership claim, and never emits provider-nic-bound. The later cloud-NIC bind children consume this inventory to resolve the real PCI function identity instead of the QEMU-fixed BDF fixtures; BAR/MMIO authority is proven by separate DeviceMmio evidence paths.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by a QEMU boot of target/disk.raw (the make capos-cloudboot-image production image; README “Local boot test”), which shows the per-function pci-inventory: lines and device-inventory markers for the emulated functions (virtio-net 1af4, storage, display, bridge). No GCE resources are created and no make cloudboot-test run is required.
Production-build device_manager / DeviceMmio compile surface: kernel/src/device_manager/mod.rs is now always compiled, but it is a thin orchestrator that re-exports a shared subset (error.rs, handles.rs, mmio.rs, types.rs — DeviceManagerError, DeviceMmioHandle / DeviceOwner / PciBdf / DeviceMmioRegion, the MMIO record / map / unmap / read32 / write32 admission types, DeviceMmioCapReleaseOutcome, ProviderNotifyDoorbellWrite) plus a feature-gated implementation: under cfg(feature = "qemu") it routes through qemu_full.rs (the full DDF surface — dma_buffer.rs / dma_pool.rs / interrupt.rs / proofs.rs, NVMe brokered controller registers, IOMMU domain ledgers, virtio TX/RX ring publication); under cfg(not(feature = "qemu")) it routes through stub.rs, which now carries a bounded one-slot parked-region path used by the production bar-readback proof (stage_bar_readback_region, validate_devicemmio_record, read_devicemmio_u32, detach_devicemmio_record_for_cap_release, trigger_*_for_devicemmio). The DMA/write/notify/map shims still report DeviceMmioStaleHandle because no production caller exists yet for those; the descendant slices in docs/backlog/hardware-boot-storage.md un-gate them through the reviewed grant path. kernel/src/cap/device_mmio.rs and its super::hardware_audit / super::hardware_release_log audit hooks are likewise always compiled. The KernelCapSource::DeviceMmio user-facing grant arm in kernel/src/cap/mod.rs stays cfg(feature = "qemu")-gated; the production bar-readback proof builds its DeviceMmioCap from boot (cap::devicemmio_bar_readback, see below) without going through that user-facing grant arm. The crate::iommu module and the real kernel/src/virtio.rs stay cfg(feature = "qemu")-gated. The crate::device_dma module compiles in both builds for the dmapool-grant proof, and the crate::device_interrupt module compiles in both builds for the interrupt route/source allocation proof below; their KernelCapSource::Interrupt user-facing grant arm and interrupt_grant_source bootstrap-grant module in kernel/src/cap/mod.rs stay cfg(feature = "qemu")-gated.
Production cloud-boot evidence marker (device-mmio-bar-read): the production boot path also exercises one PCI function’s first memory BAR through the reviewed DeviceMmioCap read32 surface and emits a parseable cloudboot-evidence: device-mmio-bar-read <token> marker.
- Spec basis: PCI Local Bus Specification (PCI-SIG) Type 0 memory BAR semantics. The marker carries the function’s BDF, the BAR index, the 32-bit value read at offset 0, and the kernel-mapped window length. The kernel-side cache policy is device-uncacheable (UC) + NX + GLOBAL + WRITABLE, matching the existing mem::paging::map_kernel_mmio_range contract for MMIO windows.
- Implemented wire-format subset: cap::devicemmio_bar_readback::report (kernel/src/cap/devicemmio_bar_readback.rs) enumerates PCI functions via pci::enumerate(), picks the first with a memory BAR of at least 4 KiB at a non-zero base, maps the first 4 KiB of that BAR through mem::paging::map_kernel_mmio_range, stages a parked region through device_manager::stage_bar_readback_region (one slot, mapping generation monotonic), constructs a DeviceMmioCap over the resulting DeviceMmioHandle, and calls cap.read32(0). The read goes through the same validate_devicemmio_record → range/alignment check → read_volatile path as the qemu DDF surface; on the production path the parked region’s recorded kernel virtual address backs the read. The marker token shape is <seg>.<bus>.<dev>.<fn>-b<bar>-<value>-len.<len> (value in 32-bit hex with 0x prefix, length in hex bytes), inside the harness grammar [A-Za-z0-9._-]+.
- Fail-closed assertions: the proof immediately retries read32 at exactly length and asserts range_result != "ok" (out-of-range read is rejected with no MMIO side effect), then detaches the parked record through detach_devicemmio_record_for_cap_release and asserts the next read32(0) fails closed at the device manager (DeviceMmioStaleHandle). Both outcomes are logged on a devicemmio-bar-readback: range_bounding ... / stale_generation ... line so a regression trips the boot log alongside the missing marker.
- capOS mapping: the mapping is boot-only kernel-half (no userspace VMA is exposed by this proof); revocation drops the parked slot, which invalidates the cap-side identity without removing the kernel mapping itself (the boot-only mapping stays installed for the rest of the boot). The descendant userspace-driver slices in docs/backlog/hardware-boot-storage.md#cloud-device-tracks add the userspace VMA path with TLB shootdown on revoke.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by a QEMU boot of target/disk.raw (the make capos-cloudboot-image production image; README “Local boot test”), which shows the marker for the emulated virtio function. No GCE resources are created and no make cloudboot-test run is required. The qemu build keeps the existing make run-devicemmio-grant smoke as the end-to-end DDF proof; the bar-readback caller in cap::devicemmio_bar_readback is gated to the production (non-qemu) build so it does not collide with the qemu DDF surface’s own DeviceMmio claim path.
Production cloud-boot evidence marker (interrupt-route-allocated): the production boot path also exercises one PCI function’s MSI-X capability through the reviewed device_interrupt vector pool and emits a parseable cloudboot-evidence: interrupt-route-allocated <token> marker.
- Spec basis: PCI Local Bus Specification (PCI-SIG) 3.0 §6.8.2 / PCI Express Base Specification 4.0 §7.7.2.2 MSI-X capability structure. The capability header dword exposes Control (function-mask, table-size-1, enable), the Table BIR/Offset dword exposes the BAR index in the low 3 bits and the byte offset in the upper bits (each table entry is 16 bytes), and the PBA BIR/Offset dword exposes the Pending Bit Array location. The marker carries the function’s BDF, the selected MSI-X table entry, the kernel-pool MSI vector, and the route/source generation pair allocated for the entry. No live MSI-X table write or device interrupt is performed on this path.
- Implemented wire-format subset: cap::interrupt_route_alloc::report (kernel/src/cap/interrupt_route_alloc.rs) enumerates PCI functions via pci::enumerate(), walks each function’s capability list through pci::capabilities, parses MSI-X capability fields through pci::interrupt_capabilities / parse_msix_capability (offset, control, table_size, table_bir, table_offset, pba_bir, pba_offset, both validated through the existing MSI-X region BAR checks), picks the first MSI-X capability with table_size >= 1, and allocates a kernel-owned MSI vector + interrupt source/route record over its first table entry (SELECTED_TABLE_ENTRY = 0) through the production device_interrupt::register_pci_msix_route_by_bdf vector pool (kernel/src/device_interrupt.rs, lapic::DEVICE_MSI_VECTOR_BASE = 0x50, 16 device-MSI vectors). It then device_interrupt::claim_routes the route under DeviceInterruptDriver::ManagerGrantSource. The marker token shape is <seg>.<bus>.<dev>.<fn>-entry.<n>-vector.<hex>-src.<id>.gen.<g>-route.gen.<g> (vector in 2-digit hex, source-id and generations decimal), inside the harness grammar [A-Za-z0-9._-]+.
- Fail-closed assertions: the proof asserts three invariants inline before emitting the marker. (1) Claimed-state visibility: validate_claimed_route succeeds for the correct ManagerGrantSource owner and fails closed with WrongOwner for a distinct KernelIoApicProof owner – the route is owner-scoped. (2) Duplicate-source rejection: a second register_pci_msix_route_by_bdf against the same (bdf, table_entry) while the original route is live is rejected with DuplicateSource – the source identity is unique. (3) Stale-after-release: release_claimed_route clears the slot and a subsequent validate_claimed_route on the same handle fails closed with StaleRoute – no stale handle can re-enter the route table. Each outcome is logged on an interrupt-route-alloc: claimed_state ... / duplicate_source ... / stale_after_release ... line so a regression trips the boot log alongside the missing marker.
- capOS mapping: route/source-allocation evidence only. The proof parses the MSI-X capability and consumes one slot from the kernel-owned device-MSI vector pool, then returns it on release; it does NOT map the MSI-X table or PBA BAR window, write a table entry, program a LAPIC dispatch slot for live delivery, raise/handle a device interrupt, install a waiter, acknowledge an EOI, or exercise mask/unmask/reset on the live vector. No provider-nic-bound or storage-bound marker. The follow-on live-delivery proof (interrupt-route-delivered) extends this surface; see the next section. The cap::interrupt_route_alloc caller is gated to the production (non-qemu) build so it does not collide with the qemu DDF surface’s own Interrupt claim path.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by a QEMU boot of target/disk.raw (the make capos-cloudboot-image production image; README “Local boot test”), which shows the marker for the emulated virtio function (QEMU’s modern virtio-pci front-end exposes a per-function MSI-X capability). No GCE resources are created and no make cloudboot-test run is required. The qemu build keeps the existing make run-interrupt-grant and make run-net smokes as the end-to-end DDF and virtio-net MSI-X proofs.
Production cloud-boot evidence marker (interrupt-route-delivered): the production boot path then extends the route-allocation proof to live MSI-X delivery: it programs the table entry, attaches the route to device manager, arms the deferred-LAPIC-EOI gate, injects one grant- source dispatch, retires the deferred EOI, masks and re-injects to prove no stale wake, reassigns to bump the route generation, asserts the stale handle + stale pending token both fail closed, then releases. Emits one cloudboot-evidence: interrupt-route-delivered <token> marker.
- Spec basis: PCI Local Bus Specification 3.0 §6.8.2 / PCI Express Base Specification 4.0 §7.7.2.2 MSI-X table entry layout (16-byte entries: 64-bit message address, 32-bit message data, 32-bit vector control with bit 0 = entry mask) plus the per-spec mask-first write ordering (the entry mask must be asserted before message address/data are torn). Intel SDM Vol. 3A §10.8 LAPIC EOI semantics for the deferred-EOI write retired by acknowledge_deferred_lapic_eoi_for_route against the LAPIC EOI register (arch::x86_64::lapic::eoi).
- Implemented wire-format subset: cap::interrupt_delivery_proof::report (kernel/src/cap/interrupt_delivery_proof.rs) reuses pci::map_msix_table to map the MSI-X table BAR window kernel-side (UC + NX + GLOBAL + WRITABLE through mem::paging::map_kernel_mmio_range) and pci::write_msix_table_entry to program entry 0 with the route’s message_address (from arch::x86_64::lapic::current_device_msi_delivery) and message_data (the allocated kernel-pool vector) under per-spec mask-first ordering. It then attaches the route through device_interrupt::attach_claimed_route_to_device_manager, enables the deferred-LAPIC-EOI gate via device_interrupt::enable_deferred_lapic_eoi_for_route, unmasks the route through device_interrupt::unmask_device_manager_attached_route and the table entry through pci::set_msix_table_entry_mask, drives one injected dispatch through device_interrupt::handle_lapic_delivery (the same dispatch slot the qemu make run-interrupt-grant proof and nvme-admin-interrupt-delivery exercise), retires the deferred EOI via device_interrupt::acknowledge_deferred_lapic_eoi_for_route, masks both surfaces and re-injects through device_interrupt::record_lapic_delivery, reassigns via device_interrupt::reassign_claimed_route to bump the route generation, and asserts stale-handle / stale-pending-token rejection through device_interrupt::validate_claimed_route / device_interrupt::check_pending_lapic_token.
- Fail-closed assertions: five inline assertions gate the marker. (1) Live delivery: handle_lapic_delivery returns a Delivered { .. } outcome bound to the live route’s (source_id, source_generation, route_generation, owner), delivery_count advances by 1, eoi_deferred=true, and pending_deferred_eoi_count >= 1. (2) Ordered acknowledge: acknowledge_deferred_lapic_eoi_for_route reports eoi_written=true, ack_delta=1, and pending_after=0 – each pending unit retires exactly one LAPIC EOI through the counter- based exclusion device_interrupt.rs documents at acknowledge_deferred_lapic_eoi_for_route / close_deferred_eoi_gate_and_drain. (3) Masked no-wake: after mask, record_lapic_delivery returns Masked { state: ClaimedMasked, .. } and delivery_count does not advance. (4) Reassign generation bump + stale handle: the prior handle’s validate_claimed_route returns StaleRoute; the stale pending token’s check_pending_lapic_token reports wake_blocked=true with either Unregistered (the live evidence case: reassign’s first_available_vector runs before clear_dispatch_slot retires the old vector, so the next pool slot is chosen and the stale token names an unregistered vector) or SourceRouteGenerationMismatch (the single-slot-pool degenerate case where reassign reused the same vector); and a fresh injected dispatch under the reassigned route + vector lands on the new generation while leaving the stale token blocked. (5) Release: release_claimed_route clears the slot and validate_claimed_route on the reassigned handle now fails closed with StaleRoute. Each outcome is logged on interrupt-delivery: live_delivery ... / ordered_acknowledge ... / masked_no_wake ... / reassign_stale ... / release ... lines so a regression trips the boot log alongside the missing marker.
- capOS mapping: route/source allocation + live delivery + ordered acknowledge + mask/unmask + reset/reassignment + stale-route-generation rejection, all on the production cloud kernel. The MSI-X table entry is programmed but the PCI function-level MSIX_CONTROL_ENABLE bit is intentionally NOT toggled (the proof never enables MSI-X on the function, so no real device-autonomous interrupt can fire on the programmed entry); the proof exits with the table entry re-masked. There is no userspace Interrupt waiter on the production cloud kernel yet, so the proof’s “waiter wake” boundary is the kernel-side dispatch slot a real provider waiter would consume — the marker reports waiter_wake=kernel-side-proxy rather than overclaiming a provider-cap-side wake. No provider-nic-bound or storage-bound marker. The cap::interrupt_delivery_proof caller is gated to the production (non-qemu) build so it does not collide with the qemu DDF surface’s own Interrupt claim path; the qemu build keeps make run-interrupt-grant as the broader end-to-end exercise of the interrupt grant surface with the full DDF backend.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by a QEMU boot of target/disk.raw (the make capos-cloudboot-image production image; README “Local boot test”), which shows the marker for the emulated virtio function. No GCE resources are created and no make cloudboot-test run is required. PBA handling is recorded by including the pba_bir and pba_offset from MsixCapabilityInfo in the proof’s ok line; the kernel does not read or clear PBA bits (devices set them, and this proof never enables the function so no PBA bit can be set in practice).
Production cloud-boot evidence marker (provider-nic-bound): the gate the billable GCE run consumes through tools/cloudboot/ NIC_PROOF_MARKER / --require-provider-nic-proof. It is sourced from real userspace driver progress: the marker fires only after the always-built polled virtio-net provider cap::virtio_net_polled_provider has completed a TX+RX over the live function and observed the RX completion by polling the latched used ring (zero kernel-injected interrupts). The predecessor staged its own DeviceMmio
- DMAPool/DMABuffer + MSI-X Interrupt grant surfaces at boot and proved the “queue-completion handoff” by calling device_interrupt::handle_lapic_delivery — a kernel-side dispatch-slot proxy (the inject_real_lapic_int_for_proof precedent). That proxy is removed as the source of the gate: cap::provider_nic_bind_proof::report now runs once at boot and emits no marker (it records the deferral to the real provider’s completion); the marker is emitted later from cap::provider_nic_bind_proof::report_real_completion, called from the provider’s release-time completion path.
- Spec basis: virtio 1.2 §2.7 split-ring used-ring semantics (the device writes a used element; the driver observes used.idx advance) — the completion the provider polls; virtio 1.2 §5.1.6 virtio-net receiveq frame layout (12-byte modern header + ethernet frame) for the EtherType read-back; inherited MSI-X table layout / mask-first ordering (PCI 3.0 §6.8.2) only for the release-time route assertion chain, which never delivers an interrupt on the completion path.
- Implemented wire-format subset: cap::virtio_net_polled_provider (staged when the booted manifest declares the cloud-provider-nic-bound-real-polled-driver-smoke binary) drives the modern virtio status sequence to DRIVER_OK, materializes the RX virtqueue (queue 0) + TX stimulus virtqueue (queue 1), holds the PCI function-level MSI-X enable mask-first, maps the notify region, and programs the RX MSI-X route over table entry 0 (used only by the release-time assertion chain). Its attempt_rx_submit (admitted from the userspace DMABuffer.submitDescriptor(queue=0)) publishes the RX descriptor (VIRTQ_DESC_F_WRITE), drives the ARP TX stimulus, polls the latched used ring for the one real device->host RX DMA, and resets the device; its invoke_wait reads the latched PublishedRx with delivery_count unchanged. report_real_completion then sources the provider-nic-bound token from that PublishedRx (used.idx, used[0].id, used[0].len, observed EtherType) plus the picked function identity.
- Fail-closed assertions: report_real_completion re-asserts the real RX completion facts independently of the provider’s own gate before the marker is emitted. (1) Real device->host RX DMA: the latched used ring advanced exactly once (used.idx == 1), the completion is the posted descriptor (used[0].id == 0), the device wrote a non-empty frame (used[0].len > 0), and the provider read back a non-zero EtherType. (2) Polled, not injected: the provider’s Interrupt.wait advanced no kernel dispatch (provider_observed_dispatch == 0) and retired no deferred LAPIC EOI (provider_observed_ack == 0). On any regression a provider-nic-bind: real-completion regression (no marker): ... line trips the boot log and no marker is emitted, so provider.json’s provider_nic_proof stays null and --require-provider-nic-proof fails closed.
- capOS mapping: the marker is now backed by real userspace driver progress on the production cloud kernel. It carries the real-provider labels waiter_wake=polled-used-ring, rx_completion=polled-used-ring, int_injected=0, userspace_driver_authority=present-real-polled-provider, virtio_common_config_write=performed, provider_tx_rx=completed, device_autonomous_raise=not-claimed, host_physical_user_visible=0, direct_dma=blocked, iova_export=disabled-future-only, and live_cloud=not-attempted — never the predecessor’s waiter_wake=kernel-side-proxy / userspace_driver_authority=absent-on-non-qemu. The RX queue_msix_vector stays VIRTIO_MSI_NO_VECTOR and the PCI function mask stays held, so the device cannot autonomously raise the MSI either; the completion stays polled. The literal system.cue fold (so a plain default cloudboot also emits provider-nic-bound from real progress without a focused manifest) is not yet implemented, to avoid perturbing the make run interactive shell/login boot; device-autonomous MSI-X is the parallel future work cloud-prod-virtio-net-rx-device-autonomous-msix-raise-local-proof.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-nic-bound-real-polled-driver on the default non-qemu kernel with no cloud_*_proof feature, on the make run-net device shape. No GCE resources are created; live_cloud=not-attempted.
Production cloud-boot evidence marker (virtio-net-device-bringup): the production boot path, under the focused-proof Cargo feature cloud_virtio_net_device_bringup_proof, drives a bounded virtio-net device bringup sequence kernel-side over the same virtio function the provider-nic-bound proof maps – but writes the virtio common-configuration status register (which provider-nic-bound never does). It is the first device-activation step toward the still-blocked cloud-gcp-virtio-net-nic-driver track. Emits one cloudboot-evidence: virtio-net-device-bringup <token> marker on the tools/cloudboot/ harness’s serial-port-1 path through make run-cloud-provider-virtio-net-bringup.
- Spec basis: virtio 1.2 §3.1.1 device initialization (reset, ACKNOWLEDGE, DRIVER, feature discovery + driver-feature select, FEATURES_OK re-read, DRIVER_OK), §4.1 (modern virtio over PCI: common / notify / ISR / device / PCI-cfg capabilities, common-config register layout from Table 4.1).
- Implemented wire-format subset: cap::virtio_net_device_bringup_proof::report (kernel/src/cap/virtio_net_device_bringup_proof.rs) picks the virtio-net PCI function (vendor VIRTIO_VENDOR_ID = 0x1af4, device VIRTIO_NET_TRANSITIONAL_DEVICE_ID = 0x1000 / VIRTIO_NET_MODERN_DEVICE_ID = 0x1041) from pci::enumerate, walks the modern virtio PCI vendor-capability chain through virtio_transport::parse_modern_pci_transport_capabilities, maps the resolved common-configuration region through pci::map_bar_region (UC + NX + GLOBAL + WRITABLE – same flags as the BAR-readback path), and drives the bringup using the shared MmioRegion accessors plus virtio_transport::{read_device_features, write_driver_features, STATUS_ACKNOWLEDGE, STATUS_DRIVER, STATUS_FEATURES_OK, STATUS_DRIVER_OK, STATUS_FAILED, VIRTIO_F_VERSION_1, COMMON_NUM_QUEUES}. The selected driver feature word is exactly VIRTIO_F_VERSION_1; no other device- or net-specific bit is accepted, so the proof never crosses into the queue-setup or descriptor surface the userspace virtio-net provider will own.
- Fail-closed assertions: four inline assertions gate the marker. (1) Negotiated feature set: the device’s offered 64-bit feature word advertises VIRTIO_F_VERSION_1, the written driver feature word equals exactly device_features & VIRTIO_F_VERSION_1. (2) Queue count visibility: the live COMMON_NUM_QUEUES read returns >= 2 (virtio-net always exposes RX + TX virtqueues, which this proof does not publish). (3) DRIVER_OK observation: the post-DRIVER_OK status read carries STATUS_ACKNOWLEDGE | STATUS_DRIVER | STATUS_FEATURES_OK | STATUS_DRIVER_OK set with STATUS_FAILED clear. (4) Final reset: a write of 0 to device_status reads back as 0. The proof wraps the status sequence so every exit path (success or any intermediate failure) writes 0 to device_status before returning, leaving the device in its post-reset state regardless of outcome. Per-stage outcomes log on virtio-net-device-bringup: ok ... / virtio-net-device-bringup: ... failed closed: ... lines so a regression trips the boot log alongside the missing marker.
- capOS mapping: focused-proof child of provider-nic-bound that extends the proven bind composition with virtio’s status sequence, kernel-side, over the same mapped BAR. The PCI function-level MSIX_CONTROL_ENABLE bit stays untoggled, no queue is published, no descriptor is written, no doorbell is rung, and no userspace virtio-net provider cap is issued. The marker’s trailing labels (queue_setup=not-attempted, tx_descriptor=not-published, userspace_cap=not-issued, msix_function_enable=not-toggled, device_autonomous_raise=not-attempted, live_cloud=not-attempted) re-anchor those bounds. Queue setup, descriptor publication, doorbell writes, and a userspace virtio-net provider on the production cloud boot manifest stay deferred to the still-blocked cloud-gcp-virtio-net-nic-driver track. The cap::virtio_net_device_bringup_proof caller is gated to cfg(all(not(feature = "qemu"), not(feature = "cloud_provider_cap_waiter_proof"), feature = "cloud_virtio_net_device_bringup_proof")); the qemu build keeps make run-net / make run-ddf-provider-consumer as the end-to-end exercise of the same surface with the full driver.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-bringup, which boots the focused-proof cloudboot kernel + manifest under QEMU and asserts the marker on serial. No GCE resources are created.
Production cloud-boot evidence marker (nic-driver-userspace-features-ok): the userspace virtio handshake step of the Phase C NIC-driver relocation track. Under the focused-proof Cargo feature cloud_virtio_net_userspace_features_ok_proof, the cap::devicemmio_grant_source_prod source stages the picked virtio-net function’s modern virtio common-configuration window (resolved through virtio_transport::parse_modern_pci_transport_capabilities, mapped at the region’s first byte) as a writable selected-write DeviceMmio grant (stage_virtio_net_common_config). The userspace cloud-prod-nic-driver-userspace-features-ok-smoke service then drives the virtio device handshake from userspace – the authority delta from the kernel-side virtio-net-device-bringup proof, which drives the same sequence in the kernel.
- Authority delta: the handshake registers move from kernel-internal MMIO to a userspace driver over the existing DeviceMmio.read32/write32 path. The write admission (device_manager::stub::write_devicemmio_u32 under the feature) admits exactly four common-config registers – device_feature_select (0x00), driver_feature_select (0x08), driver_feature (0x0C), and device_status (0x14, written as a single byte) – each range-checked against the decoded BAR and kernel read-back-asserted (feature-register read-backs must echo the written value; device_status is left to the driver’s own re-read since the device may legitimately diverge). This is the same selected-write + range-check + read-back discipline the notify doorbell (notifyDoorbell @5) and the NVMe CC reset write (cloud_nvme_controller_reset_proof) already enforce – not a new write primitive.
- Fail-closed assertions: the shim drives reset -> ACKNOWLEDGE -> DRIVER -> read device features -> write the negotiated driver features (VIRTIO_F_VERSION_1 only) -> FEATURES_OK, re-reading device_status to confirm FEATURES_OK stuck, then proves a queue_desc (0x20) write fails closed (result=write-blocked register_write=blocked). The released cap fails closed on the next call. The headline cloudboot-evidence: nic-driver-userspace-features-ok <token> marker lands only after every assertion passes.
- capOS mapping: the handshake step of the Phase C userspace NIC driver relocation. Queue/vring and IRQ ownership stay kernel-owned: queue-address registers fail closed, so no buffer address is ever programmed (the userspace-ownable vring over the DMA-isolation track is the next capability below). The marker’s trailing labels (handshake=features-ok, queue_setup=not-attempted, queue_address_write=blocked, vring=not-owned, irq=not-owned, driver_ok=not-attempted, live_cloud=not-attempted) re-anchor those bounds. The feature is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, and cloud_virtio_net_device_bringup_proof.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-prod-nic-driver-userspace-features-ok. No GCE resources are created.
Production cloud-boot evidence marker (nic-driver-userspace-ownable-vring): the userspace-owned vring step of the Phase C NIC-driver relocation track. Under the focused-proof Cargo feature cloud_virtio_net_userspace_ownable_vring_proof (which implies the handshake feature cloud_virtio_net_userspace_features_ok_proof), the cap::devicemmio_grant_source_prod source stages the writable common-config window and cap::dmapool_grant_source_prod stages a bounce-buffer DMAPool on the same virtio-net function. The userspace cloud-prod-nic-driver-userspace-ownable-vring-smoke service drives the handshake to FEATURES_OK, then allocates and owns its own virtqueue rings.
- Authority delta: the queue-address-class registers move from kernel-internal MMIO (the virtio-net-tx-queue-materialization proof programs them in the kernel) to a userspace driver over the same DeviceMmio.write32 path. The write admission (device_manager::stub::write_devicemmio_u32 under the feature, admit_virtio_queue_address_write) admits queue_select (0x16) and queue_size (0x18) as range-checked pass-through selected writes, and the 64-bit queue_desc (0x20) / queue_driver (0x28) / queue_device (0x30) base registers via a token-resolve selected write: the driver writes the opaque per-buffer device-usable handle it learned from DMABuffer.info (deviceIova, scope bounce-handle), and the kernel resolves it against the live DMAPool grant ledger (resolve_virtio_vring_device_address) to the real bounce host-physical address, programs that address (never the handle, never an address the driver authored), and read-back-asserts. Reads of the queue-address base registers (0x20..0x38) are refused in read_devicemmio_u32, so the resolved host-physical address is never exposed to userspace (host_physical_user_visible stays 0). queue_enable stays fail-closed (it is armed by the queue-enable/DRIVER_OK capability below).
- Reuses landed DMA isolation: the ring pages are manager-owned DMAPool bounce buffers under the landed scrub-before-free / owner+slot generation / quiesce-before-release discipline (kernel/src/device_dma.rs); the no-host-physical-exposure posture (host_physical_user_visible=0, iova_export=disabled-future-only) is unchanged. This capability is wiring, not a new isolation backend. The opaque device-usable handle is a deterministic, non-address encoding of the buffer’s manager-owned identity under a fixed tag, so it can never collide with a page-aligned host-physical address and carries no host-physical information.
- Fail-closed assertions: the shim allocates its descriptor / available / used ring pages, programs each handle, then proves a queue-address read is refused and that an out-of-grant handle, a raw host-physical-looking value (0x40000000), and a stale (freed-buffer) handle each fail closed (result=write-blocked register_write=blocked) before any MMIO write. The released DeviceMmio cap fails closed on the next call. The headline cloudboot-evidence: nic-driver-userspace-ownable-vring <token> marker lands only after every assertion passes.
- capOS mapping: the userspace-owned vring step of the Phase C userspace NIC driver relocation. The marker’s trailing labels (vring=userspace-owned, queue_address_programming=token-resolved, host_physical_user_visible=0, provider_visible_queue_address=hidden, iova_export=disabled-future-only, out_of_grant=blocked, host_physical=blocked, stale_generation=blocked, queue_enable=not-attempted, driver_ok=not-attempted, irq=not-owned, live_cloud=not-attempted) re-anchor those bounds.
- QEMU-emulable vs hardware-only: fully QEMU-emulable (the bounce backend is the probe-selected default without a guest IOMMU). Proved locally by make run-cloud-prod-nic-driver-userspace-ownable-vring. No GCE resources are created.
Production cloud-boot evidence marker (nic-driver-userspace-queue-enable-driver-ok): the userspace queue-enable / DRIVER_OK step of the Phase C NIC-driver relocation track. Under the focused-proof Cargo feature cloud_virtio_net_userspace_queue_enable_driver_ok_proof (which implies the ownable-vring feature cloud_virtio_net_userspace_ownable_vring_proof), the userspace cloud-prod-nic-driver-userspace-queue-enable-driver-ok-smoke service drives the handshake to FEATURES_OK and programs its owned vring exactly as the ownable-vring capability does, then completes device bring-up from userspace: it arms its programmed TX queue and writes DRIVER_OK.
- Authority delta: two more writes join the handshake/ownable-vring selected-write admission, both under the same range-check + read-back discipline. (1) queue_enable (0x1c, u16): a range-checked pass-through selected write, admitted by device_manager::stub::write_devicemmio_u32 (admit_virtio_queue_address_write) only when the active queue’s vring memory is live and page-fitting (selected_queue_ready_to_enable): the kernel reads the active queue_desc/queue_driver/queue_device back kernel-side and requires each to currently hold the host-physical address of a live granted DMABuffer on this device (a freed buffer’s stale address no longer matches a live buffer, so it cannot arm a use-after-free DMA target), and requires the active queue_size to fit every split-ring structure (16*size desc table, 6+8*size used ring, 6+2*size avail ring) inside one granted bounce page. An enable of an unprogrammed, freed, or oversized queue fails closed before any MMIO side effect; the enable is read-back-asserted. Once a queue is enabled its vring base registers are immutable – a queue-address repoint (even with an otherwise-valid live token) is refused (devicemmio-queue-address-immutable-after-enable) so the driver cannot mutate the vring under a running device. (2) DRIVER_OK (a bit in device-status 0x14): the device-status register is already writable (from the handshake capability), but setting DRIVER_OK is kernel-asserted – the kernel re-reads device-status and fails closed (devicemmio-driver-ok-not-observed) unless the device latched the full ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK byte exactly (rejecting FAILED 0x80, DEVICE_NEEDS_RESET 0x40, and any reserved bit), so a userspace driver cannot claim a brought-up device the hardware did not accept.
- Reuses landed DMA isolation: this capability adds no new register write primitive, no new isolation backend, and no host-physical exposure. It reuses the ownable-vring bounce / DMAPool / DeviceMmio grants and writable window unchanged; queue-address reads (0x20..0x38) stay refused (host_physical_user_visible=0). The enable binds to live, page-fitting, post-enable-immutable queue memory, so the device is never armed at a freed, oversized, or mutated vring.
- Bounded residual (handled by the RX bring-up capability below): the enable’s live + page-fit check is point-in-time and matches by host-physical address, not buffer identity; it does not pin the ring buffers against freeBuffer / process-teardown release while the queue is enabled. Both are use-after-free-DMA hazards only once a descriptor is posted and the doorbell rung – which this capability never does (frame_tx=not-attempted; the RX queue is never enabled; the TX queue is kick-driven), so no device DMA is reachable here and DMA stays confined to the granted bounce pool. Buffer-identity binding and pinning are the data path’s responsibility (vring_buffer_pinning=deferred-slice-4); tracked by the userspace RX/DMA task records.
- Fail-closed assertions: the shim proves the ownable-vring out-of-grant / host-physical / stale (freed-throwaway-buffer) queue-address writes fail closed and an enable of the unprogrammed RX queue (index 0) fails closed, then arms the programmed TX queue (queue_enable=1, register_write=performed), sets DRIVER_OK and re-reads device-status to confirm the full brought-up byte, and proves a post-enable queue-address repoint (with an otherwise-valid live token) fails closed. The released DeviceMmio cap fails closed on the next call. The headline cloudboot-evidence: nic-driver-userspace-queue-enable-driver-ok <token> marker lands only after every assertion passes.
- capOS mapping: the userspace queue-enable / DRIVER_OK step of the Phase C userspace NIC driver relocation. The marker’s trailing labels (vring=userspace-owned, queue_enable=performed, unprogrammed_queue_enable=blocked, device_brought_up=driver-ok, status_full=0f, driver_ok=observed, vring_live_bound=enforced, queue_size_fits_grant=enforced, post_enable_immutable=blocked, host_physical_user_visible=0, provider_visible_queue_address=hidden, frame_tx=not-attempted, nic_cap=not-implemented, irq=not-owned, live_cloud=not-attempted) re-anchor those bounds. The Nic-cap TX/RX round-trip (no frame crosses the wire here) and userspace IRQ ownership are later capabilities below.
- QEMU-emulable vs hardware-only: fully QEMU-emulable (the bounce backend is the probe-selected default without a guest IOMMU). Proved locally by make run-cloud-prod-nic-driver-userspace-queue-enable-driver-ok. No GCE resources are created.
Production cloud-boot evidence marker (nic-driver-userspace-rx-bringup): the userspace RX-queue bring-up step of the Phase C NIC-driver relocation track. Under cloud_virtio_net_userspace_rx_bringup_proof (implies the queue-enable feature) the cloud-prod-nic-driver-userspace-rx-bringup-smoke service brings up the RX virtqueue (index 0) over its own vring – the handshake/vring/enable capabilities above brought up only the TX queue (index 1); the queue_enable admission is queue-agnostic, so RX bring-up reuses it.
- capOS mapping: the kernel (device_manager::stub) retains each programmed queue’s vring physes + originating DMABuffer handle identity on ProductionDeviceRecord (admit_virtio_queue_address_write), binds queue_enable to that identity (selected_queue_identity_bound; a freed/realloc’d handle fails closed with devicemmio-queue-enable-identity-mismatch), and pins the ring buffers against freeBuffer / process-teardown release while the queue is enabled (blocked_pinned_enabled_vring → dmabuffer-pinned-enabled-vring), released only on queue disable/reset with device quiesce. This closes the queue-enable capability’s pre-migration buffer-lifetime/identity residual at the bring-up boundary. Marker labels: rx_queue_brought_up=driver-ok, buffer_identity_bound=enforced, vring_buffer_pinning=enforced, pinning_free_while_enabled=blocked, int_injected=0, nic_cap=not-implemented, irq=not-owned, live_cloud=not-attempted.
- First real RX DMA: the same feature also drives the first real RX DMA from the shim-owned vring. The shim also brings up TX queue 1 over its own vring, posts one device-writable RX receive buffer on queue 0 (DMABuffer.submitDescriptor), and rings the production DeviceMmio.notifyDoorbell @5. capOS mapping: the kernel maps the notify region kernel-side and captures the per-queue notify slot offsets (cap::devicemmio_grant_source_prod), and provider_notify_doorbell_write_for_cap (was Err(stale_handle)) is now a live drive; the RX-DMA flow (cap::virtio_net_userspace_rx_dma_proof, byte-level vring helpers duplicated from cap::virtio_net_polled_provider to protect run-net) writes the RX descriptor + avail over the shim’s retained RX vring physes, rings the RX doorbell, submits a kernel-half ARP request over the shim’s retained TX vring physes, polls one real device->host completion (used_len > 0, observed EtherType 0x0806), resets the device (quiescing both queues and releasing the ring-buffer pins via mark_retained_vring_queue_disabled), and latches the used-ring index. Completion stays kernel-latched used-ring polled (int_injected=0, no Interrupt cap). Marker labels add tx_queue_brought_up=driver-ok, frame_rx=performed, rx_used_ring=kernel-latched. The kernel emits one virtio-net-userspace-rx-dma: rx_dma=performed ... used_len=<n> ethertype=0x0806 device_reset=ok queues_cleared=ok int_injected=0 evidence line.
- Not yet implemented: the deterministic freed-then-reallocated-frame identity negative (identity_realloc_negative=deferred-needs-allocator-reuse-seam). The capos-lib FrameBitmap is next-fit and free_frame does not rewind next_hint, so the allocation after a free never returns the just-freed frame; a deterministic same-phys realloc (needed to reach the buffer-identity gate rather than the host-physical gate) requires an allocator reuse seam. Tracked by cloud-prod-nic-driver-userspace-rx-dma-identity-realloc-negative-local-proof.
- Future work: the Nic-cap round-trip (the next capability below, unblocked by this data path) and userspace IRQ ownership.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-prod-nic-driver-userspace-rx-bringup. No GCE resources are created.
Production cloud-boot evidence marker (nic-driver-userspace-nic-cap-roundtrip): the Nic-cap round-trip step of the Phase C NIC-driver relocation track. It implements the handshake-step Nic interface stub as a live CapObject. Under the cloud_virtio_net_userspace_nic_cap_roundtrip_proof feature (implies the RX-bring-up feature) the cloud-prod-nic-driver-userspace-nic-cap-roundtrip-smoke service brings the device fully up from userspace (RX queue 0 + TX queue 1 enabled, DRIVER_OK), then holds a typed Nic cap and round-trips two sequential frames. capOS mapping:
- The new nic KernelCapSource (registered in capos-config manifest.rs
 - lib.rs::NIC_INTERFACE_ID + the nic @49 schema/capos.capnp KernelCapSource enum value; client NicClient in capos-rt) is granted from cap::nic_grant_source_prod, which maps the picked virtio-net function’s device-config window kernel-side for macAddress/linkStatus and binds the Nic cap to that BDF.
- transmit/receive drive the shim’s retained vring physes through cap::virtio_net_userspace_rx_dma_proof::{nic_transmit, nic_receive} (reusing the RX-bring-up byte-level vring helpers) with manager-owned kernel bounce payloads – not a shim-submitted DMABuffer – so a frame crosses the cap boundary as inline Data with host_physical_user_visible = 0 and no device-usable handle exported. receive drives the coupled ARP-TX-stimulus + RX-poll and returns the frame inline + observed EtherType; transmit stages a frame into a manager-owned TX page and rings notify_doorbell @5.
- The device is left live for the cap’s lifetime (a monotonic per-queue avail cursor lets transmit and receive compose without re-enabling) and quiesced once on cap release (nic_quiesce: device reset + queues-cleared assertion + mark_retained_vring_queue_disabled to release the enabled-vring pins). Completion stays kernel-latched used-ring polled (int_injected = 0, no Interrupt cap); no new selected-write register beyond the landed handshake / ownable-vring / queue-enable set. The kernel emits two virtio-net-userspace-nic-cap: receive ... used_len=<n> ethertype=0x0806 int_injected=0 host_physical_user_visible=0 evidence lines and a virtio-net-userspace-nic-cap: quiesce ... device_reset=ok queues_cleared=ok line on release. The proof also covers lifecycle ordering: a DMAPool cap release while ring buffers are still live records pending-buffer-release, an early release of one pinned ring DMABuffer records dmabuffer-pinned-enabled-vring, Nic quiesce replays that buffer detach after the queues are reset, and the pending parent pool release completes only after the remaining ring buffers are freed.
- Future work: the clean independent TX/RX split and userspace IRQ ownership (both later capabilities below).
- QEMU-emulable vs hardware-only: fully QEMU-emulable (the RX reply is QEMU SLIRP’s ARP answer to the kernel-half stimulus). Proved locally by make run-cloud-prod-nic-driver-userspace-nic-cap-roundtrip. No GCE resources are created.
Production cloud-boot evidence marker (nic-driver-userspace-irq-ownership): the userspace RX-interrupt-lifecycle ownership step of the Phase C NIC-driver relocation track. It gives the userspace NIC driver real RX-interrupt-lifecycle ownership. The Nic-cap round-trip capability above has int_injected = 0 and no Interrupt cap on the data path; this capability adds a real Interrupt cap whose wait/acknowledge/mask/unmask drive the route’s MSI-X vector-control + deferred LAPIC EOI (the frame bytes still arrive via Nic.receive’s used-ring read). Under the cloud_virtio_net_userspace_irq_ownership_proof feature (implies the nic-cap-roundtrip feature) the cloud-prod-nic-driver-userspace-irq-ownership-smoke service holds a DeviceMmio + DMAPool + Nic + Interrupt cap on the same virtio-net function. capOS mapping:
- A new Interrupt grant source (cap::virtio_net_userspace_irq_ownership_proof) replaces the admission-only interrupt_grant_source_prod source via the KernelCapSource::Interrupt arm under this feature. At boot it programs the staged virtio-net function’s RX MSI-X route (table entry 0) mask-first through the landed always-built cap::interrupt_programmed::program_attach_arm_unmask (route register / claim / MSI-X table map+write / device-manager attach / deferred-LAPIC-EOI arm / unmask) and tears it down (teardown) on cap release.
- The Interrupt cap’s methods are real for this device RX route: wait blocks on a real interrupt dispatch through the route’s MSI-X / LAPIC dispatch slot (device_interrupt::wait_kernel_injected_dispatch; delivery_count advances, so int_injected flips from 0 – the Nic-cap round-trip capability had no Interrupt cap on the data path at all). The wake is a bounded kernel-injected dispatch (not yet a device-autonomous raise causally tied to a frame), and Nic.receive still reads the frame bytes from the used ring, so the delta is IRQ-lifecycle ownership (real wait/acknowledge/mask/unmask), not interrupt-coalesced RX completion. acknowledge retires exactly one deferred LAPIC EOI through device_interrupt::acknowledge_deferred_lapic_eoi_for_route (hardwareDispatchAckDelta = 1, the one-ack-per-delivery / hardware_eoi_delta invariant); mask/unmask toggle the route’s own MSI-X vector-control bit (mask-first per PCI 3.0 §6.8.2: table-mask then route-state on mask; route-state then table-unmask on unmask) through pci::set_msix_table_entry_mask + device_interrupt::{mask,unmask}_device_manager_attached_route (driver-unmasked <-> claimed-masked).
- The driver brings the device up from userspace (the nic-cap-roundtrip bring-up verbatim), drives the owned RX-interrupt lifecycle (info/wait/acknowledge/mask/ unmask/release), and reads the completed frame back through Nic.receive (inline Data, host_physical_user_visible = 0). The PCI function-level MSI-X enable bit is not toggled and no device-autonomous raise is attempted (device_autonomous_raise=not-attempted, waiter_wake=kernel-injected-dispatch); the landed DMA isolation, the owned-vring grants, and buffer-identity / ring-buffer pinning are reused unchanged (queue-address reads still refused). No new Interrupt interface or method.
- Future work: the clean independent TX/RX split (the next capability below); the device-autonomous MSI-X raise (program the device RX queue_msix_vector + clear the PCI function mask) and the smoltcp network-stack relocation.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-prod-nic-driver-userspace-irq-ownership (one cloudboot-evidence: nic-driver-userspace-irq-ownership <token> marker). No GCE resources are created.
Production cloud-boot evidence marker (nic-driver-userspace-clean-tx-rx-split): the independent-TX/RX step of the Phase C NIC-driver relocation track. It decouples the last data-path coupling – the userspace NIC driver’s Nic.transmit and Nic.receive become truly independent. In the nic-cap-roundtrip / IRQ-ownership capabilities, Nic.receive (virtio_net_userspace_rx_dma_proof::nic_receive) self-stimulated by submitting a kernel-half ARP TX over the retained TX vring inside the same call. Under the cloud_virtio_net_userspace_clean_tx_rx_split_proof feature (implies the irq-ownership feature) the Nic cap’s receive @1 dispatches instead to nic_receive_independent. capOS mapping:
- nic_receive_independent posts a manager-owned device-writable RX buffer on the retained RX vring, rings the RX doorbell, waits on the driver’s OWNED RX interrupt route (the IRQ-ownership device_interrupt::wait_kernel_injected_dispatch dispatch slot, resolved through virtio_net_userspace_irq_ownership_proof::owned_rx_route; int_injected flips from 0), retires the deferred LAPIC EOI (acknowledge_deferred_lapic_eoi_for_route), then polls the RX used ring and reads the completed frame – with no internal ARP-TX self-stimulus (it never submits to the TX vring; the kernel diagnostic reports tx_submissions=0 self_stimulus=removed).
- The RX frame is driven by an external stimulus: the consumer’s preceding independent Nic.transmit of a real broadcast ARP request (who-has the QEMU SLIRP gateway 10.0.2.2). SLIRP answers; the inbound reply is held in the host net queue until receive posts the RX buffer + kicks the RX queue.
- Nic.transmit stays independent: it submits the caller’s frame to the TX vring and rings the TX doorbell with no RX involvement (the kernel diagnostic reports rx_polls=0 rx_submissions=0). Neither call performs the other’s submission.
- The wake stays the bounded kernel-injected dispatch the IRQ-ownership capability owns (waiter_wake=kernel-injected-dispatch, device_autonomous_raise=not-attempted). The landed owned-vring / owned-IRQ / DMA-isolation, the writable common-config window, and the buffer-identity / ring-buffer pinning are reused unchanged: no new selected-write register, no new MSI-X surface, no new Nic/Interrupt method, no host-physical / handle exposure (host_physical_user_visible = 0, queue-address reads refused).
- Follow-up work: DHCP/IPv4 configuration, legacy kernel socket-path retirement, kernel smoltcp / virtio-net hot-path removal, and the device-autonomous MSI-X raise. The 7c-ii(b) serve-from-userspace local proof is now landed.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-prod-nic-driver-userspace-clean-tx-rx-split (one cloudboot-evidence: nic-driver-userspace-clean-tx-rx-split <token> marker). No GCE resources are created.
Production cloud-boot evidence marker (nic-driver-userspace-sustained-receive-pool): Phase C slice 7d (DONE 2026-06-04) adds the sustained-receive Nic ABI the multi-frame TCP path (7c-iii) needs. The landed receive @1 is single-frame + reset-on-empty-poll; this adds a non-resetting poll over a kernel-owned bounce RX pool. Under the cloud_virtio_net_userspace_sustained_receive_pool_proof feature (implies the clean-split feature) the Nic cap serves receivePoll @4 (cap::nic_grant_source_prod -> virtio_net_userspace_rx_dma_proof::nic_receive_poll). capOS mapping:
- Arm. On first receivePoll the kernel allocates NIC_RX_POOL_SIZE manager-owned bounce RX frames (frame::alloc_frame_zeroed), posts one device-writable descriptor + avail entry per frame on the retained RX vring, publishes avail.idx, and rings the RX doorbell. The device masters only into these kernel-private pages; no host-physical or device-usable address is exported (host_physical_user_visible = 0).
- Drain one per poll. Each receivePoll re-kicks the RX doorbell (so QEMU flushes a queued inbound frame into an armed buffer during the MMIO VM exit) and reads the RX used ring. If it advanced, the kernel copies the frame out into the inline Data reply (bounded by the posted buffer length) and recycles that bounce slot.
- The per-buffer invariant replacing reset-before-reclaim. A bounce slot is re-exposed to the device only after its copy-out completes and its slot generation is bumped, with the slot scrubbed before the re-post – the production handle-epoch slot identity (docs/dma-isolation-design.md) applied at recycle granularity instead of device-reset granularity. The device is not reset per frame (device_reset=none); teardown (on_release via nic_quiesce, or an unprovable in-flight-DMA error) still quiesces (reset + queues cleared) and scrubs + frees the pool.
- No frame yet. If the used ring did not advance, receivePoll returns framePresent = false with no reset and the device stays armed (device_armed=true) – the cheap speculative poll a smoltcp phy::Device RX token needs. receive @1 semantics are unchanged.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-prod-nic-driver-userspace-sustained-receive-pool (one cloudboot-evidence: nic-driver-userspace-sustained-receive-pool <token> marker after draining more than one frame with at least one non-resetting empty poll). No GCE resources are created.
- Follow-up work: DHCP/IPv4 configuration consumes the served socket path; later cleanup removes or fixture-gates the legacy kernel socket path and kernel smoltcp / virtio-net hot path. The 7c-ii(b) production manifest proof now consumes the userspace-served TcpListenAuthority, TcpListener, and TcpSocket substrate locally.
Production cloud-boot evidence marker (network-stack-process-smoltcp-skeleton): Phase C slice 7a (first increment, DONE 2026-06-03) is the first time a real TCP/IP stack runs outside the kernel over the relocated NIC authority. A userspace network-stack process builds an smoltcp Interface (Ethernet medium, MAC from Nic.macAddress, static IPv4 10.0.2.15/24) over a phy::Device adapter whose RX/TX is the slice-6 independent Nic.receive/Nic.transmit, clocked by the Timer cap monotonic source (monotonic_ns). capOS mapping:
- The phy::Device adapter is buffered: outbound frames smoltcp produces queue in a process-local Vec that the poll loop drains and submits via Nic.transmit; one inbound frame fetched via Nic.receive is handed back for smoltcp to consume. The adapter holds no vring, DMA handle, or host-physical address – every frame is a process-local byte buffer crossing the cap boundary as inline Data through the manager-owned bounce page (host_physical_user_visible = 0).
- The proof is that smoltcp – not hand-rolled frame code – drives the exchange: a UDP datagram queued to the on-link gateway makes smoltcp emit an ARP request (out through Nic.transmit), the SLIRP ARP reply is consumed (in through Nic.receive, EtherType 0x0806), and – with the neighbour now resolved – smoltcp emits the queued IPv4/UDP datagram, so the neighbour cache observably advances (smoltcp_tx_arp>=1, smoltcp_rx_consumed>=1, smoltcp_tx_ipv4>=1). The internal smoltcp UDP socket is only an egress stimulus; no socket capability is exposed.
- Implementation note: the landed Nic cap rides on the userspace driver shim’s retained vring (the kernel does not own the vring), so the skeleton process performs the slice-1-6 bring-up itself before running smoltcp. Relocating the bring-up to a separate long-lived NIC-driver service is folded into the slice-7c contract relocation.
- Out of scope: the socket caps (slice 7b), the cap/network.rs contract relocation (slice 7c, virtio_stub.rs stays fail-closed), and the kernel smoltcp / virtio-net removal (slice 8).
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-prod-network-stack-process-smoltcp-skeleton (one cloudboot-evidence: network-stack-process-smoltcp-skeleton <token> marker). No GCE resources are created.
Production cloud-boot evidence marker (network-stack-smoltcp-socket-caps): Phase C slice 7b (DONE 2026-06-03) adds a userspace UdpSocket cap layer on top of the slice-7a substrate: the userspace network-stack process now implements the UdpSocket schema’s sendTo/recvFrom semantics (UdpSocketCapLayer) over the same smoltcp Interface and proves one bounded UDP request/response through it. capOS mapping:
- The socket layer drives the slice-7a phy::Device/Nic pump: sendTo resolves the destination’s on-link ARP (one Nic.receive for the guaranteed ARP reply, EtherType 0x0806) and transmits the datagram through Nic.transmit; recvFrom fetches the single solicited reply datagram through Nic.receive (EtherType 0x0800) and returns it. Frames stay process-local byte buffers (host_physical_user_visible = 0); a queue-address read stays refused.
- The request/response is a DNS A query for example.com to SLIRP’s built-in resolver at 10.0.2.3:53 (the same resolver the C posix-dns-resolver smoke uses); the decoded response is returned through recvFrom and the proof asserts source 10.0.2.3:53, the transaction-id/QR/RCODE correlation, and a decoded A record. The landed Nic.receive resets the device on an empty poll, so the proof only receives when a reply is guaranteed pending and spaces a Timer pre-delay before the datagram receive.
- Honest boundary: the socket layer is in-process – it implements the socket interface semantics over the userspace stack but does not yet serve them as inter-process transferable capabilities, and it does not touch the production kernel/src/cap/network.rs contract (virtio_stub.rs stays fail-closed). Preserving that contract behind a userspace network-stack service is slice 7c.
- QEMU-emulable vs hardware-only: fully QEMU-emulable (relies on SLIRP’s DNS forwarder). Proved locally by make run-cloud-prod-network-stack-smoltcp-socket-caps (one cloudboot-evidence: network-stack-smoltcp-socket-caps <token> marker). No GCE resources are created.
Production cloud-boot evidence marker (userspace-network-stack-smoltcp): Phase C slice 7c, first increment (DONE 2026-06-03) serves the slice-7b UdpSocketCapLayer as a real inter-process transferable capability. capOS mapping:
- A network-stack server process holds the bring-up caps plus an exported Endpoint; after bring-up it serves the UdpSocket schema (sendTo/recvFrom/close) over that endpoint, driving the same UdpSocketCapLayer on its own ring (decoding/encoding the capnp params and results). A separate client process holds only Console and the served cap; it re-interprets the imported Endpoint as a UdpSocket and drives one bounded DNS A query/response through the production UdpSocketClient.
- smoltcp still moves every frame through the Nic cap (ARP reply EtherType 0x0806 + DNS reply 0x0800 through Nic.receive); host_physical_user_visible = 0 is preserved and a queue-address read stays refused. On close the server releases its owned RX Interrupt (route_torn_down=ok).
- Honest boundary: the UdpSocket contract lives behind a userspace network-stack service. Later Phase C increments added the TcpListener / TcpSocket substrate, inter-process serving, and the local 7c-ii(b) serve-from-userspace manifest proof for TcpListenAuthority. DHCP/IPv4, Web UI L4, private GCE reachability, public ingress, and legacy kernel-socket cleanup remain separate work.
- QEMU-emulable vs hardware-only: fully QEMU-emulable (relies on SLIRP’s DNS forwarder). Proved locally by make run-cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc (one cloudboot-evidence: userspace-network-stack-smoltcp <token> marker). No GCE resources are created.
Production cloud-boot evidence marker (virtio-net-tx-authority-bundle): under the focused-proof Cargo feature cloud_virtio_net_tx_authority_bundle_proof, the cloudboot kernel layers a bundle observer (cap::virtio_net_tx_authority_bundle_proof) on top of the three existing production grant sources (devicemmio_grant_source_prod, dmapool_grant_source_prod, interrupt_grant_source_prod). Under the feature, the DeviceMmio source filters its PCI candidate to the same virtio/NVMe-class function the DMAPool and Interrupt sources already match, so all three grants land on the same virtio-net function. Exposed through make run-cloud-provider-virtio-net-tx-authority-bundle.
- Implemented wire-format subset: no new MMIO/DMA/IRQ writes. The bundle reuses the existing prod sources’ grant + per-cap on-release surfaces and asserts the bundle identity over their issue and release notifications via the record_devicemmio_grant/record_dmapool_grant/ record_interrupt_grant/record_devicemmio_release/ record_dmapool_release/record_interrupt_release hooks called from the existing build_cap_for_grant / on_release / release_cap paths.
- Fail-closed assertions: the userspace cloud-provider-virtio-net-tx-authority-bundle-smoke service calls info on each of the three caps and asserts they all report the same BDF. The kernel-side bundle observer records each grant’s (bdf, generation) identity at issue and at release; the headline cloudboot-evidence: virtio-net-tx-authority-bundle <token> marker is emitted only after all three caps have been issued and released and same_dm/same_dp/same_ir/same_bdf all evaluate true. A BDF mismatch logs virtio-net-tx-authority-bundle: assertion regression: ... and leaves the marker unprinted. Per-cap stale-handle fail-closed is inherited from the existing prod sources’ validate_*_record paths; the smoke re-tests it explicitly after each release.
- capOS mapping: bundle authority composition over the DeviceMmio + DMAPool + Interrupt grant arms; first child of the blocked cloud-prod-virtio-net-userspace-provider-tx-local-proof parent. No virtio queue setup, no descriptor publication, no notify doorbell, no PCI function-level MSI-X enable, no Interrupt.wait, no TX completion claim, no live cloud traffic. The marker’s trailing labels (same_bdf=true, queue_setup=not-attempted, tx_descriptor=not-published, notify=not-rung, msix_function_enable=not-toggled, tx_completion=not-claimed, live_cloud=not-attempted) re-anchor those bounds. The bundle feature is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, and cloud_virtio_net_device_bringup_proof at the cap::mod.rs activation site.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-tx-authority-bundle. No GCE resources are created.
Production cloud-boot evidence marker (virtio-net-tx-queue-materialization): under the focused-proof Cargo feature cloud_virtio_net_tx_queue_materialization_proof, the cloudboot kernel runs cap::virtio_net_tx_queue_materialization_proof (kernel/src/cap/virtio_net_tx_queue_materialization_proof.rs) over the same virtio-net function the authority bundle picks. The proof materializes one manager-owned TX virtqueue: it allocates three zeroed physical frames from the kernel frame allocator, programs the TX queue’s common-configuration QUEUE_DESC / QUEUE_DRIVER / QUEUE_DEVICE + QUEUE_ENABLE = 1, asserts the device read-backs match the manager-authored host-physical addresses, then writes 0 to device_status and asserts every TX queue-state register has cleared to 0. Exposed through make run-cloud-provider-virtio-net-tx-queue-materialization.
- Spec basis: virtio 1.2 §2.1.2 (reset clears all virtqueue state), §2.7 (split-ring queue layout), §4.1.4.3 (common configuration queue registers), §5.1.2 (virtio-net advertises receiveq1=0, transmitq1=1).
- Implemented wire-format subset: the proof drives the modern virtio status sequence through reset / ACK / DRIVER / feature select (VIRTIO_F_VERSION_1 only) / FEATURES_OK, asserts COMMON_NUM_QUEUES >= 2, writes COMMON_QUEUE_SELECT = 1 (TX), reads COMMON_QUEUE_SIZE, clamps to a power-of-two bound (MAX_QUEUE_SIZE = 256, so each region fits in one 4 KiB frame), allocates desc/avail/used frames through mem::frame::alloc_frame_zeroed, programs COMMON_QUEUE_DESC / COMMON_QUEUE_DRIVER / COMMON_QUEUE_DEVICE with the resulting host-physical addresses + COMMON_QUEUE_ENABLE = 1, reads every queue register back through the MmioRegion accessors (the proof grew a read_u64 companion to the existing write_u64) and asserts the values match, sets DRIVER_OK, then writes 0 to device_status and asserts post-reset COMMON_QUEUE_ENABLE / ..._DESC / ..._DRIVER / ..._DEVICE are all 0 after re-selecting queue 1. Token grammar: <seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.-len.<hex>-q.<index>-size.-desc.<hex>-drv.<hex>-dev.<hex>.
- Fail-closed assertions: five inline assertions gate the marker. (1) Initial reset reads back as 0. (2) Negotiated feature set matches exactly VIRTIO_F_VERSION_1. (3) Post-DRIVER_OK status reads back with ACK|DRIVER|FEATURES_OK|DRIVER_OK set and FAILED clear. (4) Programmed queue addresses + enable read back exactly as written. (5) Post-reset re-read of the TX queue state reports every queue-state register cleared to 0. The proof wraps the materialization so every exit path (success or any intermediate failure) writes 0 to device_status and frees every allocated frame back to the bitmap before returning. Per- stage outcomes log on the virtio-net-tx-queue-materialization: ok ... / ... failed closed: ... lines so a regression trips the boot log alongside the missing marker.
- capOS mapping: focused-proof child of the TX authority bundle that extends the proven bundle composition with one round of real common-configuration queue setup + reset cleanup. The same boot still spawns the cloud-provider-virtio-net-tx-authority-bundle-smoke userspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; the bundle observer compiles in (the picker filter is_bundle_candidate_class fires under either feature) so every grant + release identity still pairs up for the debug trail, but the bundle’s headline marker is intentionally suppressed under this feature because its queue_setup=not-attempted claim would be inaccurate now. The queue-materialization marker’s trailing labels (tx_descriptor=not-published, notify=not-rung, msix_function_enable=not-toggled, tx_completion=not-claimed, provider_visible_queue_address=hidden, iova_export=disabled-future-only, live_cloud=not-attempted) re-anchor the bounds the descendant slices (descriptor publication, notify doorbell, MSI-X function enable, userspace submit, used-ring polling, live cloud) carry. The cap::virtio_net_tx_queue_materialization_proof caller is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, and cloud_virtio_net_tx_authority_bundle_proof at the cap::mod.rs activation site.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-tx-queue-materialization. No GCE resources are created.
Production cloud-boot evidence marker (virtio-net-rx-queue-materialization): under the focused-proof Cargo feature cloud_virtio_net_rx_queue_materialization_proof, the cloudboot kernel runs cap::virtio_net_rx_queue_materialization_proof (kernel/src/cap/virtio_net_rx_queue_materialization_proof.rs) over the same virtio-net function the authority bundle picks. It is the structural mirror of the TX queue-materialization proof, one virtqueue index over: it materializes one manager-owned RX virtqueue (queue index 0) instead of the TX virtqueue (queue index 1). The proof allocates three zeroed physical frames from the kernel frame allocator, programs the RX queue’s common-configuration QUEUE_DESC / QUEUE_DRIVER / QUEUE_DEVICE + QUEUE_ENABLE = 1, asserts the device read-backs match the manager-authored host-physical addresses, then writes 0 to device_status and asserts every RX queue-state register has cleared to 0. Exposed through make run-cloud-provider-virtio-net-rx-queue-materialization.
- Spec basis: virtio 1.2 §2.1.2 (reset clears all virtqueue state), §2.7 (split-ring queue layout), §4.1.4.3 (common configuration queue registers), §5.1.2 (virtio-net advertises receiveq1=0, transmitq1=1).
- Implemented wire-format subset: identical to the TX queue-materialization proof except it writes COMMON_QUEUE_SELECT = 0 (RX) instead of 1 (TX) and re-selects queue 0 for the post-reset read-back. The proof drives the modern virtio status sequence through reset / ACK / DRIVER / feature select (VIRTIO_F_VERSION_1 only) / FEATURES_OK, asserts COMMON_NUM_QUEUES >= 2, reads COMMON_QUEUE_SIZE, clamps to a power-of-two bound (MAX_QUEUE_SIZE = 256, so each region fits in one 4 KiB frame), allocates desc/avail/used frames through mem::frame::alloc_frame_zeroed, programs COMMON_QUEUE_DESC / COMMON_QUEUE_DRIVER / COMMON_QUEUE_DEVICE + COMMON_QUEUE_ENABLE = 1, reads every queue register back through the MmioRegion accessors and asserts the values match, sets DRIVER_OK, then writes 0 to device_status and asserts post-reset COMMON_QUEUE_ENABLE / ..._DESC / ..._DRIVER / ..._DEVICE are all 0 after re-selecting queue 0. Token grammar: <seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.-len.<hex>-q.<index>-size.-desc.<hex>-drv.<hex>-dev.<hex> (with q.0 for RX).
- Fail-closed assertions: the same five inline assertions gate the marker as in the TX proof — initial reset reads back 0; negotiated feature set is exactly VIRTIO_F_VERSION_1; post-DRIVER_OK status has ACK|DRIVER|FEATURES_OK|DRIVER_OK set and FAILED clear; programmed queue addresses + enable read back exactly as written; post-reset re-read of the RX queue state reports every queue-state register cleared to 0. The proof wraps the materialization so every exit path (success or any intermediate failure) writes 0 to device_status and frees every allocated frame back to the bitmap before returning. Per-stage outcomes log on the virtio-net-rx-queue-materialization: ok ... / ... failed closed: ... lines.
- capOS mapping: focused-proof sibling of the TX queue-materialization proof that drives the same kernel-side queue setup + reset cleanup against the receive virtqueue. The same boot still spawns the cloud-provider-virtio-net-tx-authority-bundle-smoke userspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; the bundle observer compiles in through its shared cfg gate so every grant + release identity still pairs up for the debug trail, but the bundle’s headline marker is intentionally suppressed under this feature (it is gated on cloud_virtio_net_tx_authority_bundle_proof, which this feature does not enable) because its queue_setup=not-attempted claim would be inaccurate now. The RX-queue-materialization marker’s trailing labels (rx_buffer=not-posted, avail=not-published, notify=not-rung, rx_completion=not-claimed, msix_function_enable=not-toggled, provider_visible_queue_address=hidden, iova_export=disabled-future-only, device_autonomous_raise=not-claimed, live_cloud=not-attempted) re-anchor the bounds the descendant slices (receive-buffer post, avail publication, notify doorbell, used-ring consumption, MSI-X function enable, live cloud) carry. The cap::virtio_net_rx_queue_materialization_proof caller is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, cloud_virtio_net_tx_authority_bundle_proof, and every TX proof feature at the cap::mod.rs activation site.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-rx-queue-materialization. No GCE resources are created. Live-GCE RX stays under cloud-gcp-virtio-net-nic-driver.
Production cloud-boot evidence marker (virtio-net-rx-buffer-post): under the focused-proof Cargo feature cloud_virtio_net_rx_buffer_post_polled_completion_proof (which implies cloud_virtio_net_rx_queue_materialization_proof so the bundle observer
- production grant-source pickers + userspace bundle smoke keep their plumbing), the cloudboot kernel runs cap::virtio_net_rx_buffer_post_polled_completion_proof (kernel/src/cap/virtio_net_rx_buffer_post_polled_completion_proof.rs) over the same virtio-net function the authority bundle picks. It is the RX analogue of the TX submit-doorbell -> polled-completion progression: it materializes the RX virtqueue (queue index 0) AND the TX virtqueue (queue index 1, the SLIRP stimulus path), sets DRIVER_OK, posts ONE manager-owned device-writable receive buffer to the RX avail ring, rings the RX notify doorbell once, fills and TX-submits ONE broadcast ARP request as the SLIRP stimulus, rings the TX notify doorbell once, then polls the manager-owned RX used ring with a bounded spin budget until used.idx == 1 and asserts used[0].id == 0 and used[0].len > 0 — ONE real device->host RX DMA landed in the manager-owned bounce page. Exposed through make run-cloud-provider-virtio-net-rx-buffer-post.
- Spec basis: virtio 1.2 §2.1.2 (reset clears virtqueue state), §2.7 (split-ring queue layout), §2.7.6 (available ring: flags @0, idx @2, ring @4), §2.7.7 (VIRTQ_AVAIL_F_NO_INTERRUPT), §2.7.8 (used ring layout), §4.1.4.3 (common configuration queue registers), §4.1.5.2 (notify doorbell), §5.1.2 (virtio-net advertises receiveq1=0, transmitq1=1), §5.1.6 (12-byte modern virtio-net header).
- Implemented wire-format subset: stages 1-9 materialize the RX queue (index 0) and the TX queue (index 1) identically to the queue- materialization proof (modern status sequence to FEATURES_OK, VIRTIO_F_VERSION_1 only, COMMON_NUM_QUEUES >= 2, clamp COMMON_QUEUE_SIZE to a power of two <= MAX_QUEUE_SIZE = 256, allocate desc/avail/used frames per queue, program the per-queue registers + QUEUE_ENABLE = 1, read-back assert), then DRIVER_OK. The RX-DMA delta authors RX descriptor slot 0 over the HHDM (addr = rx_payload_phys, len = 2048, flags = VIRTQ_DESC_F_WRITE, next = 0), sets the RX avail ring (flags = VIRTQ_AVAIL_F_NO_INTERRUPT, ring[0] = 0, release fence, idx = 1), maps the modern notify region bounded to the smallest page covering both per-queue notify slots, and rings the RX-queue notify doorbell. The stimulus mirrors virtio.rs::write_arp_request_frame: it fills one TX payload frame with a broadcast ARP request for the SLIRP gateway IP (10.0.2.2), authors TX descriptor slot 0 (flags = 0, device-readable), sets the TX avail ring (also VIRTQ_AVAIL_F_NO_INTERRUPT), and rings the TX-queue notify doorbell. The proof then polls the RX used.idx with a bounded spin budget (POLL_USED_RING_BUDGET = 50_000_000, an order of magnitude above the in-kernel ARP_RX_POLL_LIMIT = 500_000) and reads the device-authored used[0].(id, len) plus the observed EtherType. Token grammar: <seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.-len.<hex>-q.<index>-size.-rxnotify.bar..off.<hex>.mult..addr.<hex>-rxdesc.<hex>-rxdrv.<hex>-rxdev.<hex>-rxpay.<hex>-rxlen.-availidx.-usedidx.-usedid.-usedlen.-ethertype.<hex>.
- Fail-closed assertions: the queue-materialization assertions gate both queue setups (initial reset reads 0; negotiated features exactly VIRTIO_F_VERSION_1; post-DRIVER_OK status has ACK|DRIVER|FEATURES_OK|DRIVER_OK set and FAILED clear; programmed queue addresses + enable read back exactly as written). The RX-DMA delta adds: the polled used.idx reaches 1 within the spin budget (else fail closed, no marker); the post-completion RX avail.idx reads back 1; used[0].id == 0 (the published descriptor head); used[0].len > 0 (a real device->host frame); and the post-reset re-read of both the RX and TX queue-state registers reports every register cleared to 0. The proof resets the device on every exit path (success or any intermediate failure) and frees the eight manager-owned frames (RX desc/avail/used, TX desc/avail/used, RX payload, TX payload) only after a confirmed reset read-back of 0; if reset cannot be confirmed the frames stay retained so the device cannot DMA into a freed page. Per-stage outcomes log on the virtio-net-rx-buffer-post: ok ... / ... failed closed: ... lines.
- capOS mapping: focused-proof child of the RX queue-materialization proof that drives the first real RX DMA on the production cloud kernel. Completion is polled only: MSI-X stays disabled, no MSI-X table entry is programmed, no Interrupt waiter is installed, no dispatch slot is claimed, and both avail rings carry VIRTQ_AVAIL_F_NO_INTERRUPT so the device does not raise a queue-completion interrupt. The same boot still spawns the cloud-provider-virtio-net-tx-authority-bundle-smoke userspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; both companion headline markers (virtio-net-tx-authority-bundle and virtio-net-rx-queue-materialization) are intentionally suppressed under this feature because this proof is the new headline owner. The marker’s trailing labels (rx_buffer=posted, avail=published, notify=rung-once, rx_completion=polled-used-ring, msix_rx_function_enable=not-toggled, msix_table_write=not-performed, device_autonomous_raise=not-claimed, provider_visible_queue_address=hidden, provider_rx_submit=kernel-proxy-bounded, iova_export=disabled-future-only, live_cloud=not-attempted) re-anchor the bounds the descendant slices (RX MSI-X wait/ack, provider-driven RX submit, live cloud) carry. The cap::virtio_net_rx_buffer_post_polled_completion_proof caller is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, cloud_virtio_net_tx_authority_bundle_proof, and every TX proof feature at the cap::mod.rs activation site (inherited through the implied RX-materialization feature’s compile_error!s).
- QEMU-emulable vs hardware-only: fully QEMU-emulable; the SLIRP -netdev user backend delivers the ARP reply that drives the RX DMA. Proved locally by make run-cloud-provider-virtio-net-rx-buffer-post. No GCE resources are created. Live-GCE RX stays under cloud-gcp-virtio-net-nic-driver.
Production cloud-boot evidence marker (virtio-net-msix-function-enable): under the focused-proof Cargo feature cloud_virtio_net_msix_function_enable_proof (which implies cloud_virtio_net_tx_queue_materialization_proof so the bundle observer
- production grant-source pickers + userspace bundle smoke keep their plumbing), the cloudboot kernel runs cap::virtio_net_msix_function_enable_proof (kernel/src/cap/virtio_net_msix_function_enable_proof.rs) over the same virtio-net function the authority bundle and queue-materialization proofs pick. The proof re-drives the modern virtio status sequence to DRIVER_OK, materializes one manager-owned TX virtqueue (identical to the queue-materialization proof), then walks the PCI MSI-X capability mask-first: it reads the Message Control register, writes FUNCTION_MASK = 1 first, reads back, writes ENABLE = 1 while keeping the function mask set, reads back, then cleans up by clearing both bits and reads back to assert PCI config-space MSI-X state is restored. Exposed through make run-cloud-provider-virtio-net-msix-function-enable.
- Spec basis: PCI SIG MSI-X §6.8.2 Message Control Register (bits 14 = Function Mask, 15 = MSI-X Enable); virtio 1.2 §2.1.2 (reset clears virtqueue state), §4.1.4.3 (common configuration queue registers), §5.1.2 (virtio-net advertises receiveq1=0, transmitq1=1).
- Implemented wire-format subset: stages 1-11 mirror the queue- materialization proof. Stages 12-15 are this proof’s delta: pci::interrupt_capabilities + MsixCapabilityInfo.offset resolve the capability header; pci::try_read_config_u16 reads the Message Control register at capability_offset + 0x02; the proof asserts the pre-state has MSIX_CONTROL_ENABLE clear, performs the mask-first write through pci::try_write_config_u16, reads back and asserts FUNCTION_MASK = 1, ENABLE = 0, performs the enable write keeping the mask, reads back and asserts both bits are set, then performs the cleanup write that clears both bits, reads back and asserts both are clear. The proof never programs an MSI-X table entry, never claims an interrupt-dispatch slot, and never raises a device-autonomous interrupt. Token grammar: <seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.-len.<hex>-q.<index>-size.-msix.cap.<hex>-msix.tsize.-pre.<hex>-mask.<hex>-en.<hex>-cleanup.<hex>.
- Fail-closed assertions: stages 1-11 inherit the queue- materialization proof’s five inline assertions. The MSI-X delta adds four more. (6) Pre-state read-back has MSIX_CONTROL_ENABLE clear. (7) Post-mask-write read-back has FUNCTION_MASK = 1 and ENABLE = 0. (8) Post-enable-write read-back has both bits set. (9) Post-cleanup-write read-back has both bits clear. Every exit path (success or any intermediate failure) runs a best-effort pci::try_write_config_u16 that clears MSIX_CONTROL_ENABLE and MSIX_CONTROL_FUNCTION_MASK regardless of the result chain, then writes 0 to device_status and frees every allocated queue frame. Per-stage outcomes log on the virtio-net-msix-function-enable: ok ... / ... failed closed: ... lines so a regression trips the boot log alongside the missing marker.
- capOS mapping: focused-proof child of the TX queue materialization proof that extends the same kernel-side activation surface with one round of canonical mask-first MSI-X function-level enable + cleanup. The same boot still spawns the cloud-provider-virtio-net-tx-authority-bundle-smoke userspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; both companion headline markers (virtio-net-tx-authority-bundle and virtio-net-tx-queue-materialization) are intentionally suppressed under this feature because their queue_setup=not-attempted / msix_function_enable=not-toggled claims would be inaccurate now, so the MSI-X function-enable marker is the sole headline. The marker’s trailing labels (tx_descriptor=not-published, notify=not-rung, msix_function_enable=toggled-mask-first, msix_function_enable_cleanup=cleared, msix_table_write=not-performed, device_autonomous_raise=not-claimed, tx_completion=not-claimed, provider_visible_queue_address=hidden, iova_export=disabled-future-only, live_cloud=not-attempted) re-anchor the bounds the descendant slices (interrupt-dispatch slot, descriptor publication, used-ring polling, live cloud) carry. The cap::virtio_net_msix_function_enable_proof caller is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, and cloud_virtio_net_tx_authority_bundle_proof at the cap::mod.rs activation site.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-msix-function-enable. No GCE resources are created.
Production cloud-boot evidence marker (virtio-net-tx-submit-doorbell): under the focused-proof Cargo feature cloud_virtio_net_tx_submit_doorbell_proof (which implies cloud_virtio_net_msix_function_enable_proof and transitively cloud_virtio_net_tx_queue_materialization_proof so the bundle observer + production grant-source pickers + userspace bundle smoke + mask-first MSI-X plumbing all keep firing), the cloudboot kernel runs cap::virtio_net_tx_submit_doorbell_proof (kernel/src/cap/virtio_net_tx_submit_doorbell_proof.rs) over the same virtio-net function the authority bundle, queue-materialization, and MSI-X function-enable proofs pick. The proof re-drives the modern virtio status sequence to DRIVER_OK, materializes one manager-owned TX virtqueue, enables MSI-X function-level control mask-first, then allocates one brokered TX payload frame and fills it kernel-half as a proxy for the userspace provider’s brokered fill, writes one TX descriptor at slot 0 of the descriptor ring, publishes one avail-ring entry and advances avail.idx to 1, maps the modern virtio notify region, rings the notify doorbell exactly once for the selected TX queue, reads the post-doorbell avail.idx and device-used.idx for visibility, then cleans up MSI-X mask-first, resets the device, and frees all four manager-owned frames. Exposed through make run-cloud-provider-virtio-net-tx-submit-doorbell.
- Spec basis: virtio 1.2 §2.7.6 (driver-area / available ring layout including idx at +2 and ring slots at +4), §2.7.8 (device-area / used ring layout including idx at +2), §4.1.4.4 (notify-cfg capability and per-queue notify address resolution as notify_bar_base + cap.bar_offset + queue_notify_off * notify_off_multiplier), §4.1.5.2 (modern virtio doorbell: u16 write of the queue index to the per-queue notify address), §5.1.6.2 (virtio-net TX descriptor layout). The submit ordering follows virtio 1.2 §2.7.13 (driver writes the descriptor head index to avail.ring[avail.idx % size], then bumps avail.idx after a suitable memory barrier).
- Implemented wire-format subset: stages 1-14 mirror the MSI-X function-enable proof (status sequence, queue materialization, mask-first MSI-X enable). Stages 15-21 are this proof’s submit /doorbell delta: frame::alloc_frame_zeroed allocates one payload frame, frame::hhdm_offset translates the manager-owned host- physical to a kernel virtual address for the kernel-proxy fill (a minimal 12-byte modern virtio-net header followed by an 8-byte b"CAPOSTX1" body, total payload length 20 bytes); slot 0 of the descriptor ring receives addr = payload_phys, len = 20, flags = 0, next = 0 over the HHDM write; avail.ring[0] = 0 and avail.idx = 1 over the HHDM with a compiler fence between them; the notify region is mapped through pci::map_bar_region and the kernel writes the queue index 1 as a u16 to notify_vaddr + queue_notify_off * notify_off_multiplier; avail.idx is read back and asserted as 1, and the device-written used.idx is read for visibility only. The proof never polls the used ring beyond the single visibility read, never claims a TX completion, never programs an MSI-X table entry, never raises a device-autonomous interrupt, never registers an Interrupt waiter, never performs direct DMA, never programs the IOMMU, and never exports a host-physical address or IOVA. Token grammar: <seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.-len.<hex>-q.<index>-size.-msix.cap.<hex>-msix.tsize.-pre.<hex>-mask.<hex>-en.<hex>-cleanup.<hex>-notify.bar..off.<hex>.mult..addr.<hex>-desc.<hex>-payload.<hex>-paylen.-availidx.-usedidx..
- Fail-closed assertions: stages 1-14 inherit the MSI-X function- enable proof’s nine inline assertions. The submit/doorbell delta adds three more. (10) Notify region length must be large enough to contain queue_notify_off * notify_off_multiplier + 2. (11) Notify-region map length must cover that minimum. (12) Post- doorbell avail.idx round-trip must read back as 1. Every exit path (success or any intermediate failure) runs the best-effort MSI-X cleanup, writes 0 to device_status, asserts every TX queue- state register cleared to 0, and frees all four manager-owned frames (descriptor, avail, used, payload) regardless of the result chain. The device-used.idx read is deliberately NOT asserted: QEMU may or may not have drained the descriptor by the time the kernel reads it, and the proof’s discipline says tx_completion=not- claimed regardless of the observed value. Per-stage outcomes log on the virtio-net-tx-submit-doorbell: ok ... / ... failed closed: ... lines so a regression trips the boot log alongside the missing marker.
- capOS mapping: focused-proof child of the MSI-X function-enable proof that extends the same kernel-side activation surface with one round of single-slot descriptor publish + single avail-ring entry + single notify doorbell ring, with no used-ring polling or completion claim. The same boot still spawns the cloud-provider-virtio-net-tx-authority-bundle-smoke userspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; all three companion headline markers (virtio-net-tx-authority-bundle, virtio-net-tx-queue-materialization, and virtio-net-msix-function-enable) are intentionally suppressed under this feature because their tx_descriptor=not-published / notify=not-rung / queue_setup=not-attempted / msix_function_enable=not-toggled claims would all be inaccurate now, so the submit/doorbell marker is the sole headline. The marker’s trailing labels (tx_descriptor=published, notify=rung-once, msix_function_enable=toggled-mask-first, msix_function_enable_cleanup=cleared, msix_table_write=not-performed, device_autonomous_raise=not-claimed, tx_completion=not-claimed, provider_visible_queue_address=hidden, provider_fill=kernel-proxy-bounded, iova_export=disabled-future-only, live_cloud=not-attempted) re-anchor the bounds the descendant slices (used-ring polling, provider waiter/ack, interrupt-dispatch slot claim, live cloud) carry. The cap::virtio_net_tx_submit_doorbell_proof caller is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, and cloud_virtio_net_tx_authority_bundle_proof at the cap::mod.rs activation site.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-tx-submit-doorbell. No GCE resources are created.
Kernel-half TX polled-completion proof (predecessor of virtio-net-userspace-provider): under the focused-proof Cargo feature cloud_virtio_net_tx_polled_completion_proof (which implies cloud_virtio_net_tx_submit_doorbell_proof and transitively cloud_virtio_net_msix_function_enable_proof / cloud_virtio_net_tx_queue_materialization_proof so every shared plumbing gate keeps firing), the cloudboot kernel runs cap::virtio_net_tx_polled_completion_proof (kernel/src/cap/virtio_net_tx_polled_completion_proof.rs) over the same virtio-net function the authority bundle, queue-materialization, MSI-X function-enable, and submit/doorbell proofs pick. The proof re-drives the modern virtio status sequence to DRIVER_OK, materializes one manager-owned TX virtqueue, enables MSI-X function-level control mask-first, allocates one brokered TX payload frame and fills it kernel-half as a proxy for the userspace provider’s brokered fill, publishes one TX descriptor + one avail-ring entry over the manager-owned ring frames, rings the notify doorbell exactly once for the selected TX queue, polls the device-authored used.idx from the manager-owned used-ring frame with a bounded retry budget until it reaches 1 (one consumed TX descriptor), reads the post-completion avail.idx and the device-authored used[0].(id, len), then cleans up MSI-X mask-first, resets the device, and frees all four manager-owned frames. The module is the predecessor of the userspace-submit polled-completion proof and is dropped from the compile set when the live-publish feature is on (the live-publish proof is the new headline owner of virtio-net-userspace-provider and exercises the same polled completion path through the userspace cap method instead of the kernel-half proxy).
- Spec basis: inherits the submit/doorbell proof’s basis (virtio 1.2 §2.7.6 / §2.7.8 / §4.1.4.4 / §4.1.5.2 / §5.1.6.2 / §2.7.13). The polled-completion delta uses §2.7.8 (used-ring layout: idx at +2 and 8-byte (id, len) slots at +4) for both the bounded used.idx poll and the device-authored used[0] slot read.
- Implemented wire-format subset: stages 1-20 mirror the submit/doorbell proof (status sequence, queue materialization, mask-first MSI-X enable, payload kernel-proxy fill, descriptor publish, avail bump, notify doorbell ring). Stages 21-23 are this proof’s polled-completion delta: the manager-owned used-ring used.idx HHDM read is wrapped in a bounded retry loop (with core::hint::spin_loop() between iterations) that converges on the target completion count 1, the post-completion avail.idx HHDM round-trip is asserted as 1, and the device-authored used[0].id / used[0].len are read with an Acquire compiler fence on the success path so the slot data is observed consistently with the used.idx bump. Token grammar: <seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.-len.<hex>-q.<index>-size.-msix.cap.<hex>-msix.tsize.-pre.<hex>-mask.<hex>-en.<hex>-cleanup.<hex>-notify.bar..off.<hex>.mult..addr.<hex>-desc.<hex>-payload.<hex>-paylen.-availidx.-usedidx.-polled.iter.-usedid.-usedlen..
- Fail-closed assertions: stages 1-20 inherit the submit/doorbell proof’s twelve inline assertions. The polled-completion delta adds three more. (13) The bounded used.idx poll must converge on the target 1 within the retry budget; budget exhaustion fails closed and reports the last observed value. (14) The post-completion avail.idx HHDM round-trip must still read back as 1. (15) The device-authored used[0].id must equal the published descriptor head index 0; used[0].len is recorded for visibility but is deliberately NOT asserted (virtio-net leaves len at 0 for TX device-readable chains, but the kernel does not gate the proof on that). Every exit path (success or any intermediate failure) runs the best-effort MSI-X cleanup, writes 0 to device_status, asserts every TX queue-state register cleared to 0, and frees all four manager-owned frames (descriptor, avail, used, payload) only after the final reset read-back is confirmed; if reset cannot be confirmed the frames stay retained rather than being returned while the device may still DMA them. Per-stage outcomes log on the virtio-net-userspace-provider: ok ... / ... failed closed: ... lines so a regression trips the boot log alongside the missing marker.
- capOS mapping: focused-proof child of the submit/doorbell proof that extends the same kernel-side activation surface with one round of bounded used.idx polling + one accounted completion + one device-authored used[0] slot read, paired with the userspace bundle smoke’s Interrupt cap handle-lifecycle discipline on the same MSI-X BDF (Interrupt.info round-trip identity assertion + release + post-release Interrupt.info fail-closed). That cap-side pairing covers cap-handle identity and post-release stale-handle rejection on the production Interrupt cap; it deliberately does NOT exercise Interrupt.wait/acknowledge, because the production InterruptCapProd::wait and InterruptCapProd::acknowledge paths are unimplemented in the non-qemu cloud kernel and fail closed (kernel/src/cap/interrupt_prod.rs). Real waiter/ack pairing on the virtio-net TX MSI-X route is deferred to a future child that either ports the cap::provider_cap_waiter_proof kernel-injected-dispatch + deferred-EOI discipline onto this route or programs an actual MSI-X table entry + dispatch slot. All four companion headline markers (virtio-net-tx-authority-bundle, virtio-net-tx-queue-materialization, virtio-net-msix-function-enable, and virtio-net-tx-submit-doorbell) are intentionally suppressed under this feature because their tx_descriptor=not-published / notify=not-rung / queue_setup=not-attempted / msix_function_enable=not-toggled / tx_completion=not-claimed claims would all be inaccurate now, so the polled-completion marker is the sole headline. The marker’s trailing labels (tx_descriptor=published, notify=rung-once, msix_function_enable=toggled-mask-first, msix_function_enable_cleanup=cleared, msix_table_write=not-performed, device_autonomous_raise=not-claimed, tx_completion=polled-used-ring, provider_visible_queue_address=hidden, provider_fill=kernel-proxy-bounded, iova_export=disabled-future-only, live_cloud=not-attempted) re-anchor the bounds the descendant slices (interrupt-dispatch slot claim, real Interrupt.wait waiter, live cloud) carry. The cap::virtio_net_tx_polled_completion_proof caller is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, and cloud_virtio_net_tx_authority_bundle_proof at the cap::mod.rs activation site. The marker keeps device_autonomous_raise=not-claimed because the proof never enables the per-vector MSI-X table entry, never registers an Interrupt.wait waiter, and observes the completion strictly through the device-authored used-ring update.
- QEMU-emulable vs hardware-only: predecessor-only. The active make run-cloud-provider-virtio-net headline target switched to the userspace-submit polled-completion proof below, which exercises the same polled-completion path through the userspace cap method and supersedes this kernel-half proxy.
Production cloud-boot evidence marker (virtio-net-userspace-provider): under the focused-proof Cargo feature cloud_virtio_net_tx_dmabuffer_live_publish_proof (which implies cloud_virtio_net_tx_polled_completion_proof and transitively the submit/doorbell, MSI-X function-enable, queue-materialization, and authority-bundle proofs so every shared plumbing gate stays compiled in), the cloudboot kernel runs cap::virtio_net_tx_dmabuffer_live_publish_proof (kernel/src/cap/virtio_net_tx_dmabuffer_live_publish_proof.rs) over the same virtio-net function the predecessor picks. Unlike the kernel-half polled-completion predecessor, the kernel-side proof here splits the work into two phases: at-boot init() stages the modern virtio status sequence + TX queue materialization + MSI-X mask-first enable + notify mapping, leaving the device in DRIVER_OK with MSI-X enabled-but-globally-masked; the per-call attempt_live_publish runs from the non-qemu device-manager stub’s validate_dmabuffer_submit_descriptor_admission when the userspace cloud-provider-virtio-net-tx-dmabuffer-live-publish-smoke service’s DMABuffer.submitDescriptor is admitted (queue == 1, descriptor_id == 0, length <= PAGE_SIZE, no user mapping live, no in-flight submit, kernel-known DmaBufferHandle). The cap method resolves the buffer’s host-physical bounce-buffer page, authors one TX descriptor + avail-ring entry over the manager-owned ring frames, rings the notify doorbell exactly once, polls the device-authored used.idx with the same bounded budget the polled-completion predecessor uses, reads used[0].(id, len), tears the device down (MSI-X mask-first cleanup + device reset + queue-state register read-back asserted to zero, three manager-owned queue frames freed), and emits one cloudboot-evidence: virtio-net-userspace-provider <token> headline marker with provider_fill=userspace-brokered-buffer anchoring the userspace-driven submit boundary. Exposed through make run-cloud-provider-virtio-net – the terminal local harness for the virtio-net userspace-provider chain.
- Spec basis: inherits the polled-completion proof’s basis (virtio 1.2 §2.7.6 / §2.7.8 / §4.1.4.4 / §4.1.5.2 / §5.1.6.2 / §2.7.13). The live-publish delta drives the same descriptor/avail/notify write sequence and the same bounded used.idx poll, but the descriptor’s addr field is the userspace-allocated DMABuffer’s host-physical bounce-buffer page resolved through the kernel DMA ledger, not a manager-allocated payload frame.
- Implemented wire-format subset: at-boot init() covers stages 1-14 of the polled-completion sequence (status sequence + queue materialization + mask-first MSI-X enable + notify mapping) and stashes the staged state. Per-call attempt_live_publish covers stages 15-24: descriptor publish (with desc[0].addr = payload_phys from the userspace DMABuffer), avail-ring entry + avail.idx bump with a release compiler fence, notify doorbell ring, used-ring used.idx bounded poll, used[0] slot read with an acquire compiler fence, MSI-X cleanup, device reset, queue-state register read-back, and queue-frame release. Token grammar adds pool..gen.-buf..gen.-payload.<hex>-paylen. so the manager-issued single-slot bounce-buffer pool’s slot/generation pair, the buffer’s slot/generation pair, and the resolved payload host-physical address are observable from the marker; the polled-completion marker’s desc.<hex> field is intentionally not in the live-publish marker because the per-call descriptor write happens after the marker emission window’s boundaries.
- Fail-closed assertions: stages 1-14 inherit the polled-completion proof’s assertions for status/queue/MSI-X bring-up. Stages 15-24 inherit its assertions for descriptor publish, doorbell, polled completion, MSI-X cleanup, and reset. The per-call admission gate adds five more, surfaced through the cap-side DmaBufferSubmitDescriptorAdmission shape: (1) queue != 1 fails closed with dmabuffer-tx-queue-required / non-tx-queue-rejected (RX is rejected explicitly; queue >= 2 trips the standard queue-out-of-range request gate). (2) descriptor_id != 0 fails closed with descriptor-id-out-of-range. (3) length > PAGE_SIZE fails closed with length-exceeds-buffer. (4) A live userspace VMA fails closed with dmabuffer-mapping-live (the cap-side block_submit_for_live_mapping short-circuit handles this before the device-manager runs; the stub defends in depth). (5) A second submitDescriptor on the same buffer without an intervening freeBuffer fails closed with dmabuffer-descriptor-already-inflight; the in-flight slot is dropped only when the parked-buffer record drops on freeBuffer. A post-freeBuffer submitDescriptor fails closed with the standard stale-handle error.
- capOS mapping: terminal local headline that flips the descriptor addr source from a manager-allocated kernel-proxy payload frame to the userspace-allocated DMABuffer’s host-physical bounce-buffer page resolved through the kernel DMA ledger. The marker’s trailing label provider_fill=userspace-brokered-buffer replaces the kernel-half polled-completion predecessor’s provider_fill=kernel-proxy-bounded to reflect the change. All four companion headline markers (virtio-net-tx-authority-bundle, virtio-net-tx-queue-materialization, virtio-net-msix-function-enable, and virtio-net-tx-submit-doorbell) are suppressed because the userspace-submit polled-completion proof is the new headline owner; the predecessor cap::virtio_net_tx_polled_completion_proof module is dropped from the compile set under this feature so its competing emission of the same virtio-net-userspace-provider marker cannot fire. The cap::virtio_net_tx_dmabuffer_live_publish_proof caller is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, and cloud_virtio_net_tx_authority_bundle_proof at the cap::mod.rs activation site. The marker keeps device_autonomous_raise=not-claimed, msix_table_write=not-performed, and live_cloud=not-attempted because the proof never enables the per-vector MSI-X table entry, never registers an Interrupt.wait waiter, and observes the completion strictly through the device-authored used-ring update.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net. No GCE resources are created.
MSI-X wait/ack (cap::virtio_net_tx_msix_wait_ack_proof): Carries the userspace-submit polled-completion delta one authority step further: same brokered TX submit boundary, but the userspace-observed completion event is the provider-cap-side wake from a kernel-injected dispatch on the bound virtio-net TX MSI-X route. Active under the non-qemu cloud kernel built with the Cargo feature cloud_virtio_net_tx_msix_wait_ack_proof (which implies cloud_virtio_net_tx_dmabuffer_live_publish_proof and its predecessors). The wait/ack proof’s at-boot init runs after the live-publish proof’s init: it registers + claims an MSI-X route on the same virtio-net BDF under the ManagerGrantSource owner, maps the MSI-X table BAR kernel-side, writes table entry PROOF_TABLE_ENTRY = 1 mask-first per PCI 3.0 §6.8.2, attaches the route to the device manager, arms the deferred-LAPIC-EOI gate, and unmasks the route + entry. The PCI function-level MSI-X enable bit stays set with the function mask still asserted (held by the live-publish proof’s mask-first toggle), so the virtio-net device cannot autonomously raise an interrupt on the bound route. The cloudboot manifest spawns the cloud-provider-virtio-net-tx-msix-wait-ack-smoke userspace service, which receives one Console + DeviceMmio + DMAPool + Interrupt bundle (the Interrupt source resolves through the wait/ack proof’s grant source, replacing interrupt_grant_source_prod under this feature), asserts the Interrupt.info identity + labels (bootstrap_grant=virtio-net-tx-msix-wait-ack-proof, wait=kernel-injected-dispatch-wait, acknowledge=kernel-injected-deferred-eoi-acknowledge), drives the same brokered DMABuffer.submitDescriptor chain the predecessor exercises, then calls Interrupt.wait (the cap’s invoke_wait runs device_interrupt::handle_lapic_delivery and returns one delivery with delivery_count_after == delivery_count_before + 1 plus one armed deferred LAPIC EOI), calls Interrupt.acknowledge (the cap retires the deferred LAPIC EOI through acknowledge_deferred_lapic_eoi_for_route, ack_delta == 1, pending_after == 0), frees the DMABuffer, and releases the Interrupt cap. The kernel-side on_release then runs the masked-no-wake + reassign + stale-handle assertion chain on the bound route (mirroring cap::provider_cap_waiter_proof’s discipline) and emits exactly one cloudboot-evidence: virtio-net-userspace-provider <token> headline marker combining the publish outcome (recorded in PUBLISH_OUTCOME by the predecessor’s attempt_live_publish when the feature is on) with the wait/ack delivery counts, the reassigned route generation, and the stale-handle / stale-token assertion booleans. The marker’s trailing labels differ from the polled-completion predecessor in two places: tx_completion=msix-wait-ack-injected replaces tx_completion=polled-used-ring (the userspace-observed completion event is the cap-waiter dispatch; the polled used-ring still runs kernel-side as defence-in-depth), and msix_table_write=performed-masked-first replaces msix_table_write=not-performed (the wait/ack proof’s init programmed one MSI-X table entry). All other discipline labels are preserved: device_autonomous_raise=not-claimed, provider_visible_queue_address=hidden, provider_fill=userspace-brokered-buffer, iova_export=disabled-future-only, live_cloud=not-attempted. The predecessor live-publish proof’s standalone marker emission is suppressed under this feature so the headline marker name cannot fire twice. The cap::virtio_net_tx_msix_wait_ack_proof activation site is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, and cloud_virtio_net_tx_authority_bundle_proof. Device-autonomous MSI-X delivery (programming the virtio queue’s queue_msix_vector for a hardware-raised TX completion interrupt and the broader production dispatch-slot proof), RX path, multi-queue operation, full NIC readiness, and any live-GCE evidence stay out of scope.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-tx-msix-wait-ack. No GCE resources are created.
RX MSI-X wait/ack (cap::virtio_net_rx_msix_wait_ack_proof): the RX analogue of the TX MSI-X wait/ack proof above. Active under the non-qemu cloud kernel built with the Cargo feature cloud_virtio_net_rx_msix_wait_ack_proof (which implies cloud_virtio_net_rx_buffer_post_polled_completion_proof and its predecessors). The RX completion is staged entirely kernel-side at boot: the RX buffer-post proof’s report() stages the device through DRIVER_OK, posts one manager-owned device-writable RX buffer, drives the ARP TX SLIRP stimulus, polls one real device->host RX DMA (used.idx == 1, used[0].len > 0), and – under this feature – additionally holds the PCI function-level MSI-X enable mask-first (hold_msix_function_enable_mask_first: FUNCTION_MASK = 1 then ENABLE = 1, held, not cleaned up) and records the publish outcome into the wait/ack proof’s PUBLISH_OUTCOME slot instead of emitting its standalone virtio-net-rx-buffer-post headline. The wait/ack proof’s at-boot init then drives the graduated always-built cap::interrupt_programmed::program_attach_arm_unmask over MSI-X table entry 0 (the RX queue’s per-queue config vector, virtio-pci §4.1.5.1.2) on the same virtio-net BDF under the ManagerGrantSource owner: register + claim + write table entry 0 mask-first per PCI 3.0 §6.8.2 + manager attach + deferred-LAPIC-EOI arm + route + entry unmask, tearing the route back down via teardown on any error or lost-init race. The device’s RX queue_msix_vector stays VIRTIO_MSI_NO_VECTOR and the function mask stays asserted, so the device cannot autonomously raise an interrupt on the bound route. The cloudboot manifest spawns the cloud-provider-virtio-net-rx-msix-wait-ack-smoke userspace service, which receives one Console + Interrupt bundle (the RX completion is staged kernel-side, so – unlike the TX wait/ack provider – it needs no DMAPool/DeviceMmio cap; the Interrupt source resolves through the wait/ack proof’s grant source, replacing interrupt_grant_source_prod under this feature), asserts the Interrupt.info identity + labels (bootstrap_grant=virtio-net-rx-msix-wait-ack-proof, wait=kernel-injected-dispatch-wait, acknowledge=kernel-injected-deferred-eoi-acknowledge), calls Interrupt.wait (the cap’s invoke_wait runs the graduated device_interrupt::wait_kernel_injected_dispatch and returns one delivery with delivery_count_after == delivery_count_before + 1 plus one armed deferred LAPIC EOI), calls Interrupt.acknowledge (the cap retires the deferred LAPIC EOI through acknowledge_deferred_lapic_eoi_for_route, ack_delta == 1, pending_after == 0), and releases the Interrupt cap. The kernel-side on_release then runs the masked-no-wake + reassign + stale-handle assertion chain on the bound RX route and emits exactly one cloudboot-evidence: virtio-net-userspace-provider <token> headline marker combining the RX publish outcome with the wait/ack delivery counts, the reassigned route generation, and the stale-handle / stale-token assertion booleans. The marker’s trailing labels differ from the RX buffer-post predecessor in three places: rx_completion=msix-wait-ack-injected replaces rx_completion=polled-used-ring (the userspace-observed completion event is the cap-waiter dispatch; the polled used-ring still runs kernel-side as defence-in-depth), msix_rx_function_enable=toggled-mask-first replaces msix_rx_function_enable=not-toggled (the staging now holds the function-level MSI-X enable mask-first), and msix_table_write=performed-masked-first replaces msix_table_write=not-performed (the wait/ack proof’s init programmed one MSI-X table entry). All other discipline labels are preserved: device_autonomous_raise=not-claimed, provider_visible_queue_address=hidden, provider_rx_submit=kernel-proxy-bounded, iova_export=disabled-future-only, live_cloud=not-attempted. The predecessor RX buffer-post proof’s standalone marker emission is suppressed under this feature so the headline marker name cannot fire twice. The cap::virtio_net_rx_msix_wait_ack_proof activation site is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, cloud_virtio_net_tx_authority_bundle_proof, and every TX/NVMe Interrupt-source proof feature. Device-autonomous RX MSI-X delivery (programming the virtio queue’s RX queue_msix_vector), provider-driven RX submit, multi-queue operation, full NIC readiness, and any live-GCE evidence stay out of scope.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-rx-msix-wait-ack. No GCE resources are created.
RX provider-driven buffer submit (cap::virtio_net_rx_userspace_submit_proof): the RX analogue of the TX DMABuffer live-publish proof, carried one authority step past the RX MSI-X wait/ack proof above. Active under the non-qemu cloud kernel built with the Cargo feature cloud_virtio_net_rx_userspace_submit_proof (which implies cloud_virtio_net_rx_buffer_post_polled_completion_proof and its predecessors). Unlike the RX MSI-X wait/ack proof, the RX receive buffer is no longer a manager-owned bounce page filled kernel-side: it is the userspace provider’s brokered DMABuffer, posted to the RX avail ring through DMABuffer.submitDescriptor(queue=0). The feature drops the RX buffer-post module’s at-boot kernel-proxy report(); instead this proof’s self-contained init stages the device (status sequence + RX queue 0 + TX queue 1 materialization + held mask-first MSI-X function enable + notify map), allocates NO RX payload frame, and programs the kernel-injected RX MSI-X route over table entry 0 through the graduated cap::interrupt_programmed::program_attach_arm_unmask surface (same as the wait/ack proof). The cloudboot manifest spawns the cloud-provider-virtio-net-rx-userspace-submit-smoke userspace service, which receives one Console + DeviceMmio + DMAPool + Interrupt bundle (the RX provider, unlike the kernel-proxy RX wait/ack provider, needs the DMAPool/DeviceMmio caps to allocate and submit its brokered DMABuffer). The provider asserts Interrupt.info identity + labels (bootstrap_grant=virtio-net-rx-userspace-submit-proof), allocates one brokered bounce-buffer DMABuffer (NOT mapped or written before submit – the device is the RX writer), and calls DMABuffer.submitDescriptor(queue=0, descriptor_id=0, length=2048). The non-qemu device-manager admission gate matches the parked bounce-buffer handle, validates the request shape (queue == 0, descriptor_id == 0, length <= PAGE_SIZE, no live user mapping, no in-flight submit), resolves the buffer’s kernel-known host-physical bounce-buffer page, and drives attempt_rx_submit: it authors the RX desc[0] = (provider_buffer_phys, length, flags=VIRTQ_DESC_F_WRITE, next=0) + avail-ring entry over the manager-owned RX ring frames, rings the RX notify doorbell once, drives the ARP TX SLIRP stimulus (kernel-half), polls one real device->host RX DMA (used.idx == 1, used[0].len > 0), reads the observed EtherType, resets the device, and frees the manager-owned RX/TX ring + TX payload frames. The provider then observes the completion through Interrupt.wait (kernel-injected dispatch, delivery_count_after == delivery_count_before + 1) and Interrupt.acknowledge (deferred LAPIC EOI retired, ack_delta == 1), re-maps its DMABuffer R/O and reads a non-zero received EtherType through its own mapping, unmaps, frees the buffer, and releases the Interrupt cap. The kernel-side on_release runs the masked-no-wake + reassign + stale-handle assertion chain on the bound RX route and emits exactly one cloudboot-evidence: virtio-net-userspace-provider <token> headline marker combining the RX publish outcome with the wait/ack delivery counts.
- Spec basis: inherits the RX buffer-post / MSI-X wait/ack basis (virtio 1.2 §2.7.6 / §2.7.8 / §4.1.5.2 / §5.1.6, virtio-pci §4.1.5.1.2, PCI 3.0 §6.8.2). The userspace-submit delta drives the same RX descriptor/avail/notify write sequence and the same bounded used.idx poll, but the descriptor’s addr field is the userspace-allocated DMABuffer’s host-physical bounce-buffer page resolved through the kernel DMA ledger, not a manager-allocated payload frame.
- Implemented wire-format subset: at-boot init() covers the status sequence + RX/TX queue materialization + mask-first MSI-X function enable + notify mapping + RX MSI-X route program. Per-call attempt_rx_submit covers the RX descriptor publish (desc[0].addr = payload_phys from the userspace DMABuffer, flags = VIRTQ_DESC_F_WRITE), avail-ring entry + avail.idx bump with a release compiler fence, RX notify doorbell ring, the ARP TX stimulus, the used-ring used.idx bounded poll, used[0] slot read with an acquire compiler fence, the observed EtherType read, device reset, queue-state register read-back, and queue-frame release. Token grammar replaces the wait/ack marker’s rxpay.<hex> field with pool..gen.-buf..gen.-payload.<hex>-rxlen. so the manager-issued single-slot bounce-buffer pool’s slot/generation pair, the buffer’s slot/generation pair, the resolved payload host-physical address, and the requested receive length are observable from the marker.
- Fail-closed assertions: inherits the RX buffer-post proof’s assertions for status/queue/MSI-X bring-up and the RX descriptor publish + ARP stimulus + polled completion + reset, and the wait/ack proof’s masked-no-wake + reassign + stale-handle / stale-token assertion chain. The per-call admission gate adds the DMABuffer.submitDescriptor request-shape checks surfaced through the cap-side DmaBufferSubmitDescriptorAdmission shape: queue != 0 fails closed with dmabuffer-rx-queue-required / non-rx-queue-rejected (TX is rejected explicitly; queue >= 2 trips the standard queue-out-of-range request gate), descriptor_id != 0 fails closed with descriptor-id-out-of-range, length > PAGE_SIZE fails closed with length-exceeds-buffer, a live userspace VMA fails closed with dmabuffer-mapping-live, and a second submitDescriptor without an intervening freeBuffer fails closed with dmabuffer-descriptor-already-inflight. The marker’s trailing labels flip provider_rx_submit from kernel-proxy-bounded to userspace-brokered-buffer and add host_physical_user_visible=0 / direct_dma=blocked; the RX device-write DMA discipline (device_autonomous_raise=not-claimed, provider_visible_queue_address=hidden, iova_export=disabled-future-only, live_cloud=not-attempted) is preserved. Teardown confirms a device reset BEFORE the provider’s freeBuffer scrubs/frees the brokered buffer page.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-rx-userspace-submit. No GCE resources are created. Device-autonomous RX MSI-X delivery (programming the virtio queue’s RX queue_msix_vector), multi-queue operation, full NIC readiness, and any live-GCE evidence stay out of scope.
RX production-IDT-dispatch waiter wake (cap::virtio_net_rx_production_idt_dispatch_proof): carries the RX userspace-submit proof one authority step further. Active under the non-qemu cloud kernel built with the Cargo feature cloud_virtio_net_rx_production_idt_dispatch_proof (which implies cloud_virtio_net_rx_userspace_submit_proof and its predecessors). The RX publish half – device staging, provider-submitted brokered receive buffer, SLIRP stimulus, one real device->host RX DMA, polled used.idx – is reused unchanged from the userspace-submit predecessor (its module is dropped and this proof becomes the new headline owner; the device-manager admission routes attempt_rx_submit here). The load-bearing change is the production IDT dispatch wiring: the non-qemu arch::x86_64::lapic::handle_device_interrupt arm previously discarded real device-MSI vectors with a bare eoi(), so a real interrupt-gate entry could never reach a deferred-EOI dispatch slot or wake an Interrupt.wait. This proof wires that arm to record an IDT handler entry and route the vector through device_interrupt::handle_lapic_delivery, honoring eoi_deferred (the deferred-EOI path owns the EOI write, retired by acknowledge) and keeping the bare eoi() fallback for unregistered/out-of-pool vectors. The Interrupt.wait cap method then fires ONE real INT $vector on the bound RX route’s vector (IF cleared – the syscall context runs IF-cleared by SFMASK design and INT n ignores IF; see the Fail-closed assertions bullet below) – graduating the qemu-only arch::lapic::inject_real_lapic_int_for_proof mechanic to this proof feature – so the waiter wakes through a real CPU interrupt-gate entry, not the synchronous device_interrupt::wait_kernel_injected_dispatch call every prior RX/cap-waiter proof used.
- Spec basis: Intel SDM Vol. 3 interrupt-gate semantics (an interrupt gate clears EFLAGS.IF on entry) and Vol. 2 INT n description (which ignores EFLAGS.IF); inherits the RX userspace-submit basis (virtio 1.2 §2.7.6 / §2.7.8 / §4.1.5.2 / §5.1.6, virtio-pci §4.1.5.1.2, PCI 3.0 §6.8.2) for the unchanged publish half.
- Implemented wire-format subset: identical to the userspace-submit proof for the publish half. The new surface is kernel interrupt-path wiring (no new device wire-format): the production handle_device_interrupt non-qemu arm (kernel/src/arch/x86_64/lapic.rs), a per-vector IDT handler-entry counter (device_interrupt::record_idt_handler_entry / idt_handler_entry_count), and the graduated inject_real_lapic_int_for_proof. The device’s RX queue_msix_vector stays VIRTIO_MSI_NO_VECTOR and the PCI function mask stays held; the INT is fired by this proof, NOT by the device.
- Fail-closed assertions: inherits the userspace-submit proof’s publish + masked-no-wake + reassign + stale-handle / stale-token chain, and adds: wait asserts delivery_count_after == delivery_count_before + 1, the per-vector IDT handler-entry count advanced by exactly one (idt_handler_observed), the real-delivery delta equals the IDT-entry delta (direct_dispatch_call_count_unchanged, i.e. no fallback synchronous dispatch was used), and one deferred LAPIC EOI is pending; the masked-route assertion now fires a real INT through the masked route and asserts NO delivery_count advance and NO deferred-EOI pending/ack change. The cap-dispatch syscall context runs with EFLAGS.IF cleared by SFMASK design (arch::x86_64::syscall) and INT n ignores IF, so int_fired_with_if is recorded as observed evidence only (false in this build) and is NOT a gating condition. The headline marker flips rx_completion to real-idt-interrupt-gate-wake and adds waiter_wake=real-idt-interrupt-gate, idt_dispatch=production-wired, plus the trailing -idthandler.1-directcall.1-iffired.<0|1>-maskedint.1 token booleans; device_autonomous_raise=not-claimed and live_cloud=not-attempted are preserved.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-rx-production-idt-dispatch. No GCE resources are created. Flipping the device’s RX queue_msix_vector + clearing the function mask so the DEVICE raises the MSI – reusing this proof’s now-proven production dispatch path – is now covered by the device-autonomous MSI-X proof below. Live-GCE RX evidence remains future work.
RX device-autonomous MSI-X delivery proof (cap::virtio_net_rx_device_autonomous_msix_proof): carries the production-IDT-dispatch proof one authority step further and proves a device-raised virtio-net RX MSI-X reaches the production IDT path under local QEMU/KVM. Active under the non-qemu cloud kernel built with cloud_virtio_net_rx_device_autonomous_msix_proof (which implies cloud_virtio_net_rx_userspace_submit_proof and reuses the same brokered RX publish path). The module enables PCI memory-space decoding and bus mastering, stages RX queue 0 and TX queue 1, enables MSI-X function-level control mask-first, programs RX queue 0 COMMON_QUEUE_MSIX_VECTOR = 0, programs MSI-X table entry 0 through cap::interrupt_programmed::program_attach_arm_unmask, clears the PCI function mask, and then submits one userspace-owned RX bounce buffer plus the ARP TX stimulus. The RX DMA succeeds (used[0].len > 0, observed EtherType 0x0806), proving the data path remains the same brokered provider path.
- Spec basis: virtio 1.2 §4.1.5.1.2 (modern per-queue MSI-X vector is the MSI-X table entry index), virtio 1.2 §2.7.6 / §2.7.8 / §5.1.6 for the RX descriptor/avail/used path, and PCI 3.0 §6.8.2 for MSI-X table entry and Message Control masking semantics.
- Implemented wire-format subset: the proof writes only the PCI COMMAND memory-space/bus-master bits, the RX queue config-vector selector, the same split-ring RX/TX descriptors and avail entries as the userspace-submit proof, one RX and one TX notify, one MSI-X table entry, and the PCI MSI-X function mask bit. It does not expose host-physical or IOVA addresses to userspace, does not program an IOMMU, and does not add multi-queue or full-NIC readiness.
- Proof assertions: make run-cloud-provider-virtio-net-rx-device-autonomous-msix now asserts pci_command=0x0107, one device-raised Interrupt.wait delivery on vector 0x50 with int_injected=0, delivery_count_before=0, delivery_count_after=1, idt_handler_observed=true, eoi_deferred=true, and one deferred-EOI Interrupt.acknowledge (ack_delta=1). The final cloudboot-evidence: virtio-net-userspace-provider marker includes pcicmd.0107, idthandler.1, directcall.1, devraise.1, intinjected.0, and rx_completion=device-autonomous-msix. Closeout validation also keeps the RX production-IDT dispatch, RX userspace-submit, provider cap-waiter, run-net, and default boot-smoke gates green under local QEMU.
- QEMU/KVM diagnosis: earlier bpftrace evidence showed QEMU reached msix_notify(vector=0) with an unmasked MSI-X entry and prepared 0xfee00000/0x50, but KVM did not accept vector 0x50. The missing precondition was explicit PCI COMMAND bus-master enablement in this proof path; after the proof enables memory-space decoding + bus mastering, local QEMU/KVM delivers the MSI-X to the guest IDT path.
RX polled used-ring completion (no injected dispatch) (cap::virtio_net_rx_polled_completion_proof): the first virtio-net proof whose RX completion signal is real driver progress, not an injected proxy. Active under the non-qemu cloud kernel built with the Cargo feature cloud_virtio_net_rx_polled_completion_proof (which implies cloud_virtio_net_rx_userspace_submit_proof and its predecessors). The RX publish half – device staging, provider-submitted brokered receive buffer, SLIRP stimulus, one real device->host RX DMA, polled used.idx – is reused unchanged from the userspace-submit predecessor (its module is dropped and this proof becomes the new headline owner; the device-manager admission routes attempt_rx_submit here). The load-bearing change is on the completion path: every prior virtio-net/cap-waiter proof signalled the Interrupt.wait completion through device_interrupt::wait_kernel_injected_dispatch (a kernel-side dispatch-slot proxy) or, in the IDT-dispatch proof, a fired INT $vector – neither produced by real driver progress. Here virtio_net_rx_polled_completion_proof::invoke_wait instead reports the completion from the already-latched polled used-ring state captured during attempt_rx_submit (the PublishedRx used_id == 0 / used_len > 0 / polled_used_idx >= POLL_TARGET_USED_IDX, latched from the predecessor’s reused poll_used_idx under its Acquire fence): there is NO wait_kernel_injected_dispatch call and NO inject_real_lapic_int_for_proof anywhere in the wait/ack path, and zero kernel-injected interrupts. invoke_acknowledge is a poll-confirmation no-op (no deferred LAPIC EOI to retire, since no interrupt was taken). The bound RX MSI-X route is still programmed at boot but is used ONLY by the release-time masked-no-wake/stale-handle assertion chain.
- Spec basis: virtio 1.2 §2.7.8 (used ring is a memory-visible structure the device advances) and §2.7.10 (the VIRTQ_AVAIL_F_NO_INTERRUPT driver flag the predecessor already sets, so the device performs no MSI either); inherits the RX userspace-submit basis (virtio 1.2 §2.7.6 / §4.1.5.2 / §5.1.6, virtio-pci §4.1.5.1.2, PCI 3.0 §6.8.2) for the unchanged publish half.
- Implemented wire-format subset: identical to the userspace-submit proof for the publish half; no new device wire-format and no new kernel interrupt-path wiring. The completion is a pure memory read of the latched used-ring state plus a device_interrupt::snapshot_dispatch_slot before/after delivery_count comparison.
- Fail-closed assertions: inherits the userspace-submit proof’s publish + masked-no-wake + reassign + stale-handle / stale-token chain, and replaces the wait/ack injection assertions with their polled inverse: wait asserts the latched used_id == 0 / used_len > 0 / polled_used_idx >= POLL_TARGET_USED_IDX (completion_observed) AND delivery_count_after == delivery_count_before (int_injected=0, no kernel dispatch advanced); acknowledge asserts no deferred LAPIC EOI was pending and none was retired (hardware_dispatch_ack_delta == 0, eoi_written=false); and on_release requires provider_observed_dispatch == 0 and provider_observed_ack == 0 (the inverse of the injected predecessor’s >= 1). The headline marker flips rx_completion to polled-used-ring and adds waiter_wake=polled-used-ring, int_injected=0, with the trailing -deliv.0-ack.0 token booleans; device_autonomous_raise=not-claimed and live_cloud=not-attempted are preserved.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-rx-polled-completion. No GCE resources are created. Graduating this polled provider off the per-proof feature onto the default system.cue cloudboot manifest, programming the device’s RX queue_msix_vector for device-autonomous delivery, and any live-GCE RX evidence are future work.
Polled RX+TX provider, always-built off the per-proof feature (cap::virtio_net_polled_provider): graduates the polled provider above into the production compile set. The module is always-built in the default non-qemu cloud kernel (cfg(not(feature = "qemu")), no cloud_*_proof feature), derived from cap::virtio_net_rx_polled_completion_proof with the proof gate removed and the feature-gated virtio_net_tx_authority_bundle_proof bundle-observer calls dropped (the per-grant identity is still recorded through the always-built hardware_audit cap-audit). The polled completion behaviour (read the latched poll_used_idx used-ring state in invoke_wait, no wait_kernel_injected_dispatch, no inject_real_lapic_int_for_proof, no-op invoke_acknowledge) is identical; only the activation switch changes from a Cargo feature to a manifest-observable condition. kernel::run_init calls virtio_net_polled_provider::init only when the booted manifest declares the cloud-provider-virtio-net-polled-provider-default-smoke binary, so on the literal system.cue, run-cloud-interrupt-grant, and every other default cloudboot manifest the provider is never staged (is_staged()==false) and is inert. The interrupt cap is granted through the unchanged production interrupt_grant_source_prod (no new KernelCapSource arm, no proof-only grant-source replacement); that source delegates its cap to virtio_net_polled_provider::build_cap_for_grant while the provider is staged, otherwise it keeps its admission-check-only skeleton. The always-built device_manager::stub submit-admission preview and accepted path admit RX queue 0 and route DMABuffer.submitDescriptor to virtio_net_polled_provider::attempt_rx_submit only while staged.
- Marker: emits a DISTINCT cloudboot-evidence: virtio-net-polled-provider <token> headline (vs the proof’s virtio-net-userspace-provider) so the two manifests are distinguishable, adding the labels provider_build=always-built-default-kernel, provider_feature_gate=none, and grant_source=production-despecialized to the polled-completion label set. The cap::provider_nic_bind_proof::report re-point landed (the provider-nic-bound marker above now fires from this provider’s real polled TX+RX completion via report_real_completion on the cloud-provider-nic-bound-real-polled-driver-smoke manifest); the literal system.cue fold is not yet implemented.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-polled-provider-default on the default non-qemu kernel with no cloud_*_proof feature. No GCE resources are created; live_cloud=not-attempted.
Polled provider teardown / stale-authority (clean cap-op-release) – the always-built polled provider now carries an asserted S.11.2 teardown/stale-authority chain over its DMA + MMIO + IRQ authority on the clean cap-op-release path, not only the IRQ route. When the dedicated teardown manifest is booted (run_init calls virtio_net_polled_provider::arm_teardown_report because the cloud-provider-virtio-net-polled-teardown-smoke binary is declared), the provider’s complete_after_release (kernel/src/cap/virtio_net_polled_provider.rs, run_teardown_assertions + emit_teardown_evidence) re-validates the brokered DMA + DeviceMmio authority the smoke released before the Interrupt cap and emits one combined cloudboot-evidence: virtio-net-polled-teardown <token> headline.
- Mechanisms reused: device_manager::validate_dmabuffer_record / validate_dmapool_record (stale DMA handle / stale pool-allocate rejected fail-closed), device_manager::last_bounce_page_release_evidence (the scrub-before-free / ledger-removed ordering stamped by detach_dmabuffer_record_for_cap_release), device_manager::validate_devicemmio_record over devicemmio_grant_source_prod::last_issued_handle_and_owner (the granted DeviceMmio cap’s record is detached on release, so a stale access fails closed), and the inherited device_interrupt masked-no-wake / reassign / stale-handle chain folded into the same marker.
- Marker labels: stale_dma_buffer_blocked=true, dma_page_scrubbed_before_free=true, dma_ledger_removed_after_scrub=true, stale_dma_pool_alloc_blocked=true, stale_mmio_blocked=true, mmio_handle_invalidated=true, masked_no_wake=true, reassign_generation_bumped=true, stale_token_wake_blocked=true, stale_route_handle_blocked=true, int_injected=0, host_physical_user_visible=0, direct_dma=blocked, iova_export=disabled-future-only. The marker is suppressed fail-closed if any leg regresses. The MMIO leg rides the granted DeviceMmio cap; the provider’s own pci::map_bar_region BAR mapping stays boot-only with no kernel invalidation API by design.
- Scope boundary: the clean cap-op-release marker carries driver_death_teardown=not-attempted-this-slice. The process-exit-under-active-authority teardown trigger is its own proof (see the next entry); it crosses the process-lifecycle authority boundary, and the single-shot provider (one Interrupt cap, single-shot attempt_rx_submit) cannot drive both teardown paths in one boot, so it has its own manifest/boot.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-polled-teardown on the default non-qemu kernel with no cloud_*_proof feature. live_cloud=not-attempted.
Polled provider DRIVER-DEATH / process-exit teardown – the always-built polled provider’s release-time teardown chain now also covers the process-exit-under-active-authority trigger. When the dedicated driver-death manifest is booted (run_init calls virtio_net_polled_provider::arm_driver_death_report because the cloud-provider-virtio-net-polled-driver-death-smoke binary is declared), the smoke drives the same real polled RX submit/wait/ack + RX read-back, scrubs+frees its DMABuffer, and then exits while still holding its DMAPool, DeviceMmio, and Interrupt caps. The kernel’s CapReleaseReason::ProcessExit cap-teardown reclaims all three in cap-table slot order (device_mmio then dmapool before the interrupt cap), so the provider’s complete_after_release process-exit arm (run_teardown_assertions + emit_driver_death_evidence) re-validates the now-stale DMA + DeviceMmio authority and runs the IRQ masked-no-wake / reassign / stale-handle chain over the route, emitting one cloudboot-evidence: virtio-net-polled-driver-death <token> headline.
- Mechanisms reused: identical to the clean cap-op-release entry above, but the DMAPool / DeviceMmio / Interrupt records are detached by the kernel’s ProcessExit cap-teardown rather than explicit Interrupt.release / DMAPool.release / DeviceMmio.release calls. The runtime-allocated DMABuffer cap is torn down AFTER the manifest-granted Interrupt cap in slot order, so its page is scrubbed+freed by the smoke’s freeBuffer before exit (the normal DMABuffer lifecycle); the buffer is then re-validated stale with its scrub-before-free ordering intact.
- Marker labels: same DMA / MMIO / IRQ / no-export discipline labels as the clean-release marker, plus driver_death_teardown=no-live-authority and release_path=process-exit. The marker is suppressed fail-closed if any leg regresses (polled_completion_clean, the DMA/MMIO stale re-validation, or the IRQ chain). The clean-release virtio-net-polled-teardown headline and the cap-op-release-gated virtio-net-polled-provider headline both stay absent on this manifest (the Interrupt cap is reclaimed via ProcessExit, not an explicit release).
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-polled-driver-death on the default non-qemu kernel with no cloud_*_proof feature. live_cloud=not-attempted.
Provider-chain closeout: the parent cloud-prod-virtio-net-userspace-provider-local-proof is closed by the decomposed child chain above and the legacy/transitional bind below. The local non-qemu cloudboot/QEMU evidence now includes modern TX/RX userspace-provider proofs, the always-built polled provider, real-polled-driver provider-nic-bound, clean-release/process-exit stale-authority proofs, and the legacy-polled path that later passed the real-GCE provider-nic-bound gate. This closeout does not claim L4 socket/smoltcp relocation, literal system.cue provider fold, reusable full-NIC/multiqueue readiness, or device-autonomous MSI-X delivery; those remain separate lanes.

4. Legacy / transitional virtio 0.9 PIO transport (cloud bind)

Everything above is the modern (virtio 1.x) transport: vendor capability windows in MMIO BARs + MSI-X. Real GCE presents the NIC as a legacy / transitional virtio 0.9 device instead (run 1780377997-281b, 2026-06-02): PCI 1af4:1000, no modern vendor capability windows, no usable MMIO memory BAR, legacy INTx, no MSI-X. The whole legacy virtio config block lives in a PIO (I/O space) BAR0 register window, which the modern transport discovery cannot represent, so the modern polled provider selects no candidate. This section maps the legacy PIO transport subset the kernel implements to bind that device. It has two parts: selection + brokered PIO config (kernel/src/cap/virtio_net_legacy_select_proof.rs) and the legacy single-PFN contiguous-queue polled TX/RX data path + provider-nic-bound (kernel/src/cap/virtio_net_legacy_datapath_proof.rs); both are implemented and proved locally.

Spec basis: Virtual I/O Device (VIRTIO) legacy interface — the pre-1.0 “Virtio PCI” I/O-BAR register layout (virtio 0.9.5 / the legacy appendix of the OASIS 1.x spec, §4.1.4.8 “Legacy Interfaces”). Cross-checked against QEMU hw/virtio/virtio-pci.c (virtio_pci_config_* legacy I/O ops) and the Linux virtio_pci_legacy driver.
Implemented wire-format subset (no-MSI-X legacy I/O register block, kernel/src/cap/virtio_net_legacy_select_proof.rs LEGACY_* offset constants): device features (0x00, u32 RO), guest/driver features (0x04, u32 RW), queue PFN (0x08, used by the data path), queue size (0x0c, u16 RO), queue select (0x0e, u16 RW), queue notify (0x10, u16 RW), device status (0x12, u8 RW), ISR status (0x13, u8 RO). The device-specific config (MAC, …) follows at 0x14 in the no-MSI-X layout (offsets shift by 4 only when MSI-X is enabled, which this polled path never does). Feature negotiation is the 32-bit legacy feature word: VIRTIO_F_VERSION_1 (a high feature bit) is unrepresentable and absent, and VIRTIO_NET_F_MAC (1 << 5) is required and acknowledged. The legacy ring uses the single-PFN contiguous virtqueue layout (descriptor table + avail ring + padding to VIRTIO_PCI_VRING_ALIGN (4096) + used ring, addressed by one page-frame number written to LEGACY_QUEUE_PFN as physical_address >> 12) — materialized by the data path (virtio_net_legacy_datapath_proof::materialize_queue).
Legacy single-PFN data path (kernel/src/cap/virtio_net_legacy_datapath_proof.rs): after the status + feature handshake, both virtqueues are materialized as single physically contiguous, page-aligned regions (frame::alloc_contiguous) whose descriptor/avail/used sub-addresses are computed from the contiguous base, so the in-ring desc/avail/used manipulation reuses the modern provider’s helpers (write_desc_slot_0, write_avail_*, poll_used_idx, read_used_ring_slot_0) unchanged — only the transport differs. The doorbell is a PIO write of the queue index to LEGACY_QUEUE_NOTIFY (no modern MMIO notify region). The virtio-net header is the 10-byte legacy header (no VIRTIO_NET_F_MRG_RXBUF; the modern path uses 12). The reset is polled with a bounded settle (real legacy hardware acknowledges reset asynchronously; QEMU clears synchronously) rather than a single-shot == 0. A real polled TX (queue 1) + RX (queue 0) completes by reading the used rings (no MSI-X route programmed, no interrupt taken or injected); the device is then reset clean and all DMA frames are freed.
Device-fixed queue size + contiguous-allocation bound (materialize_queue, MAX_LEGACY_QUEUE_SIZE, vring_layout): legacy virtio queue size is device-fixed and read-only (LEGACY_QUEUE_NUM, 0x0c) — the driver cannot shrink it, so it must materialize whatever single-PFN vring the device advertises. materialize_queue reads that size and rejects a zero, non-power-of-two, or over-bound value cleanly (the vring layout requires a power-of-two size; the bound caps the contiguous allocation), then sizes the contiguous region via vring_layout: 256 → 3 pages, 1024 → 8 pages, and the live GCE Andromeda virtio-net’s 4096-entry queue → ~28 pages per queue. MAX_LEGACY_QUEUE_SIZE is the virtio spec maximum (32768, a power of two), so the bound admits any spec-legal device-fixed size — including GCE’s 4096 — while still failing closed above it; an alloc_contiguous that cannot satisfy the request fails closed (no panic) on the existing alloc … contiguous frames … failed arm. QEMU’s legacy virtio-net advertises 256 by default and caps queue size at 1024 (VIRTQUEUE_MAX_SIZE), and locks tx_queue_size at 256 for the non-vhost SLIRP device, so the largest local shape is rx_queue_size=1024 (an 8-page RX vring, exercised by make run-cloud-provider-nic-bound-legacy-large-queue); the exact 4096-entry materialization is only verifiable on real GCE (the billable live-GCE run).
GCE-viable RX stimulus / completion (fill_dhcp_discover_legacy, read_device_mac, poll_rx_used_wall_clock): the TX stimulus is a broadcast DHCP DISCOVER sourced from the device’s real MAC (read from legacy device-config space at 0x14), not the modern path’s ARP “who-has 10.0.2.2” from a hardcoded spoofed source. This is required for a real cloud NIC: GCE’s Andromeda SDN enforces MAC/IP anti-spoofing (egress from a non-assigned MAC/IP is dropped), 10.0.2.2 does not exist on the VPC, and no responder answers an ARP-for-the-gateway. A legitimately-sourced DHCP DISCOVER is answered by both QEMU SLIRP’s built-in DHCP server and the GCE SDN DHCP responder, giving a real device->host RX frame. The completion model is accept-any inbound frame (any non-empty frame with a readable EtherType satisfies RX, so an ambient gateway ARP/RA on GCE counts too), polled against a wall-clock budget (monotonic_ns deadline, 5 s) rather than a fixed spin count sized for SLIRP’s instantaneous reply. Interrupts are masked during this boot-time proof, so the wall-clock budget relies on the TSC-calibrated clocksource (the QEMU and GCE case); a tick-derived clock is frozen here and a fixed iteration ceiling is the fail-closed backstop. The egress MAC is re-asserted non-zero / non-broadcast before the marker is emitted, and the marker token carries it (-srcmac.<12hex>).
Persistent legacy Nic-cap runtime (virtio_net_legacy_datapath_proof::legacy_nic_runtime, kernel feature cloud_gce_legacy_virtio_webui_serving_proof): unlike the one-shot proofs, this runtime brings the legacy device up once at boot and keeps it DRIVER_OK for the whole boot, backing the same typed Nic cap methods the modern shim path serves (transmit @0, macAddress @2, linkStatus @3, receivePoll @4; receive @1 fails closed). RX keeps a small posted buffer pool (RX_POOL_SIZE descriptors, recycled in place after each copy-out); receivePoll is non-blocking and compares the device-written used.idx against a consumed cursor (read_used_idx), so a frame burst that advances the index past cursor + 1 is drained one completion per call instead of being missed by an equality poll. TX publishes one frame at descriptor slot 0 and drains its completion with the same bounded advanced-past-cursor check; an unresolved or divergent completion (and any other ring-integrity violation on either queue) is a fatal error that tears the runtime down through a reset-confirmed fail-stop, after which every cap call fails closed. Frames cross the cap boundary as inline Data with the 10-byte legacy header added/stripped kernel-side; PIO, vring, and DMA-frame ownership stay kernel-side, and release quiesces the device (reset-confirmed before frames are freed). VIRTIO_NET_F_STATUS is not negotiated, so linkStatus reports assumed-up while the runtime is staged. This is the serving bridge the Phase C userspace network stack uses on the GCE NIC shape; proof make run-cloud-gce-legacy-virtio-webui-serving (host HTTP peer fetches the remote-session Web UI bundle through QEMU hostfwd over this datapath).
capOS mapping (brokered PIO config access): capOS device authority (DeviceMmio/DDF) is MMIO/memory-BAR based and there is no I/O-port capability. The legacy config window stays kernel-owned: the only sanctioned path to a device’s legacy I/O BAR is the bounds-checked pci::LegacyIoBar accessor (kernel/src/pci.rs), reached through pci::io_bar(device, bar). Every access is range-checked against the device’s claimed I/O BAR window and the 16-bit x86 port space, so a caller cannot reach a port outside the BAR; there is no ambient in/out authority and no port-I/O surface exposed to userspace. PCI I/O decoding is enabled per device via pci::enable_io_space_and_bus_master before any access.
Candidate gate (MSI-X not required): the legacy candidate selection (virtio_net_legacy_select_proof::pick_legacy_candidate) accepts a transitional virtio-net function (1af4:1000, network class) whose modern common-config window does not resolve and which exposes a usable I/O BAR0 — without requiring an MSI-X capability, because the polled data path does not depend on interrupt delivery. This is the deliberate relaxation of the modern gate (virtio_net_polled_provider::candidate_from_device, which requires both the modern transport and MSI-X).
Fail-closed rules: the brokered status handshake fails closed on any out-of-window access, any device-status regression, a missing VIRTIO_NET_F_MAC, a guest-feature write-back mismatch, or a zero queue-0 size. The data path additionally fails closed on a device-MAC read failure (out-of-window, all-zero, or broadcast), an out-of-window queue PFN/notify access, a PFN read-back mismatch, an advertised queue size that is zero or exceeds the materialization bound, a TX used-ring poll-budget exhaustion, an RX wall-clock-budget exhaustion, a used-ring id/len regression, a zero EtherType, or a final reset that does not settle to 0x00. On any such failure no provider-nic-bound marker is emitted (the gate stays null/fail-closed), and the device is reset and its DMA frames freed regardless of outcome. The completion is observed by polling the legacy used ring with no MSI-X route programmed and no interrupt taken or injected; the provider-nic-bound marker (provider_nic_bind_proof::report_real_completion_legacy) carries honest transport=legacy-pio-virtio-0.9, interrupt_model=polled-no-msix, and userspace_driver_authority=kernel-brokered-legacy-polled labels, and nothing is exported to userspace (host_physical_user_visible=0, direct_dma=blocked).
QEMU-emulable vs hardware-only: the legacy shape is QEMU-emulable via qemu-system-x86_64 -device virtio-net-pci,disable-modern=on,vectors=0 (legacy I/O BAR0, INTx, no MSI-X — the faithful GCE shape). Proved locally by make run-cloud-provider-virtio-net-legacy-select on the default non-qemu kernel: the kernel selects the legacy NIC over its I/O BAR0, runs the brokered device-status handshake + 32-bit feature read (observed device_features=0x79bf8064, VIRTIO_NET_F_MAC set) + queue-0 size read (256), and emits cloudboot-evidence: virtio-net-legacy-candidate-selected <token>. The data path is proved by make run-cloud-provider-nic-bound-legacy on the same device shape: the kernel reads the device’s real MAC (52:54:00:12:34:56 under QEMU), materializes the legacy single-PFN virtqueues, TX-submits a broadcast DHCP DISCOVER from that MAC, and completes a real polled RX against the wall-clock budget (observed src_mac=52:54:00:12:34:56, rx_used_len=600, ethertype=0x0800 — the SLIRP DHCP OFFER, an IPv4 frame — tx_used_idx=1, rx_used_idx=1, rx_clock_usable=true, final_status=0x00), then emits exactly one cloudboot-evidence: provider-nic-bound <token> sourced from that completion (token carries -ethertype.0800 and -srcmac.525400123456). The DHCP-discover-from-real-MAC stimulus is the GCE-viable path: GCE’s Andromeda SDN drops egress from a spoofed source MAC/IP and has no 10.0.2.2 ARP responder, so the modern path’s spoofed ARP-to-SLIRP-gateway stimulus would time out on a real NIC; a real-MAC DHCP DISCOVER is answered by both SLIRP and the GCE SDN. The follow-up billable real-GCE run cloud-prod-gce-billable-boot-real-polled-nic-bound passed on 2026-06-02 15:03 UTC through this legacy path: the live 1af4:1000 NIC bound at 00:04.0, materialized the 4096-entry RX/TX vrings, transmitted DHCP DISCOVER from the device MAC, received a real IPv4 frame (rx_used_len=532, ethertype=0x0800), and emitted provider-nic-bound from report_real_completion_legacy. This remains a bounded raw-frame bind proof, not L4 networking or a reusable userspace provider service.

kernel/src/virtio.rs – PCI transport discovery, split-ring transport, feature negotiation, framing.
kernel/src/cap/network.rs – accepted-socket cap state and the network capability surface.
docs/proposals/networking-proposal.md – the userspace network-stack move (Phase C) and the transitional-kernel status.
docs/dma-isolation-design.md – the DMA backend and isolation model the userspace successor binds into.

Keyboard shortcuts

capOS Documentation

virtio-net (modern PCI NIC)

1. Spec basis

2. Wire format (implemented subset)

3. capOS mapping

4. Legacy / transitional virtio 0.9 PIO transport (cloud bind)

Related