Phase C: Userspace virtio-net Driver Relocation
This is the L4 track opened by the
Network-Reachable Datapath Scope Decision
(Option A): raw-frame TX/RX reachability is the cloud milestone bar, and the L4
socket path – relocating smoltcp and the cap/network.rs socket caps out of
the cfg(qemu) kernel/src/virtio.rs into userspace processes
(networking-proposal Part 3, Phase C) – is a separate future track. This doc
designs that track and sequences its slices.
The first Phase C Web UI path remains IPv4-scoped: userspace L4 plus DHCP/IPv4 configuration, ARP, and the private/public GCE Web UI proofs. IPv6 is tracked as a separate network-stack capability lane in Networking and the hardware/cloud backlog. Phase C must preserve enough address-family shape for that lane, but lack of IPv6 does not block the first IPv4 GCE Web UI proof.
Cap-Surface Delta: What The Userspace Driver Needs
The current DeviceMmio / DMAPool / Interrupt cap surface does not yet host
the virtio-net driver in userspace as built, but the missing pieces are bounded
extensions of accepted patterns and a reuse of the landed production
DMA-isolation track – not new isolation built from scratch. Per-primitive
evidence:
DeviceMmiogives a read-only BAR window with one selected write today.DeviceMmio.mapreturns a read-only BAR page; rawwrite32is refused (register_write = "blocked"), with exactly one selected write permitted – the notify doorbell at@5(notify_doorbell,kernel/src/cap/device_mmio.rs). A driver must additionally write the virtio common-config window (device status, feature-select/feature, queue-select, queue-size, queue-address/queue-enable). The relocation adds these as further selected writes under the same accepted range-check + read-back discipline the notify doorbell already enforces (see “The Common-Config Window” below) – not a new write primitive.DMAPooldoes not yet export a device-usable address to this driver, but the export discipline is landed.DMAPoolgives one bounce page; the host-physical / device address is not exported in the bounce posture (host_physical_user_visible = false,direct_dma = "blocked",iova_export = "disabled-future-only",kernel/src/cap/dma_buffer.rs). The vring is kernel-owned today (kernel/src/cap/virtio_net_polled_provider.rs), so userspace does not yet place its own descriptors. The mechanism to let it do so safely – a manager-owned bounce buffer or a domain-scoped IOMMU IOVA, never a raw host-physical address – is already landed (the production DMA-isolation track; see “The Userspace-Ownable vring Slice” below); the slice-2 work wires it to the driver’s vring.Interruptis wait-only over a kernel-latched used ring.Interrupt.waitreads a kernel-latched used-ring index,acknowledgeis a no-op, and mask/unmask are refused. A real driver owns its IRQ lifecycle (mask, unmask, EOI ordering).
Classifying the virtio-net bring-up steps against what userspace can do today makes the gap concrete – almost every step is kernel-only:
| Bring-up step | Userspace-doable today? |
|---|---|
| Device reset (write status = 0) | No – needs writable status register |
| ACKNOWLEDGE / DRIVER status bits | No – needs writable status register |
| Feature negotiate (select + read + write + FEATURES_OK) | No – needs writable feature-select/feature + status |
| Queue program (queue-select, queue-size, queue-address, queue-enable) | No – needs writable common-config + a device-usable vring address |
| vring allocation (avail/used/descriptor tables) | No – vring is kernel-owned; no device-usable buffer address export |
| DRIVER_OK (write status) | No – needs writable status register |
| MSI-X program / vector assignment | No – kernel-owned |
| Submit + notify (ring doorbell) | Partial – the one selected doorbell write @5 exists |
| Poll used ring | Partial – via the kernel-latched index Interrupt.wait reads |
| Teardown (reset, scrub, release) | No – kernel-owned reset path |
The Nic ABI (Inline Data, Per the Proposal Draft)
This track keeps the networking-proposal Part 3 frame ABI: frames cross the cap
boundary as inline Data (transmit @0 (frame :Data), receive @1 () -> (frame :Data), networking-proposal:443). The kernel copies the frame into and
out of the manager-owned bounce buffer the polled provider already established,
so no host-physical address or device-usable buffer handle is exported to
userspace and host_physical_user_visible=0 is preserved. capOS method
convention adds the result/reason/sideEffect evidence triple and the
observed EtherType to the result:
interface Nic {
transmit @0 (frame :Data)
-> (result :Text, reason :Text, sideEffect :Text);
receive @1 ()
-> (frame :Data, observedEthertype :UInt16,
result :Text, reason :Text, sideEffect :Text);
macAddress @2 () -> (addr :Data);
linkStatus @3 () -> (up :Bool);
}
Why inline Data, not a zero-copy buffer handle. A DmaBuffer-handle
(zero-copy) ABI was considered – it would avoid the per-frame copy – but
rejected to keep the change small: it introduces a new buffer-ownership protocol
across the cap boundary (who allocates, who frees, lifetime versus the call) on
top of the security work this track already requires, for a copy cost that does
not matter at research scale. Inline Data matches the accepted proposal draft,
keeps the frame staging kernel-owned exactly as the polled provider proved, and
defers any zero-copy optimization to a later, separately justified slice.
The Common-Config Window (Selected-Write, No New Ruling)
Relocation writes the virtio common-config window from userspace through a
bounded, range-checked, selected-write path modeled on the existing single
selected write (notify_doorbell @5, kernel/src/cap/device_mmio.rs;
device_manager::provider_notify_doorbell_write_for_cap). This is the next
register in the accepted selected-write pattern, not a new security relaxation
requiring a ruling. DeviceMmio already refuses raw write32
(register_write = "blocked") and admits exactly the claimed selected write,
range-checked against the decoded BAR and followed by a kernel-asserted
read-back; the handshake registers are added to that same admission list under
the same discipline.
The bounded design:
- A selected-write common-config window: only the named virtio common-config registers needed for handshake (device status, feature-select, device-feature, driver-feature) are writable in slice 1, each range-checked against the claimed BAR.
- Read-back-assertion discipline: every selected write is followed by a read-back the kernel asserts, so a userspace driver cannot leave the device in an unverified state.
- Queue-address registers stay fail-closed in slice 1 and are admitted to the same selected-write list only in slice 2, where each programmed value must resolve to a device-usable address the writing driver was granted (the DMA-isolation discipline below decides), so userspace can never point the device at arbitrary physical memory.
No new ruling is pending: the project already decided this posture through the accepted selected-write discipline and the IOVA-export discipline below. Slice 1 is ready.
The Userspace-Ownable vring Slice (Reuses Landed DMA Isolation)
The expensive-sounding piece – slice 2, a userspace-ownable vring plus a device-usable buffer-address export – is wiring already-landed isolation to the driver’s vring, not building isolation from scratch. The two backends and the no-host-physical export discipline are all landed:
- Bounce-buffer path (production default on no-IOMMU shapes). The runtime
DMA-backend probe (
kernel/src/dma_backend.rs,select_and_report/probe_verified_usable_iommu) selects the labeled bounce-buffer fallback fail-closed when no usable guest IOMMU is verified, and the manager-owned bounce-bufferDMAPool/DMABufferlifecycle (kernel/src/device_dma.rs, scrub-before-free, owner/slot generations, quiesce-before-release) is the landed authority a driver uses for device-visible buffer memory (cloud-prod-dmapool-bounce-buffer-grant-proof). - IOMMU-IOVA path (graduates when the probe verifies usable hardware). The
Intel VT-d remapping path (
kernel/src/iommu.rs,cfg(qemu)today) programs per-device domains, maps manager-ownedDMAPoolpages, and exports only a domain-scoped IOVA – never a host-physical address (ddf-iommu-remapping-production-closeout,ddf-iommu-production-dmapool-ledger-integration,ddf-iommu-per-device-domain-granularity,ddf-iommu-production-revoke-teardown-hostile-smokes,ddf-real-dma-iommu-direct-path). - No-host-physical export discipline. The IOVA-export discipline
(
ddf-iommu-iova-export-discipline) and thehost_physical_user_visible = false/iova_export = "disabled-future-only"posture (kernel/src/cap/dma_buffer.rs) guarantee a driver receives only a device-usable address (bounce handle or domain-scoped IOVA), never a raw host physical address.
The accepted contract is docs/dma-isolation-design.md (“Cloud DMA Backend”
runtime-selection rule and the IOVA-export-discipline clause), and the S.11.2
hostile-smoke matrix is already enforced for both backends. The remaining slice-2
work is therefore to let the userspace driver allocate its vring through the
granted DMAPool (bounce, or IOVA-backed when the probe verifies usable
hardware), learn the device-usable address for each ring, and program those
addresses into the queue-address registers over the slice-1 writable window –
under the landed fail-closed / scrub / quiesce / revoke discipline. The
networking-proposal Phase C prerequisites table and S.11.2 are satisfied by
the landed track, not deferred to it.
Bounded Slice Chain
- Userspace status + feature handshake to FEATURES_OK over a writable
selected-write common-config window (the next register in the accepted
notify-doorbell selected-write discipline). [DONE 2026-06-02.] The
cap::devicemmio_grant_source_prodsource stages the virtio-net common-config window as a writable selected-writeDeviceMmiogrant (stage_virtio_net_common_config); the userspace shim drives the handshake overDeviceMmio.read32/write32, the write admission (device_manager::stub::write_devicemmio_u32) admits only the four handshake registers (range-checked + read-back-asserted) and refuses queue-address writes; the unimplementedNicstub is inschema/capos.capnp. Proofmake run-cloud-prod-nic-driver-userspace-features-ok. Task record:docs/tasks/done/2026-06-02/cloud-prod-nic-driver-userspace-features-ok-local-proof.md. - Userspace-ownable vring + device-usable address export – reuses the
landed production DMA isolation (bounce policy +
dma_backendprobe + IOMMU IOVA-export, S.11.2 already enforced); the work is wiring it to the driver’s vring. [DONE 2026-06-03.] Undercloud_virtio_net_userspace_ownable_vring_proof(implies slice 1) the userspace shim co-receives the writable common-configDeviceMmiogrant and a bounce-bufferDMAPoolgrant on the same virtio-net function; it allocates its descriptor / available / used ring pages, learns each buffer’s opaque device-usable handle fromDMABuffer.info(deviceIova, scopebounce-handle), and programsqueue_desc/queue_driver/queue_deviceover the slice-1 window.device_manager::stub::write_devicemmio_u32(admit_virtio_queue_address_write) resolves each handle against the liveDMAPoolgrant ledger (resolve_virtio_vring_device_address) to the real bounce host-physical address, programs that address (never the handle), and read-back-asserts; queue-address reads (0x20..0x38) are refused so the host-physical address is never exposed, and out-of-grant / host-physical / stale-generation writes fail closed.queue_enablestays fail-closed. Proofmake run-cloud-prod-nic-driver-userspace-ownable-vring. Task record:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-ownable-vring-local-proof.md. - Userspace queue-program + DRIVER_OK over the (now device-addressable)
vring. [DONE 2026-06-03.] Under
cloud_virtio_net_userspace_queue_enable_driver_ok_proof(implies slice 2) the userspace shim completes device bring-up: after slice 2’s queue-address programming it writesqueue_enable = 1(0x1c) for its programmed TX queue and setsDRIVER_OKover the already-writable device-status register.device_manager::stub::write_devicemmio_u32admits thequeue_enablewrite only when the active queue’s vring memory is live and page-fitting (selected_queue_ready_to_enable): it reads the activequeue_desc/queue_driver/queue_deviceback kernel-side and requires each to currently hold the host-physical address of a live grantedDMABuffer(a freed buffer’s stale address cannot arm a use-after-free DMA target), and requires the activequeue_sizeto fit every split-ring structure inside one granted bounce page; an enable of an unprogrammed, freed, or oversized queue fails closed, and the enable is read-back-asserted. Once enabled, the queue’s vring base registers are immutable – a queue-address repoint is refused so the driver cannot mutate the vring under a running device. TheDRIVER_OKdevice-status write is kernel-asserted: the kernel re-reads device-status and fails closed unless the device latched theACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OKbyte exactly (rejectingFAILEDandDEVICE_NEEDS_RESET). Queue-address reads stay refused; no host-physical is exposed; no new DMA isolation backend. Proofmake run-cloud-prod-nic-driver-userspace-queue-enable-driver-ok. Task record:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-queue-enable-driver-ok-local-proof.md. 4a. Userspace RX queue 0 bring-up + buffer-identity binding + ring-buffer pinning, then the first real RX DMA from the shim-owned vring. Split into two landed/ready sub-slices because the real-DMA hybrid bring-up is large:- 4a-i [DONE 2026-06-03]. The shim brings up RX queue 0 over its own
vring (slices 1-3 brought up only the TX queue; the
queue_enableadmission is queue-agnostic).device_manager::stubretains each programmed queue’s vring physes + originatingDMABufferhandle identity onProductionDeviceRecord(admit_virtio_queue_address_write), bindsqueue_enableto that identity (a freed buffer’s stale handle, or a freed-then-reallocated frame at the same host-physical address, fails closed withdevicemmio-queue-enable-identity-mismatch), and pins the ring buffers againstfreeBuffer/ process-teardown release while the queue is enabled (dmabuffer-pinned-enabled-vring), releasing only on disable/reset with quiesce. This completes the vring buffer-lifetime binding slice 3 left point-in-time at the bring-up boundary. No device DMA. Proofmake run-cloud-prod-nic-driver-userspace-rx-bringup. Task record:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-bringup-and-buffer-pinning-local-proof.md. - 4a-ii [DONE 2026-06-03]. The first real RX DMA from the shim vring:
the shim also brings up TX queue 1 over its own vring, posts one
device-writable RX receive buffer on queue 0 (
DMABuffer.submitDescriptor), and rings the productionDeviceMmio.notifyDoorbell @5(the previouslyErr(stale_handle)provider_notify_doorbell_write_for_cap, now live;cap::devicemmio_grant_source_prodmaps the notify region kernel-side and captures the per-queue notify slot offsets). The kernel (cap::virtio_net_userspace_rx_dma_proof) drives the RX publish over the shim’s retained RX physes + a kernel-half SLIRP TX ARP stimulus over the shim’s retained TX physes + one real device->host RX DMA (used_len > 0, observed EtherType0x0806), latches the used-ring index (int_injected = 0, noInterruptcap), and resets the device – quiescing the queues and releasing the ring-buffer pins – WITHOUT theNiccap. Proofmake run-cloud-prod-nic-driver-userspace-rx-bringup(extended). The deterministic freed-then-reallocated-frame identity negative is split to a follow-up: the next-fit frame allocator (capos-libFrameBitmap,free_framedoes not rewindnext_hint) never returns a just-freed frame on the next allocation, so a deterministic same-phys realloc – needed to reach the slice-4a identity gate rather than the slice-3 phys gate – requires an allocator reuse seam. The data path is cooperative-shim-safe (kernel authors the descriptor + avail inside the drive window, re-validates the posted buffer live + unmapped before publishing, resets on every post-publish path); the HOSTILE-shim residuals (kernel-exclusive/unmappable enabled vring rings + reset-failure payload quarantine) are closed in 4a-iv; the identity negative is closed in 4a-iii. Task records:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-local-proof.md; follow-upsdocs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-identity-realloc-negative-local-proof.md,docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-hostile-shim-hardening-local-proof.md. - 4a-iv [DONE 2026-06-03]. HOSTILE-shim hardening over the 4a-ii data
path. The shim owns its vring ring buffers (granted bounce DMABuffers it can
DMABuffer.map), so the kernel cannot trust ring contents it did not author and lock. The closed gaps: (1) a programmed/enabled vring ring buffer is made kernel-exclusive –validate_dmabuffer_map_admissionrefuses a NEWDMABuffer.mapof it while the queue is enabled,selected_queue_identity_boundrefusesqueue_enablewhile a ring buffer still carries a live user mapping (a kept pre-enable VMA) and refuses arming a queue whose ring buffers are not pairwise-distinct pages – both within the queue and across the device’s other enabled queues (an aliased desc/driver/device, orrx.desc == tx.desc, would let the kernel-authored ring writes corrupt a descriptor into a non-bounce DMA target), andrecord_dmabuffer_user_mappingrefuses to record a mapping on a programmed/enabled ring (an in-flight SMP map), and the RX-DMA submit/drive admission refuses to publish an enabled ring buffer as the RX DMA payload (the shim cannot point the device at its own ring page); (2) atqueue_enablethe kernel WIPES the queue’s descriptor table slot 0, available ring, AND used ring (virtio_net_userspace_rx_dma_proof::sanitize_enabled_queue_rings), so a shim that pre-wroteavail.idx/ a tampered descriptor / a spoofedused.idxwhile the queue was disabled cannot pre-publish it into the enabled window – the device sees an empty queue with no pre-staged completion until the kernel-authored drive publishes and a real device DMA advancesused.idx; (3) a per-bdf RX-DMA payload drive pin / reset-failure quarantine (device_manager::stubbegin_rx_dma_drive_pin/clear_rx_dma_drive_pin/mark_rx_dma_payload_quarantine_permanent, consulted by themapadmission + the record path + thefreeBuffer/teardown detach) is set atomically with the live+unmapped re-validation under the device-table lock for the drive duration, cleared after a confirmed reset, and promoted to a permanent quarantine on the catastrophic reset-failure path (never downgraded by a later drive). The smoke (make run-cloud-prod-nic-driver-userspace-rx-bringup, extended) proves a still-mapped ring blocksqueue_enable, the post-enable map refusal on the descriptor + driver rings, and that a hostile pre-enable descriptor/avail.idxtamper does NOT survive (RX DMA still completes withused_id=0, ARP EtherType). The drive pin’s SMP map/free race and the reset-failure quarantine are structural fail-closed hardening: the single-CPU cloudboot proof cannot reach the race and QEMU virtio reset always succeeds, so they are not separately QEMU-observable. Residual [CLOSED 2026-06-03 in 4a-v]: two SMP-only, bounce-confined races betweenDMABuffer.mapand a device-authority state transition – the cap-sidemap_page_into_userinstalled the user PTE before the manager-side record (a manager-rejected map left a transient provisional PTE), and thequeue_enableno-live-mapping check was not atomic with the retainedenabledflag flip (a concurrent map could record a mapping in the enable window) – were not reachable by the single-CPU proof and are closed by slice 4a-v below. Predecessor task record:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-hostile-shim-hardening-local-proof.md. - 4a-iii [DONE 2026-06-03]. The deterministic
freed-then-reallocated-frame
queue_enableidentity negative carved out of 4a-ii. A proof-only one-shot bounce-frame reuse seam (mem::frame::proof_try_alloc_specific_frame_zeroedconsumed by adevice_dmareuse hint armed on each bounce free, both gated behindcloud_virtio_net_userspace_rx_bringup_proof, never compiled into production) makes a same-host-physicalDMABufferrealloc reachable from the userspace harness despite the production next-fitFrameBitmap. The smoke programs a transient ring buffer into rxqueue_desc, frees it, relands a fresh-handle buffer on the same frame, and provesqueue_enablefails closed on the slice-4a identity gate (authority_result = devicemmio-queue-enable-identity-mismatch) – the recorded phys is live again so the slice-3 phys gate passes, and the marker flipsidentity_realloc_negative=enforced. The seam does not relax the next-fit policy or the identity gate. Proofmake run-cloud-prod-nic-driver-userspace-rx-bringup(extended). Task record:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-identity-realloc-negative-local-proof.md. - 4a-v [DONE 2026-06-03]. Closes the two SMP-only, bounce-confined
DMABuffer.map-vs-device-authority races split out of 4a-iv. (1)kernel/src/cap/dma_buffer.rs::map_page_into_usernow takes the manager-side mapping record (record_dmabuffer_user_mapping, which acquires the device-table lock) BEFORE installing the user PTE, with the caller address-space lock held across the manager call (lock order address-space -> device-table -> quarantine, no reverse nesting): a manager-rejected map returns before any PTE is installed, so a concurrent SMP thread can never touch a transient provisional mapping at the deterministic auto-picked base. (2)kernel/src/device_manager/stub.rs::try_enable_selected_queuefolds thequeue_enableno-live-mapping identity check and a retainedenable_pinnedcommit pin into one device-table critical section, set before the MMIO arm write and rolled back on a readback mismatch, so a concurrent map cannot record a mapping in the enable window. The commit pin is kept distinct from the device-armedenabledbit (set after the MMIO write): map admission /freeBufferpinning consultenabled || enable_pinned, while the RX-DMA drive gates onenabledalone, so a concurrent drive cannot run – and reset-clear the retained state – inside the arm window. A per-queueenable_in_progressclaim serializes concurrentqueue_enabletransitions (enable vs enable, enable vs disable) so a racingqueue_enable = 0can never clear an in-flight enable’s commit pin, and the drive’s reset cleanup skips clearing while a transition is in flight so a stale drive cannot clear a newer enable’s pin. The two transitions now serialize on the device-table lock: one fails closed. The interleavings are not single-CPU-reachable and the live kernel statics are not host-testable, so the proof is an exhaustive Loom interleaving model (capos-config/tests/dmabuffer_map_enable_loom.rs, run viacargo test-dmabuffer-map-enable-loom) that asserts no schedule arms a queue with a live user PTE on its ring buffer or installs a PTE without an accepted record; the single-CPUmake run-cloud-prod-nic-driver-userspace-rx-bringupregression continues to pass unchanged. Residual [CLOSED 2026-06-03]: the follow-up fenced the drive’s post-reset cleanup on a per-queue transition generation.RetainedVringQueue::generationis bumped on each completed transition (mark_selected_queue_armedenable arm +finish_selected_queue_disabledisable finish); the drive reads it (retained_queue_generation) IMMEDIATELY beforereset_device– the reset boundary, not the far-earlierenabledsample – and the cleanup (mark_retained_vring_queue_disabled_if_epoch) clears the pins only while the queue is still in the epoch the reset quiesced. A disable + re-enable that fully completes AFTER the reset advances the generation, so the cleanup becomes a no-op instead of clearing the freshly armed epoch’s pins; one that completed BEFORE the reset is included in the captured generation, so its pins clear (no stale over-pin). The only residual is a transition completing in the tiny read->reset-MMIO gap whose arm loses the race: a fail-closed over-pin, never an under-pin. The Loom model gainedfenced_stale_drive_cannot_clear_a_completed_re_enables_pins, which fails with the fence removed (GENERATION_FENCE = false). Task record:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-queue-enable-drive-reset-epoch-fence-local-proof.md. Predecessor (4a-v):docs/tasks/done/2026-06-03/dmabuffer-map-record-before-pte-install-ordering-local-proof.md. 4b.Nic-cap driver process, coupled TX/RX round-trip [DONE 2026-06-03]. Implements the slice-1Nicinterface stub as a liveCapObject(kernel/src/cap/nic_grant_source_prod.rs, granted via the newnicKernelCapSourceregistered incapos-config; clientNicClientincapos-rt). Over the SAME shim-brought-up device (RX queue 0 + TX queue 1 enabled by slices 1-3 + 4a), the cap drives the shim’s retained vring physes throughvirtio_net_userspace_rx_dma_proof::{nic_transmit, nic_receive, nic_quiesce}:receive()internally drives the coupled ARP-TX-stimulus + RX-poll and returns the received frame inline plus the observed EtherType;transmit()stages a frame into a manager-owned TX bounce page over the retained TX vring and rings the doorbell;macAddress()/linkStatus()read the kernel-mapped virtio-net device-config region. Frames cross the cap boundary as inlineDatacopied through manager-owned kernel bounce pages (host_physical_user_visible = 0; no host-physical / device-handle exposure); the device is left live for the cap’s lifetime and quiesced once on cap release (nic_quiesce: reset + queues-cleared + release the enabled-vring pins). Completion stays kernel-latched used-ring polled (int_injected = 0, noInterruptcap). The clean independent TX/RX split is deferred to slice 6; userspace IRQ ownership to slice 5. Proofmake run-cloud-prod-nic-driver-userspace-nic-cap-roundtripboots the device from userspace, round-trips two sequential frames through the typedNiccap (observed EtherType0x0806over QEMU SLIRP), and emits onecloudboot-evidence: nic-driver-userspace-nic-cap-roundtrip <token>marker withroundtrips=2. The same proof releases the parentDMAPoolcap and one pinned ringDMABuffercap beforeNicrelease, then showsNicquiesce replaying the blocked buffer detach and the pending parent pool detach completing after the remaining ring buffers are freed. Depends on 4a-ii (the shim-owned-vring data path). Task record:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-nic-cap-roundtrip-local-proof.md.
- 4a-i [DONE 2026-06-03]. The shim brings up RX queue 0 over its own
vring (slices 1-3 brought up only the TX queue; the
- Userspace IRQ ownership [DONE 2026-06-03]. The userspace NIC driver’s
Interruptcap for its device RX route becomes real, replacing slice-4b’s kernel-latched used-ring polled completion (int_injected = 0). Undercloud_virtio_net_userspace_irq_ownership_proof(implies slice 4b) a newInterruptgrant source (kernel/src/cap/virtio_net_userspace_irq_ownership_proof.rs, replacing the admission-onlyinterrupt_grant_source_prodsource via theKernelCapSource::Interruptarm) programs the staged virtio-net function’s RX MSI-X route (entry 0) mask-first through the landed always-builtcap::interrupt_programmed::program_attach_arm_unmaskand issues a cap whose:waitblocks on a real interrupt dispatch through the route’s MSI-X / LAPIC dispatch slot (device_interrupt::wait_kernel_injected_dispatch;delivery_countadvances, soint_injectedflips from 0 – slice 4b had noInterruptcap on the data path at all). The wake is a bounded kernel-injected dispatch through the route’s real deferred-EOI machinery, not yet a device-autonomous MSI-X raise causally tied to a specific frame;Nic.receivestill reads the frame bytes from the used ring, so this slice delivers IRQ-lifecycle ownership (the driver drives the realwait/acknowledge/mask/unmask), not interrupt-coalesced RX completion.acknowledgeretires exactly one deferred LAPIC EOI throughdevice_interrupt::acknowledge_deferred_lapic_eoi_for_route(hardwareDispatchAckDelta = 1); andmask/unmasktoggle the route’s own MSI-X vector-control bit (mask-first per PCI 3.0 §6.8.2) plus the manager-attached route state, throughpci::set_msix_table_entry_mask+device_interrupt::{mask,unmask}_device_manager_attached_route. The route is torn down (interrupt_programmed::teardown) on cap release. The driver holds thisInterruptcap alongside the slice-4bNic/DeviceMmio/DMAPoolcaps on the same function; it brings the device up from userspace, then drives the owned RX-interrupt lifecycle and reads the completed frame back throughNic.receive. The PCI function-level MSI-X enable bit is not toggled and no device-autonomous raise is attempted (device_autonomous_raise=not-attempted,waiter_wake=kernel-injected-dispatch); the landed DMA isolation, the slice-2/3 vring grants, and the buffer-identity / ring-buffer pinning are reused unchanged (host_physical_user_visible = 0, queue-address reads still refused). No newInterruptinterface or method (the existingwait/acknowledge/mask/unmaskbecome real for this route). Proofmake run-cloud-prod-nic-driver-userspace-irq-ownershipemits onecloudboot-evidence: nic-driver-userspace-irq-ownership <token>marker. Task record:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-irq-ownership-local-proof.md. - Clean TX/RX split [DONE 2026-06-03]. Decouples the coupled receive path
into independent TX and RX submission. Under
cloud_virtio_net_userspace_clean_tx_rx_split_proof(implies slice 5) theNiccap’sreceive @1dispatches to the newvirtio_net_userspace_rx_dma_proof::nic_receive_independentinstead of the couplednic_receive: it posts a manager-owned RX receive buffer on the retained RX vring, waits on the driver’s OWNED RX interrupt route (the slice-5device_interrupt::wait_kernel_injected_dispatchdispatch slot, resolved throughvirtio_net_userspace_irq_ownership_proof::owned_rx_route), retires the deferred LAPIC EOI, and reads the completed frame from the RX used ring – with no internal ARP-TX self-stimulus (it never submits to the TX vring). The RX frame is driven by an external stimulus: the consumer’s preceding independentNic.transmitof a real broadcast ARP request, which QEMU SLIRP answers; the inbound reply is held in the host net queue until the RX buffer is posted.Nic.transmitstays independent (submits the caller’s frame and rings the TX doorbell with no RX poll; surfacesrx_polls=0). The wake stays the bounded kernel-injected dispatch slice 5 owns (waiter_wake=kernel-injected-dispatch,device_autonomous_raise=not-attempted). Reuses the landed owned-vring / owned-IRQ / DMA-isolation unchanged: no new selected-write register, no new MSI-X surface, no newNic/Interruptmethod (make generated-code-checkgreen), no host-physical / handle exposure (host_physical_user_visible = 0, queue-address reads refused). The driver does an independenttransmitthen a separate independentreceive, neither performing the other’s submission (tx_independent=ok,rx_independent=ok,receive_self_stimulus=removed). Proofmake run-cloud-prod-nic-driver-userspace-clean-tx-rx-splitemits onecloudboot-evidence: nic-driver-userspace-clean-tx-rx-split <token>marker. Task record:cloud-prod-nic-driver-userspace-clean-tx-rx-split-local-proof. - Network-stack process + smoltcp relocation – a second userspace process
holding the
Niccap and a bounded time source, runningsmoltcp, implementing the socket caps while preserving thecap/network.rscontract. Slice 6 is now done, so this slice is decomposed into bounded increments rather than attempted as one step:- 7a (first increment, DONE 2026-06-03): network-stack-process skeleton. A
userspace process runs a minimal
smoltcpInterfaceover aphy::Deviceadapter backed by the landed independentNic.transmit/Nic.receive(slice 6), clocked by aTimercap, and drives one observable Ethernet exchange through SLIRP:smoltcp– not hand-rolled frame code – ARPs the gateway out throughNic.transmit, consumes the reply in throughNic.receive, and emits the queued IPv4/UDP datagram, so the neighbour cache observably advances. No socket caps, nocap/network.rsrelocation,virtio_stub.rsunchanged. Proofmake run-cloud-prod-network-stack-process-smoltcp-skeleton. Implementation note: the landedNiccap is not yet self-sufficient – itstransmit/receiveride on the userspace driver shim’s retained vring (the kernel does not own the vring), so the skeleton process performs the slice-1-6 bring-up itself (it also holds theDeviceMmio/DMAPool/Interruptcaps) before runningsmoltcp. Splitting the bring-up into a separate long-lived NIC-driver service, so the network-stack process holds onlyNic+Timer+Console, is folded into the 7c contract-relocation increment; it does not change the proven smoltcp-substrate claim. Task record:cloud-prod-network-stack-process-smoltcp-skeleton-local-proof. - 7b (socket layer, DONE 2026-06-03): socket caps over the userspace
smoltcp stack – a userspace
UdpSocketcap layer (UdpSocketCapLayer) implements theUdpSocketschema’ssendTo/recvFromsemantics over the 7aInterfaceand proves one bounded UDP request/response: a DNS A query forexample.comto SLIRP’s resolver at10.0.2.3:53viasendTo, then the decoded response viarecvFrom. smoltcp drives every frame through theNiccap (ARP reply + DNS reply both fetched throughNic.receive,host_physical_user_visible = 0preserved);Timerclocks the poll. The socket layer is in-process – it does not yet serve the socket interfaces as inter-process transferable capabilities, and it does not touchcap/network.rs(virtio_stub.rsstays fail-closed). Proofmake run-cloud-prod-network-stack-smoltcp-socket-caps(onecloudboot-evidence: network-stack-smoltcp-socket-caps <token>marker). Task record:cloud-prod-network-stack-smoltcp-socket-caps-local-proof. Serving the socket interfaces as inter-process caps (aNetworkManager-like broker) andTcpListener/TcpSocketare folded into the 7c contract relocation. - 7c (contract relocation): preserve the
cap/network.rscontract behind the userspace network-stack process so the production L4 entry points (virtio_stub.rs) stop returningDeviceUnavailablefor the armed manifest. This is the body of the whole-slice record (cloud-prod-userspace-network-stack-smoltcp-local-proof), decomposed because it bundles three independently-large pieces:- 7c-i (inter-process socket cap, DONE 2026-06-03): serve the slice-7b
userspace
UdpSocketCapLayeras a real inter-process transferable capability. A network-stack server process brings the device up, builds the userspacesmoltcpUdpSocketlayer, and serves theUdpSocketschema (sendTo/recvFrom/close) over an exportedEndpoint; a separate client process re-interprets the served cap as aUdpSocketand drives one bounded DNS A query/response through the productionUdpSocketClient, withsmoltcpstill moving every frame through theNiccap (host_physical_user_visible = 0). Proofmake run-cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc. This is the prerequisite for the kernel contract relocation.cap/network.rs/virtio_stub.rsare unchanged. Task record:cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc-local-proof. - 7c-ii (
cap/network.rsrelocation, DONE 2026-06-07): route the armed Phase C manifest’s production L4 socket entry points to the userspace network-stack service so the local proof no longer reachesvirtio_stub.rsDeviceUnavailable. This is itself decomposed (see “7c-ii Mechanism and Decomposition” below) into 7c-ii(a) – serveTcpListener/TcpSocketas inter-process caps with accept-returns-a-result-cap, an architecture-agnostic prerequisite (cloud-prod-network-stack-smoltcp-tcp-socket-cap-ipc-local-proof) – and 7c-ii(b) – the local serve-from-userspace proof. In 7c-ii(b) (DONE 2026-06-07), the network-stack process boots the non-qemucloudboot kernel, spawns an application client with onlyConsoleplus a userspace-servedTcpListenAuthority, returns a userspace-servedTcpListenerfromlisten, returns aTcpSocketresult cap fromaccept, and completes one hostfwd TCP request/response throughrecv/send. The selected architecture remains serve-from-userspace: applications consume the existing typed socket interfaces from a userspace network-stack process instead of extending the legacy kernel-routed socket owner. - 7c-iii (
TcpListener/TcpSocket, DONE 2026-06-04): the userspacesmoltcpstack now serves a realTcpListener/TcpSocketround trip over theNiccap, using the sustained-receiveNic.receivePoll @4(slice 7d) for the multi-frame TCP exchange. A single cloudboot service brings the device up from userspace, runs asmoltcpTCP socket listening on port 8080 driven by the non-resettingreceivePoll @4pump, and – against an external host TCP client over a QEMUhostfwdrelay – completes one bounded TCP handshake + request/response (asserting the received request equals the expected probe and echoing it back).smoltcp, not hand-rolled frame code, moves every frame; the device stays armed across the SYN/SYN-ACK/ACK/request/response/FIN exchange with no per-frame reset.host_physical_user_visible = 0; queue-address reads refused; the bounce RX pool quiesces + scrubs on teardown. This increment proved the generic TCP socket substrate in-process; 7c-ii(a) later servedTcpListener/TcpSocketinter-process with accept-returns-a-result-cap. Neither increment changescap/network.rs/virtio_stub.rs; that final production-manifest wiring is 7c-ii(b). Proofmake run-cloud-prod-network-stack-smoltcp-tcp-listener-roundtrip. Task record:cloud-prod-network-stack-smoltcp-tcp-listener-roundtrip-local-proof. Built on the prerequisitecloud-prod-nic-driver-userspace-sustained-receive-pool-local-proof.
- 7c-i (inter-process socket cap, DONE 2026-06-03): serve the slice-7b
userspace
- 7a (first increment, DONE 2026-06-03): network-stack-process skeleton. A
userspace process runs a minimal
- Kernel
smoltcp/ virtio-net removal (Phase C exit) – done 2026-06-08. The kernel no longer depends onsmoltcp, and the qemu-onlycap/network.rssocket entry points now fail closed instead of reaching an in-kernel TCP/UDP runtime. The retained virtio-net code is a lower-layer QEMU fixture for PCI/MMIO/virtqueue, ARP, ICMP, and descriptor-generation proofs; it is not the production cloud socket owner. Task record:cloud-prod-phase-c-kernel-smoltcp-virtio-net-removal.
Each slice is parented to the appropriate predecessor; slices 7-8 deliver the L4 socket reachability the scope decision sequenced after the cloud milestone.
7c-ii Mechanism and Decomposition
7c-ii is “preserve the cap/network.rs contract behind the userspace
network-stack process.” Before the serve-from-userspace path landed, the
non-qemu cloud kernel could still grant production L4 methods
(NetworkManager.createTcpListener, TcpListenAuthority.listen,
TcpListener.accept, TcpSocket.send/recv, the UDP methods) through
kernel-side CapObjects in kernel/src/cap/network.rs that call
crate::virtio::*. In that build, crate::virtio resolves to
kernel/src/virtio_stub.rs: creation entry points return
NetworkError::DeviceUnavailable, while existing-handle operations fail closed
with invalid-handle errors. Before Phase C exit cleanup, the working smoltcp
runtime existed only in the cfg(qemu) kernel/src/virtio.rs; it has now been
removed, and the qemu socket entry points match the same fail-closed shape.
Relocating the contract means production L4 methods are satisfied by the
userspace network-stack service instead of the stub, and non-qemu bootstrap
grants for the legacy kernel NetworkManager and TcpListenAuthority sources
now fail closed before those CapObjects are minted.
Mechanism constraint. A literal reading – the kernel keeps serving the L4
caps and forwards each call to the userspace service – requires the kernel to
originate a capability Call to a userspace-served Endpoint and complete the
original caller once the service returns. That kernel-as-client-of-a-userspace-
service path does not exist today: the ring/endpoint machinery
(kernel/src/cap/ring.rs, kernel/src/cap/endpoint.rs,
kernel/src/cap/transfer.rs) only dispatches SQEs from a userspace process’s
ring; the kernel never enqueues a Call against a userspace endpoint nor parks
waiting for the Return. Building that inversion is a new kernel IPC subsystem,
not a drive-by, and it adds kernel surface that the Phase C exit deliberately
avoids by retiring the kernel L4 owner instead.
7c-ii(b) Architecture Decision: Serve From Userspace
On 2026-06-07 the operator selected serve-from-userspace for the final
production-manifest wiring. The armed Phase C manifest receives a
userspace-served NetworkManager or TcpListenAuthority cap from the
network-stack service. The legacy kernel cap/network.rs / virtio_stub.rs
socket path is now fenced to cfg(qemu) fixture manifests and stale negative
paths: non-qemu manifests that request kernel network_manager or
tcp_listen_authority are rejected during bootstrap, so missing served
authority does not fall back to the old kernel socket owner.
The rejected alternative for this stage is kernel-brokered forwarding: keeping the kernel as the socket cap server while it calls into a registered userspace network-stack service. That route would add a new kernel-originated Endpoint/deferred-completion subsystem. The selected direction keeps the microkernel boundary cleaner by moving behavior to userspace where that does not compromise security.
Grounding for this decision:
kernel/src/cap/network.rsimplements the current kernel-servedNetworkManager,TcpListenAuthority,TcpListener,TcpSocket, andUdpSocketobjects, including result-cap transfer for listener and socket creation.kernel/src/virtio_stub.rsremains the non-qemunegative-result endpoint for stale kernel networking call sites, but bootstrap no longer grants itsNetworkManagerorTcpListenAuthoritycallers in production manifests.kernel/src/cap/ring.rs,kernel/src/cap/endpoint.rs, andkernel/src/cap/transfer.rsimplement userspace-originatedEndpointcalls, receive/return, and capability transfer. They do not currently implement a kernel-originated call into a userspace endpoint.- The 7c-i and 7c-ii(a) task records prove userspace-served socket caps over
the existing
Endpointand RETURN result-cap path:cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc-local-proofandcloud-prod-network-stack-smoltcp-tcp-socket-cap-ipc-local-proof. docs/research/capability-systems-survey.mdanddocs/research/spritely-captp-ocapn.mdreinforce the local capOS rule used here: explicit object references are authority, and forwarding/proxying should make authority flow and lifetime explicit rather than hiding it behind ambient names.
Current constraints that both options must preserve:
- Existing caller contract.
NetworkManager.createTcpListener,TcpListenAuthority.listen,TcpListener.accept,TcpSocket.send/recv, and the UDP methods keep the same typed Cap’n Proto surfaces unless a later implementation task explicitly changes schema and regenerated bindings. - Manifest authority. Only the armed Phase C manifest receives the new production L4 path. Missing registration, missing served cap, stale service identity, or an unarmed manifest fails closed instead of falling back to a broad kernel escape hatch; the legacy kernel socket sources are qemu-only fixtures.
- No new packet authority. The socket service consumes the existing
Niccap and the landedreceivePoll @4sustained-receive path. This decision does not add DMA, MMIO, IRQ, queue-address, host-physical, GCE, public ingress, TLS, or certificate authority. - Private/local Web UI critical path. The first Web UI proof remains local and then private GCE reachability. Public ingress, TLS, firewall/DNS exposure, and live-cloud runs stay gated by their separate task records.
| Axis | Kernel-brokered forwarding (rejected for 7c-ii(b)) | Serve-from-userspace (selected) |
|---|---|---|
| Cap server identity | The kernel keeps serving cap/network.rs objects for the armed manifest and forwards each socket operation to a registered userspace network-stack service. | The armed manifest receives a userspace-served NetworkManager or TcpListenAuthority cap from the network-stack service, following the 7c-i / 7c-ii(a) serving pattern. |
| Kernel IPC delta | Adds a kernel-as-client forwarding path: service registration, endpoint identity validation, kernel-originated call construction, transfer handling, cancellation, and deferred completion back to the original caller. | Adds no kernel-originated endpoint call path. The existing userspace caller-to-userspace endpoint path carries the socket methods and result caps. |
| ABI and IPC risk | Higher. The existing ring/endpoint code accepts userspace SQEs and endpoint returns; it does not yet encode the kernel as an endpoint caller. The implementation must specify caller-session metadata, transfer rollback, service disappearance, cancellation, and result-cap insertion semantics for forwarded calls. | Lower kernel ABI risk. Schema can remain unchanged if the served cap implements the current socket interfaces. Risk shifts to manifest grant wiring, service startup ordering, endpoint lifetime, and making stale or missing service authority fail closed. |
| Production-surface compatibility | Preserves the literal “kernel-routed cap/network.rs surface” for the armed manifest, which minimizes caller-visible routing change. | Makes the production socket cap userspace-served for the armed manifest. Callers still use the same typed socket interfaces, but the authority source is the network-stack service rather than cap/network.rs. |
| Fail-closed behavior | Requires a bounded registration table and an explicit “no registered service” error path. Forwarding must not silently reach virtio_stub.rs except as a deliberate unarmed-manifest failure mode. | Uses manifest-level grant selection and service liveness checks. If the served cap is absent, stale, or not granted, the non-qemu manifest fails closed; virtio_stub.rs remains only a stale-call/fixture negative path. |
| Validation burden | Must prove forwarding with at least one socket operation, result-cap transfer through the forwarded path, service crash/cancel cleanup, missing-service failure, and no kernel-thread parking. Before Phase C exit, make run-net covered the old qemu-only socket path. | Must prove the armed manifest gets the served network cap, completes the socket round trip through the userspace service, rejects an unarmed or missing-service manifest, and preserves the existing inter-process socket-cap proofs. Phase C exit keeps only lower-layer QEMU virtio-net fixture coverage after the old kernel L4 owner is removed. |
| Phase C exit interaction | Leaves a new kernel forwarding subsystem after kernel smoltcp and virtio-net hot-path removal. Slice 8 would have had to decide whether that subsystem was permanent generic IPC infrastructure or Phase C-specific scaffolding. | Aligns with the Phase C exit direction that kernel networking becomes routing/grant setup while L4 service behavior lives out of kernel. Slice 8 removed kernel smoltcp/virtio-net L4 ownership and stale socket paths. |
Follow-up task-state changes after the selection:
cloud-prod-userspace-network-stack-smoltcp-local-prooflanded the local serve-from-userspace proof: manifest grant wiring, network-stack service startup/liveness, servedTcpListenAuthorityauthority, and the final socket round-trip proof. No kernel-originatedEndpointcall task is implied by this choice.cloud-prod-legacy-kernel-network-socket-path-retirementretires the armed-manifest route through the legacy kernel socket owner by rejecting non-qemukernelnetwork_manager/tcp_listen_authoritygrants, preserving userspace-served authority as the production path.cloud-prod-phase-c-kernel-smoltcp-virtio-net-removalis done. It removes the kernelsmoltcpdependency, retires the qemu-only kernel TCP/UDP runtime behind fail-closed socket entry points, and documents the remaining virtio-net coverage as lower-layer QEMU fixture evidence rather than production cloud socket ownership.network-socket-terminal-session-userspace-migrationis the existing follow-up for moving kernel-owned socket terminal byte handling out ofkernel/src/cap/network.rs.- Keep
cloud-prod-network-stack-dhcp-ipv4-config-local-proofas the done local DHCP/IPv4 config proof feedingcloud-prod-remote-session-web-ui-l4-local-proof, andcloud-gce-private-self-hosted-webui-proofwhile their remaining existing prerequisites land. Keep public ingress and TLS undercloud-gce-public-self-hosted-webui-ingress-tlsuntil separately authorized.
This decision unblocks 7c-ii(b) implementation. It preserves the fact that
7c-ii(a) was sequenced first because serving TcpListener/TcpSocket as
inter-process caps is useful under either architecture, and it now becomes the
direct serving substrate for the selected path.
Result-cap transfer needs no new ABI. The RETURN transfer-descriptor ABI
that lets a served method return a freshly-minted capability already exists: a
CAP_OP_RETURN carries xfer_cap_count transfer descriptors, the kernel inserts
them into the caller’s table and publishes CAP_CQE_TRANSFER_RESULT_CAPS
(kernel/src/cap/ring.rs::dispatch_return -> insert_prepared_transfer_caps ->
write_endpoint_return_result), and the client decodes the returned cap with
capos-rt’s CompletedCall::result_cap. The 7c-i UDP server already serves caps
over an exported Endpoint through exactly this RETURN path, so the
userspace-Endpoint-server side is proven; the kernel’s own
cap/network.rs::handle_accept (via insert_socket_result_cap) and the
telnet-gateway demo (listener.accept_wait; both since retired) proved the complementary
accept-returns-a-TcpSocket-cap shape and its client-consume side. What
7c-ii(a) adds is wiring – a userspace Endpoint server returning a TcpSocket
result-cap from TcpListener.accept – not a new ABI. Its hard part is the
smoltcp-pump-while-serving interleaving for a blocking multi-frame accept,
which is why 7c-iii deferred inter-process TCP serving.
Downstream self-hosted Web UI tasks consume slice 7 without making the Phase C exit cleanup a blocker for first GCE operator proof:
remote-session-self-served-full-ui-bundleis done and provides the reviewed boot-resource UI bundle.cloud-prod-remote-session-web-ui-l4-local-proofprovesremote-session-web-uiover the Phase C L4 path locally after the done DHCP/IPv4 configuration proof.cloud-gce-private-self-hosted-webui-proofproves the Web UI privately on GCE over the live NIC.cloud-gce-public-self-hosted-webui-ingress-tlsis the separate public ingress, TLS, and reviewed-auth posture step.
Network usability and post-smoltcp follow-ups are decomposed in
Network Usability and Post-smoltcp.
They do not change the Phase C critical path above: the local DHCP/IPv4
configuration proof is now done for the first GCE Web UI path, while the system
DnsResolver cap, POSIX getaddrinfo, ping/ping6 tools, packet tracing,
socket readiness policy, and transport tuning/status are follow-on usability or
diagnostics lanes.
Slices 7a-7c are smoltcp relocation: they run the selected smoltcp 0.13.0
build in userspace and preserve the socket contract, adding no new transport
mechanic. Transport policy/status — read-only transport status, keepalive/
timeout policy inputs, and the deferred congestion-control evaluation — is a
distinct control-plane lane decomposed in the backlog under
network-transport-policy-status-decomposition,
not part of the relocation slices. The userspace relocation track now has
landed UDP and TCP substrate proofs, and the TCP build still runs with
CongestionControl::None by build configuration; selecting Reno/CUBIC is a
build-feature flip, and any custom TCP mechanic requires separate workload
evidence and a reviewed task.
The Sustained-Receive Nic ABI (Former Prerequisite For 7c-iii)
7c-iii (TcpListener/TcpSocket) was blocked on the shape of the landed
Nic.receive. This section records the precise constraint, the
sustained-receive primitive that lifted it while keeping the settled DMA
isolation intact, and the ABI decision.
Status: landed. The Nic.receivePoll @4 method and the kernel-owned bounce
RX pool primitive designed below shipped in slice 7d
(cloud-prod-nic-driver-userspace-sustained-receive-pool-local-proof):
cap::virtio_net_userspace_rx_dma_proof::nic_receive_poll arms a pool of
NIC_RX_POOL_SIZE manager-owned bounce RX buffers and recycles them
individually (copy-out + scrub + slot-generation bump + re-post) with no
per-frame device reset; the receivePoll @4 dispatch arm lives in
cap::nic_grant_source_prod. The multi-frame proof
(make run-cloud-prod-nic-driver-userspace-sustained-receive-pool) drains more
than one frame with at least one non-resetting framePresent = false poll and
keeps the DMA-isolation assertions green (host_physical_user_visible = 0,
queue-address reads refused, quiesce + scrub at teardown). receive @1 is
unchanged. The rest of this section is the as-built design record.
The constraint, precisely (cite the real symbols)
The production Nic.receive @1 dispatches (under the frontier slice-6
cloud_virtio_net_userspace_clean_tx_rx_split_proof feature) to
virtio_net_userspace_rx_dma_proof::nic_receive_independent
(kernel/src/cap/virtio_net_userspace_rx_dma_proof.rs:1437, dispatched from the
receive @1 arm in kernel/src/cap/nic_grant_source_prod.rs:309). Each call:
- allocates one fresh kernel bounce frame (
frame::alloc_frame_zeroed), - authors one RX descriptor + avail entry pointing at it and rings the RX doorbell,
- waits on the driver’s owned RX interrupt route
(
device_interrupt::wait_kernel_injected_dispatch) and retires the deferred LAPIC EOI, - polls the RX used ring for one completion, copies the frame out, and
- frees that bounce frame (
frame::try_free_frame).
Two properties make this single-frame primitive unable to serve TCP:
- No pool stays armed between calls. Exactly one RX buffer is posted per
call and freed when the call returns; between
receivecalls the device has no posted RX buffer to master into. A frame that arrives outside a call has nowhere to land. - No non-resetting “no frame yet”. On a successful frame the independent
path keeps the device live (it does not reset – success arm at
virtio_net_userspace_rx_dma_proof.rs:1535), but an empty/timeout poll (RxDmaFailure::UsedRingPollExhausted) takes the error arm and quiesces the device (nic_quiesce_device:reset_device+ assert queues cleared + release pins). The predecessor proof pathdrive_rx_dma(virtio_net_userspace_rx_dma_proof.rs:331) resets on every outcome; the independent path narrowed that to reset-on-empty-poll, but the reset-on-empty remains. So a speculative “is there a frame?” poll with nothing waiting tears the device down.
smoltcp drives the opposite shape: its poll() loop calls the phy::Device
RX token speculatively and frequently, expecting “no frame yet” to be a cheap,
side-effect-free answer against a device that stays armed with multiple
posted buffers, so the asynchronous, multi-frame TCP exchange (SYN-ACK, data
segments, ACKs, retransmits arriving whenever the peer chooses, sometimes
several in quick succession) can be drained as frames arrive. That is why
7c-iii needed the since-landed sustained-receive ABI before the remaining
7c-ii(b) final-wiring task could stay on hold solely for the operator
architecture decision.
The reset-on-empty is not a bug: it is part of the settled DMA-isolation
model (docs/dma-isolation-design.md). Frames cross through manager-owned
bounce pages; userspace never sees a host-physical or device-usable address;
and after a buffer is reclaimed the device must be proven to have stopped
mastering it (“if in-flight DMA cannot be proven stopped, revocation escalates
to device reset”, docs/dma-isolation-design.md DMAPool Invariants ->
Reset). The single-frame path proves that the crude way – reset the whole
device. The design problem is to keep the isolation guarantee without
resetting the device on every empty poll or every reclaimed buffer.
The sustained-receive primitive
Keep the device armed with a kernel-owned bounce RX pool of N buffers and
recycle buffers individually instead of resetting the device:
- Arm. At first use (or at cap setup) the kernel allocates
Nmanager-owned bounce RX buffers from the driver’s grantedDMAPooland posts allNto the RX vring avail ring. The device masters only into these kernel-owned bounce pages; userspace still receives no host-physical or device-usable address (host_physical_user_visible = 0), exactly as the single-frame path proved. - Drain one arrived frame (per poll). The kernel reads the RX used ring. If
used.idxadvanced, a frame landed in bounce slotk; the kernel treats slotkas device-written untrusted input (docs/dma-isolation-design.md“Receive buffers are treated as device-written untrusted input until validated by the driver or stack”), copies the frame bytes out into the inlineDatareply bounded by the posted buffer length, then recycles slotk. - No frame yet. If
used.idxdid not advance, the call returns “no frame” with no reset and the device stays armed. This is the cheap speculative pollsmoltcpneeds. - Teardown.
on_release(and any unprovable-in-flight-DMA error) still quiesces:reset_device, assert both queues cleared, scrub the whole pool, release the enabled-vring pins, and drop the pool – identical to the existingnic_quiescediscipline. Reset remains the escalation path; it is simply no longer the per-frame path.
The per-buffer invariant that replaces “reset before reclaim”: a bounce
slot is re-exposed to the device only after its copy-out completes and its slot
ownership generation is bumped, with the slot scrubbed before the re-post. This
is the production handle-epoch slot identity (slot + slot_generation,
docs/dma-isolation-design.md Production Handle Epoch Invariants and
DMAPool Invariants: “Buffer operations additionally check the buffer slot and
slot generation before descriptor validation, completion accounting, free,
scrub, or reuse”) applied at buffer-recycle granularity instead of
device-reset granularity:
- the device wrote slot
kand signalled completion via the used ring (it is no longer mastering that slot – the per-buffer analogue of “in-flight DMA is proven stopped”); - the kernel copies the bytes out;
- the kernel scrubs slot
k(residual-state rule,docs/dma-isolation-design.mdResidual state) and bumps its slot generation, so a stale descriptor, free, or completion for the prior occupant fails closed; - only then does the kernel re-post slot
kto the avail ring (re-arm).
As built, the pool buffers are kernel-private frame::alloc_frame_zeroed pages
(never a userspace DMABuffer handle), so the slice does not go through the
single-frame device_dma begin_rx_dma_drive_pin drive-pin code path (that pin
guards a userspace-submitted DMABuffer’s live-unmapped re-validation, which
has no analogue for a manager-private frame). Instead the slice applies the
same per-buffer slot-identity discipline modeled on the production
handle-epoch slot identity (slot + slot_generation,
docs/dma-isolation-design.md) as kernel bookkeeping local to the pool: the
device masters only into kernel-owned pages, each slot is scrubbed and its
generation bumped before re-exposure, and teardown still quiesces (reset) before
reclaim. No new isolation backend, no new IOVA-export rule, no host-physical or
device-usable address exported (host_physical_user_visible = 0).
ABI decision: extend Nic with a non-resetting poll receive (option a)
Chosen: option (a) – add a non-resetting poll-receive method to the Nic
schema and keep receive @1 as the legacy single-shot:
interface Nic {
transmit @0 (frame :Data) -> (result, reason, sideEffect);
receive @1 () -> (frame :Data, observedEthertype :UInt16,
result, reason, sideEffect); # legacy single-shot
macAddress @2 () -> (addr :Data, result, reason, sideEffect);
linkStatus @3 () -> (up :Bool, result, reason, sideEffect);
# Sustained, non-resetting receive over the armed kernel-owned bounce RX
# pool. Returns the next arrived frame, or framePresent = false with no
# device reset when none has arrived. The device stays armed.
receivePoll @4 () -> (frame :Data, observedEthertype :UInt16,
framePresent :Bool,
result :Text, reason :Text, sideEffect :Text);
}
How receivePoll @4 reports its two outcomes without a reset:
| Outcome | framePresent | frame | result / reason / sideEffect |
|---|---|---|---|
| frame arrived | true | frame bytes inline | ok / frame-received / buffer-recycled (slot copied-out, generation bumped, re-posted) |
| no frame yet | false | empty | ok / no-frame / device-armed (no reset; pool still posted) |
| fail closed | false | empty | failed / <reason> / device quiesced (escalation only, not the empty-poll path) |
receive @1 semantics are unchanged (it still resets on empty poll), so the
7b UDP request/response proof stays green; only the new method is non-resetting.
Why not option (b) (in-stack sustained-receive loop against a kernel
primitive). Option (b) – the network-stack process drives a bounded
sustained-receive loop against a kernel primitive directly – was rejected
because it still requires the identical new kernel RX-pool machinery (the design
above) and drives it outside the typed Nic cap boundary, fragmenting the
device-facing authority the whole track funnels through one Nic cap (slices
4b-7). Option (a) keeps the network-stack process holding exactly the Nic
cap it already holds, maps the new method one-to-one onto the smoltcp
phy::Device RX token (“give me the next arrived frame or nothing”), and is the
direct extension of the inline-Data receive ABI the track already accepted.
The cost of (a) is bounded and already named in the abi hazard: one new schema
method, a make generated-code-check regeneration of the checked-in capnp
bindings, and updating every exhaustive Nic method match (receivePoll @4
arm in kernel/src/cap/nic_grant_source_prod.rs; the NicClient in
capos-rt). A receiveBatch returning List(Data) was considered and deferred:
the smoltcp RX token consumes one frame at a time, so single-frame poll is the
clean match and batching is a separately-justified later optimization.
Design Grounding
docs/proposals/network-reachable-datapath-scope-decision.md– the parent scope decision (Option A) that opened this track.docs/proposals/networking-proposal.md, Part 3 (Phase C) – the accepted decomposition, its prerequisites table and exit criteria, and theNicdraft this doc adopts (inlineData).docs/dma-isolation-design.mdS.11.2 – the DMA-isolation invariants and the driver-transition gate, already satisfied by the landed bounce / IOMMU-IOVA track that slice 2 reuses.kernel/src/cap/device_mmio.rs(notify_doorbell),kernel/src/cap/dma_buffer.rs(export labels),kernel/src/cap/virtio_net_polled_provider.rs(kernel-owned vring) – the primitives whose current limits define the cap-surface gap.