Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase C: Userspace virtio-net Driver Relocation

This is the L4 track opened by the Network-Reachable Datapath Scope Decision (Option A): raw-frame TX/RX reachability is the cloud milestone bar, and the L4 socket path – relocating smoltcp and the cap/network.rs socket caps out of the cfg(qemu) kernel/src/virtio.rs into userspace processes (networking-proposal Part 3, Phase C) – is a separate future track. This doc designs that track and sequences its slices.

The first Phase C Web UI path remains IPv4-scoped: userspace L4 plus DHCP/IPv4 configuration, ARP, and the private/public GCE Web UI proofs. IPv6 is tracked as a separate network-stack capability lane in Networking and the hardware/cloud backlog. Phase C must preserve enough address-family shape for that lane, but lack of IPv6 does not block the first IPv4 GCE Web UI proof.

Cap-Surface Delta: What The Userspace Driver Needs

The current DeviceMmio / DMAPool / Interrupt cap surface does not yet host the virtio-net driver in userspace as built, but the missing pieces are bounded extensions of accepted patterns and a reuse of the landed production DMA-isolation track – not new isolation built from scratch. Per-primitive evidence:

  • DeviceMmio gives a read-only BAR window with one selected write today. DeviceMmio.map returns a read-only BAR page; raw write32 is refused (register_write = "blocked"), with exactly one selected write permitted – the notify doorbell at @5 (notify_doorbell, kernel/src/cap/device_mmio.rs). A driver must additionally write the virtio common-config window (device status, feature-select/feature, queue-select, queue-size, queue-address/queue-enable). The relocation adds these as further selected writes under the same accepted range-check + read-back discipline the notify doorbell already enforces (see “The Common-Config Window” below) – not a new write primitive.
  • DMAPool does not yet export a device-usable address to this driver, but the export discipline is landed. DMAPool gives one bounce page; the host-physical / device address is not exported in the bounce posture (host_physical_user_visible = false, direct_dma = "blocked", iova_export = "disabled-future-only", kernel/src/cap/dma_buffer.rs). The vring is kernel-owned today (kernel/src/cap/virtio_net_polled_provider.rs), so userspace does not yet place its own descriptors. The mechanism to let it do so safely – a manager-owned bounce buffer or a domain-scoped IOMMU IOVA, never a raw host-physical address – is already landed (the production DMA-isolation track; see “The Userspace-Ownable vring Slice” below); the slice-2 work wires it to the driver’s vring.
  • Interrupt is wait-only over a kernel-latched used ring. Interrupt.wait reads a kernel-latched used-ring index, acknowledge is a no-op, and mask/unmask are refused. A real driver owns its IRQ lifecycle (mask, unmask, EOI ordering).

Classifying the virtio-net bring-up steps against what userspace can do today makes the gap concrete – almost every step is kernel-only:

Bring-up stepUserspace-doable today?
Device reset (write status = 0)No – needs writable status register
ACKNOWLEDGE / DRIVER status bitsNo – needs writable status register
Feature negotiate (select + read + write + FEATURES_OK)No – needs writable feature-select/feature + status
Queue program (queue-select, queue-size, queue-address, queue-enable)No – needs writable common-config + a device-usable vring address
vring allocation (avail/used/descriptor tables)No – vring is kernel-owned; no device-usable buffer address export
DRIVER_OK (write status)No – needs writable status register
MSI-X program / vector assignmentNo – kernel-owned
Submit + notify (ring doorbell)Partial – the one selected doorbell write @5 exists
Poll used ringPartial – via the kernel-latched index Interrupt.wait reads
Teardown (reset, scrub, release)No – kernel-owned reset path

The Nic ABI (Inline Data, Per the Proposal Draft)

This track keeps the networking-proposal Part 3 frame ABI: frames cross the cap boundary as inline Data (transmit @0 (frame :Data), receive @1 () -> (frame :Data), networking-proposal:443). The kernel copies the frame into and out of the manager-owned bounce buffer the polled provider already established, so no host-physical address or device-usable buffer handle is exported to userspace and host_physical_user_visible=0 is preserved. capOS method convention adds the result/reason/sideEffect evidence triple and the observed EtherType to the result:

interface Nic {
    transmit @0 (frame :Data)
        -> (result :Text, reason :Text, sideEffect :Text);
    receive  @1 ()
        -> (frame :Data, observedEthertype :UInt16,
            result :Text, reason :Text, sideEffect :Text);
    macAddress @2 () -> (addr :Data);
    linkStatus @3 () -> (up :Bool);
}

Why inline Data, not a zero-copy buffer handle. A DmaBuffer-handle (zero-copy) ABI was considered – it would avoid the per-frame copy – but rejected to keep the change small: it introduces a new buffer-ownership protocol across the cap boundary (who allocates, who frees, lifetime versus the call) on top of the security work this track already requires, for a copy cost that does not matter at research scale. Inline Data matches the accepted proposal draft, keeps the frame staging kernel-owned exactly as the polled provider proved, and defers any zero-copy optimization to a later, separately justified slice.

The Common-Config Window (Selected-Write, No New Ruling)

Relocation writes the virtio common-config window from userspace through a bounded, range-checked, selected-write path modeled on the existing single selected write (notify_doorbell @5, kernel/src/cap/device_mmio.rs; device_manager::provider_notify_doorbell_write_for_cap). This is the next register in the accepted selected-write pattern, not a new security relaxation requiring a ruling. DeviceMmio already refuses raw write32 (register_write = "blocked") and admits exactly the claimed selected write, range-checked against the decoded BAR and followed by a kernel-asserted read-back; the handshake registers are added to that same admission list under the same discipline.

The bounded design:

  • A selected-write common-config window: only the named virtio common-config registers needed for handshake (device status, feature-select, device-feature, driver-feature) are writable in slice 1, each range-checked against the claimed BAR.
  • Read-back-assertion discipline: every selected write is followed by a read-back the kernel asserts, so a userspace driver cannot leave the device in an unverified state.
  • Queue-address registers stay fail-closed in slice 1 and are admitted to the same selected-write list only in slice 2, where each programmed value must resolve to a device-usable address the writing driver was granted (the DMA-isolation discipline below decides), so userspace can never point the device at arbitrary physical memory.

No new ruling is pending: the project already decided this posture through the accepted selected-write discipline and the IOVA-export discipline below. Slice 1 is ready.

The Userspace-Ownable vring Slice (Reuses Landed DMA Isolation)

The expensive-sounding piece – slice 2, a userspace-ownable vring plus a device-usable buffer-address export – is wiring already-landed isolation to the driver’s vring, not building isolation from scratch. The two backends and the no-host-physical export discipline are all landed:

  • Bounce-buffer path (production default on no-IOMMU shapes). The runtime DMA-backend probe (kernel/src/dma_backend.rs, select_and_report / probe_verified_usable_iommu) selects the labeled bounce-buffer fallback fail-closed when no usable guest IOMMU is verified, and the manager-owned bounce-buffer DMAPool / DMABuffer lifecycle (kernel/src/device_dma.rs, scrub-before-free, owner/slot generations, quiesce-before-release) is the landed authority a driver uses for device-visible buffer memory (cloud-prod-dmapool-bounce-buffer-grant-proof).
  • IOMMU-IOVA path (graduates when the probe verifies usable hardware). The Intel VT-d remapping path (kernel/src/iommu.rs, cfg(qemu) today) programs per-device domains, maps manager-owned DMAPool pages, and exports only a domain-scoped IOVA – never a host-physical address (ddf-iommu-remapping-production-closeout, ddf-iommu-production-dmapool-ledger-integration, ddf-iommu-per-device-domain-granularity, ddf-iommu-production-revoke-teardown-hostile-smokes, ddf-real-dma-iommu-direct-path).
  • No-host-physical export discipline. The IOVA-export discipline (ddf-iommu-iova-export-discipline) and the host_physical_user_visible = false / iova_export = "disabled-future-only" posture (kernel/src/cap/dma_buffer.rs) guarantee a driver receives only a device-usable address (bounce handle or domain-scoped IOVA), never a raw host physical address.

The accepted contract is docs/dma-isolation-design.md (“Cloud DMA Backend” runtime-selection rule and the IOVA-export-discipline clause), and the S.11.2 hostile-smoke matrix is already enforced for both backends. The remaining slice-2 work is therefore to let the userspace driver allocate its vring through the granted DMAPool (bounce, or IOVA-backed when the probe verifies usable hardware), learn the device-usable address for each ring, and program those addresses into the queue-address registers over the slice-1 writable window – under the landed fail-closed / scrub / quiesce / revoke discipline. The networking-proposal Phase C prerequisites table and S.11.2 are satisfied by the landed track, not deferred to it.

Bounded Slice Chain

  1. Userspace status + feature handshake to FEATURES_OK over a writable selected-write common-config window (the next register in the accepted notify-doorbell selected-write discipline). [DONE 2026-06-02.] The cap::devicemmio_grant_source_prod source stages the virtio-net common-config window as a writable selected-write DeviceMmio grant (stage_virtio_net_common_config); the userspace shim drives the handshake over DeviceMmio.read32/write32, the write admission (device_manager::stub::write_devicemmio_u32) admits only the four handshake registers (range-checked + read-back-asserted) and refuses queue-address writes; the unimplemented Nic stub is in schema/capos.capnp. Proof make run-cloud-prod-nic-driver-userspace-features-ok. Task record: docs/tasks/done/2026-06-02/cloud-prod-nic-driver-userspace-features-ok-local-proof.md.
  2. Userspace-ownable vring + device-usable address export – reuses the landed production DMA isolation (bounce policy + dma_backend probe + IOMMU IOVA-export, S.11.2 already enforced); the work is wiring it to the driver’s vring. [DONE 2026-06-03.] Under cloud_virtio_net_userspace_ownable_vring_proof (implies slice 1) the userspace shim co-receives the writable common-config DeviceMmio grant and a bounce-buffer DMAPool grant on the same virtio-net function; it allocates its descriptor / available / used ring pages, learns each buffer’s opaque device-usable handle from DMABuffer.info (deviceIova, scope bounce-handle), and programs queue_desc / queue_driver / queue_device over the slice-1 window. device_manager::stub::write_devicemmio_u32 (admit_virtio_queue_address_write) resolves each handle against the live DMAPool grant ledger (resolve_virtio_vring_device_address) to the real bounce host-physical address, programs that address (never the handle), and read-back-asserts; queue-address reads (0x20..0x38) are refused so the host-physical address is never exposed, and out-of-grant / host-physical / stale-generation writes fail closed. queue_enable stays fail-closed. Proof make run-cloud-prod-nic-driver-userspace-ownable-vring. Task record: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-ownable-vring-local-proof.md.
  3. Userspace queue-program + DRIVER_OK over the (now device-addressable) vring. [DONE 2026-06-03.] Under cloud_virtio_net_userspace_queue_enable_driver_ok_proof (implies slice 2) the userspace shim completes device bring-up: after slice 2’s queue-address programming it writes queue_enable = 1 (0x1c) for its programmed TX queue and sets DRIVER_OK over the already-writable device-status register. device_manager::stub::write_devicemmio_u32 admits the queue_enable write only when the active queue’s vring memory is live and page-fitting (selected_queue_ready_to_enable): it reads the active queue_desc/queue_driver/queue_device back kernel-side and requires each to currently hold the host-physical address of a live granted DMABuffer (a freed buffer’s stale address cannot arm a use-after-free DMA target), and requires the active queue_size to fit every split-ring structure inside one granted bounce page; an enable of an unprogrammed, freed, or oversized queue fails closed, and the enable is read-back-asserted. Once enabled, the queue’s vring base registers are immutable – a queue-address repoint is refused so the driver cannot mutate the vring under a running device. The DRIVER_OK device-status write is kernel-asserted: the kernel re-reads device-status and fails closed unless the device latched the ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK byte exactly (rejecting FAILED and DEVICE_NEEDS_RESET). Queue-address reads stay refused; no host-physical is exposed; no new DMA isolation backend. Proof make run-cloud-prod-nic-driver-userspace-queue-enable-driver-ok. Task record: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-queue-enable-driver-ok-local-proof.md. 4a. Userspace RX queue 0 bring-up + buffer-identity binding + ring-buffer pinning, then the first real RX DMA from the shim-owned vring. Split into two landed/ready sub-slices because the real-DMA hybrid bring-up is large:
    • 4a-i [DONE 2026-06-03]. The shim brings up RX queue 0 over its own vring (slices 1-3 brought up only the TX queue; the queue_enable admission is queue-agnostic). device_manager::stub retains each programmed queue’s vring physes + originating DMABuffer handle identity on ProductionDeviceRecord (admit_virtio_queue_address_write), binds queue_enable to that identity (a freed buffer’s stale handle, or a freed-then-reallocated frame at the same host-physical address, fails closed with devicemmio-queue-enable-identity-mismatch), and pins the ring buffers against freeBuffer / process-teardown release while the queue is enabled (dmabuffer-pinned-enabled-vring), releasing only on disable/reset with quiesce. This completes the vring buffer-lifetime binding slice 3 left point-in-time at the bring-up boundary. No device DMA. Proof make run-cloud-prod-nic-driver-userspace-rx-bringup. Task record: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-bringup-and-buffer-pinning-local-proof.md.
    • 4a-ii [DONE 2026-06-03]. The first real RX DMA from the shim vring: the shim also brings up TX queue 1 over its own vring, posts one device-writable RX receive buffer on queue 0 (DMABuffer.submitDescriptor), and rings the production DeviceMmio.notifyDoorbell @5 (the previously Err(stale_handle) provider_notify_doorbell_write_for_cap, now live; cap::devicemmio_grant_source_prod maps the notify region kernel-side and captures the per-queue notify slot offsets). The kernel (cap::virtio_net_userspace_rx_dma_proof) drives the RX publish over the shim’s retained RX physes + a kernel-half SLIRP TX ARP stimulus over the shim’s retained TX physes + one real device->host RX DMA (used_len > 0, observed EtherType 0x0806), latches the used-ring index (int_injected = 0, no Interrupt cap), and resets the device – quiescing the queues and releasing the ring-buffer pins – WITHOUT the Nic cap. Proof make run-cloud-prod-nic-driver-userspace-rx-bringup (extended). The deterministic freed-then-reallocated-frame identity negative is split to a follow-up: the next-fit frame allocator (capos-lib FrameBitmap, free_frame does not rewind next_hint) never returns a just-freed frame on the next allocation, so a deterministic same-phys realloc – needed to reach the slice-4a identity gate rather than the slice-3 phys gate – requires an allocator reuse seam. The data path is cooperative-shim-safe (kernel authors the descriptor + avail inside the drive window, re-validates the posted buffer live + unmapped before publishing, resets on every post-publish path); the HOSTILE-shim residuals (kernel-exclusive/unmappable enabled vring rings + reset-failure payload quarantine) are closed in 4a-iv; the identity negative is closed in 4a-iii. Task records: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-local-proof.md; follow-ups docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-identity-realloc-negative-local-proof.md, docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-hostile-shim-hardening-local-proof.md.
    • 4a-iv [DONE 2026-06-03]. HOSTILE-shim hardening over the 4a-ii data path. The shim owns its vring ring buffers (granted bounce DMABuffers it can DMABuffer.map), so the kernel cannot trust ring contents it did not author and lock. The closed gaps: (1) a programmed/enabled vring ring buffer is made kernel-exclusivevalidate_dmabuffer_map_admission refuses a NEW DMABuffer.map of it while the queue is enabled, selected_queue_identity_bound refuses queue_enable while a ring buffer still carries a live user mapping (a kept pre-enable VMA) and refuses arming a queue whose ring buffers are not pairwise-distinct pages – both within the queue and across the device’s other enabled queues (an aliased desc/driver/device, or rx.desc == tx.desc, would let the kernel-authored ring writes corrupt a descriptor into a non-bounce DMA target), and record_dmabuffer_user_mapping refuses to record a mapping on a programmed/enabled ring (an in-flight SMP map), and the RX-DMA submit/drive admission refuses to publish an enabled ring buffer as the RX DMA payload (the shim cannot point the device at its own ring page); (2) at queue_enable the kernel WIPES the queue’s descriptor table slot 0, available ring, AND used ring (virtio_net_userspace_rx_dma_proof::sanitize_enabled_queue_rings), so a shim that pre-wrote avail.idx / a tampered descriptor / a spoofed used.idx while the queue was disabled cannot pre-publish it into the enabled window – the device sees an empty queue with no pre-staged completion until the kernel-authored drive publishes and a real device DMA advances used.idx; (3) a per-bdf RX-DMA payload drive pin / reset-failure quarantine (device_manager::stub begin_rx_dma_drive_pin / clear_rx_dma_drive_pin / mark_rx_dma_payload_quarantine_permanent, consulted by the map admission + the record path + the freeBuffer/teardown detach) is set atomically with the live+unmapped re-validation under the device-table lock for the drive duration, cleared after a confirmed reset, and promoted to a permanent quarantine on the catastrophic reset-failure path (never downgraded by a later drive). The smoke (make run-cloud-prod-nic-driver-userspace-rx-bringup, extended) proves a still-mapped ring blocks queue_enable, the post-enable map refusal on the descriptor + driver rings, and that a hostile pre-enable descriptor/avail.idx tamper does NOT survive (RX DMA still completes with used_id=0, ARP EtherType). The drive pin’s SMP map/free race and the reset-failure quarantine are structural fail-closed hardening: the single-CPU cloudboot proof cannot reach the race and QEMU virtio reset always succeeds, so they are not separately QEMU-observable. Residual [CLOSED 2026-06-03 in 4a-v]: two SMP-only, bounce-confined races between DMABuffer.map and a device-authority state transition – the cap-side map_page_into_user installed the user PTE before the manager-side record (a manager-rejected map left a transient provisional PTE), and the queue_enable no-live-mapping check was not atomic with the retained enabled flag flip (a concurrent map could record a mapping in the enable window) – were not reachable by the single-CPU proof and are closed by slice 4a-v below. Predecessor task record: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-hostile-shim-hardening-local-proof.md.
    • 4a-iii [DONE 2026-06-03]. The deterministic freed-then-reallocated-frame queue_enable identity negative carved out of 4a-ii. A proof-only one-shot bounce-frame reuse seam (mem::frame::proof_try_alloc_specific_frame_zeroed consumed by a device_dma reuse hint armed on each bounce free, both gated behind cloud_virtio_net_userspace_rx_bringup_proof, never compiled into production) makes a same-host-physical DMABuffer realloc reachable from the userspace harness despite the production next-fit FrameBitmap. The smoke programs a transient ring buffer into rx queue_desc, frees it, relands a fresh-handle buffer on the same frame, and proves queue_enable fails closed on the slice-4a identity gate (authority_result = devicemmio-queue-enable-identity-mismatch) – the recorded phys is live again so the slice-3 phys gate passes, and the marker flips identity_realloc_negative=enforced. The seam does not relax the next-fit policy or the identity gate. Proof make run-cloud-prod-nic-driver-userspace-rx-bringup (extended). Task record: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-identity-realloc-negative-local-proof.md.
    • 4a-v [DONE 2026-06-03]. Closes the two SMP-only, bounce-confined DMABuffer.map-vs-device-authority races split out of 4a-iv. (1) kernel/src/cap/dma_buffer.rs::map_page_into_user now takes the manager-side mapping record (record_dmabuffer_user_mapping, which acquires the device-table lock) BEFORE installing the user PTE, with the caller address-space lock held across the manager call (lock order address-space -> device-table -> quarantine, no reverse nesting): a manager-rejected map returns before any PTE is installed, so a concurrent SMP thread can never touch a transient provisional mapping at the deterministic auto-picked base. (2) kernel/src/device_manager/stub.rs::try_enable_selected_queue folds the queue_enable no-live-mapping identity check and a retained enable_pinned commit pin into one device-table critical section, set before the MMIO arm write and rolled back on a readback mismatch, so a concurrent map cannot record a mapping in the enable window. The commit pin is kept distinct from the device-armed enabled bit (set after the MMIO write): map admission / freeBuffer pinning consult enabled || enable_pinned, while the RX-DMA drive gates on enabled alone, so a concurrent drive cannot run – and reset-clear the retained state – inside the arm window. A per-queue enable_in_progress claim serializes concurrent queue_enable transitions (enable vs enable, enable vs disable) so a racing queue_enable = 0 can never clear an in-flight enable’s commit pin, and the drive’s reset cleanup skips clearing while a transition is in flight so a stale drive cannot clear a newer enable’s pin. The two transitions now serialize on the device-table lock: one fails closed. The interleavings are not single-CPU-reachable and the live kernel statics are not host-testable, so the proof is an exhaustive Loom interleaving model (capos-config/tests/dmabuffer_map_enable_loom.rs, run via cargo test-dmabuffer-map-enable-loom) that asserts no schedule arms a queue with a live user PTE on its ring buffer or installs a PTE without an accepted record; the single-CPU make run-cloud-prod-nic-driver-userspace-rx-bringup regression continues to pass unchanged. Residual [CLOSED 2026-06-03]: the follow-up fenced the drive’s post-reset cleanup on a per-queue transition generation. RetainedVringQueue::generation is bumped on each completed transition (mark_selected_queue_armed enable arm + finish_selected_queue_disable disable finish); the drive reads it (retained_queue_generation) IMMEDIATELY before reset_device – the reset boundary, not the far-earlier enabled sample – and the cleanup (mark_retained_vring_queue_disabled_if_epoch) clears the pins only while the queue is still in the epoch the reset quiesced. A disable + re-enable that fully completes AFTER the reset advances the generation, so the cleanup becomes a no-op instead of clearing the freshly armed epoch’s pins; one that completed BEFORE the reset is included in the captured generation, so its pins clear (no stale over-pin). The only residual is a transition completing in the tiny read->reset-MMIO gap whose arm loses the race: a fail-closed over-pin, never an under-pin. The Loom model gained fenced_stale_drive_cannot_clear_a_completed_re_enables_pins, which fails with the fence removed (GENERATION_FENCE = false). Task record: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-queue-enable-drive-reset-epoch-fence-local-proof.md. Predecessor (4a-v): docs/tasks/done/2026-06-03/dmabuffer-map-record-before-pte-install-ordering-local-proof.md. 4b. Nic-cap driver process, coupled TX/RX round-trip [DONE 2026-06-03]. Implements the slice-1 Nic interface stub as a live CapObject (kernel/src/cap/nic_grant_source_prod.rs, granted via the new nic KernelCapSource registered in capos-config; client NicClient in capos-rt). Over the SAME shim-brought-up device (RX queue 0 + TX queue 1 enabled by slices 1-3 + 4a), the cap drives the shim’s retained vring physes through virtio_net_userspace_rx_dma_proof::{nic_transmit, nic_receive, nic_quiesce}: receive() internally drives the coupled ARP-TX-stimulus + RX-poll and returns the received frame inline plus the observed EtherType; transmit() stages a frame into a manager-owned TX bounce page over the retained TX vring and rings the doorbell; macAddress()/linkStatus() read the kernel-mapped virtio-net device-config region. Frames cross the cap boundary as inline Data copied through manager-owned kernel bounce pages (host_physical_user_visible = 0; no host-physical / device-handle exposure); the device is left live for the cap’s lifetime and quiesced once on cap release (nic_quiesce: reset + queues-cleared + release the enabled-vring pins). Completion stays kernel-latched used-ring polled (int_injected = 0, no Interrupt cap). The clean independent TX/RX split is deferred to slice 6; userspace IRQ ownership to slice 5. Proof make run-cloud-prod-nic-driver-userspace-nic-cap-roundtrip boots the device from userspace, round-trips two sequential frames through the typed Nic cap (observed EtherType 0x0806 over QEMU SLIRP), and emits one cloudboot-evidence: nic-driver-userspace-nic-cap-roundtrip <token> marker with roundtrips=2. The same proof releases the parent DMAPool cap and one pinned ring DMABuffer cap before Nic release, then shows Nic quiesce replaying the blocked buffer detach and the pending parent pool detach completing after the remaining ring buffers are freed. Depends on 4a-ii (the shim-owned-vring data path). Task record: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-nic-cap-roundtrip-local-proof.md.
  4. Userspace IRQ ownership [DONE 2026-06-03]. The userspace NIC driver’s Interrupt cap for its device RX route becomes real, replacing slice-4b’s kernel-latched used-ring polled completion (int_injected = 0). Under cloud_virtio_net_userspace_irq_ownership_proof (implies slice 4b) a new Interrupt grant source (kernel/src/cap/virtio_net_userspace_irq_ownership_proof.rs, replacing the admission-only interrupt_grant_source_prod source via the KernelCapSource::Interrupt arm) programs the staged virtio-net function’s RX MSI-X route (entry 0) mask-first through the landed always-built cap::interrupt_programmed::program_attach_arm_unmask and issues a cap whose: wait blocks on a real interrupt dispatch through the route’s MSI-X / LAPIC dispatch slot (device_interrupt::wait_kernel_injected_dispatch; delivery_count advances, so int_injected flips from 0 – slice 4b had no Interrupt cap on the data path at all). The wake is a bounded kernel-injected dispatch through the route’s real deferred-EOI machinery, not yet a device-autonomous MSI-X raise causally tied to a specific frame; Nic.receive still reads the frame bytes from the used ring, so this slice delivers IRQ-lifecycle ownership (the driver drives the real wait/acknowledge/ mask/unmask), not interrupt-coalesced RX completion. acknowledge retires exactly one deferred LAPIC EOI through device_interrupt::acknowledge_deferred_lapic_eoi_for_route (hardwareDispatchAckDelta = 1); and mask/unmask toggle the route’s own MSI-X vector-control bit (mask-first per PCI 3.0 §6.8.2) plus the manager-attached route state, through pci::set_msix_table_entry_mask + device_interrupt::{mask,unmask}_device_manager_attached_route. The route is torn down (interrupt_programmed::teardown) on cap release. The driver holds this Interrupt cap alongside the slice-4b Nic/DeviceMmio/DMAPool caps on the same function; it brings the device up from userspace, then drives the owned RX-interrupt lifecycle and reads the completed frame back through Nic.receive. The PCI function-level MSI-X enable bit is not toggled and no device-autonomous raise is attempted (device_autonomous_raise=not-attempted, waiter_wake=kernel-injected-dispatch); the landed DMA isolation, the slice-2/3 vring grants, and the buffer-identity / ring-buffer pinning are reused unchanged (host_physical_user_visible = 0, queue-address reads still refused). No new Interrupt interface or method (the existing wait/acknowledge/mask/ unmask become real for this route). Proof make run-cloud-prod-nic-driver-userspace-irq-ownership emits one cloudboot-evidence: nic-driver-userspace-irq-ownership <token> marker. Task record: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-irq-ownership-local-proof.md.
  5. Clean TX/RX split [DONE 2026-06-03]. Decouples the coupled receive path into independent TX and RX submission. Under cloud_virtio_net_userspace_clean_tx_rx_split_proof (implies slice 5) the Nic cap’s receive @1 dispatches to the new virtio_net_userspace_rx_dma_proof::nic_receive_independent instead of the coupled nic_receive: it posts a manager-owned RX receive buffer on the retained RX vring, waits on the driver’s OWNED RX interrupt route (the slice-5 device_interrupt::wait_kernel_injected_dispatch dispatch slot, resolved through virtio_net_userspace_irq_ownership_proof::owned_rx_route), retires the deferred LAPIC EOI, and reads the completed frame from the RX used ring – with no internal ARP-TX self-stimulus (it never submits to the TX vring). The RX frame is driven by an external stimulus: the consumer’s preceding independent Nic.transmit of a real broadcast ARP request, which QEMU SLIRP answers; the inbound reply is held in the host net queue until the RX buffer is posted. Nic.transmit stays independent (submits the caller’s frame and rings the TX doorbell with no RX poll; surfaces rx_polls=0). The wake stays the bounded kernel-injected dispatch slice 5 owns (waiter_wake=kernel-injected-dispatch, device_autonomous_raise=not-attempted). Reuses the landed owned-vring / owned-IRQ / DMA-isolation unchanged: no new selected-write register, no new MSI-X surface, no new Nic/Interrupt method (make generated-code-check green), no host-physical / handle exposure (host_physical_user_visible = 0, queue-address reads refused). The driver does an independent transmit then a separate independent receive, neither performing the other’s submission (tx_independent=ok, rx_independent=ok, receive_self_stimulus=removed). Proof make run-cloud-prod-nic-driver-userspace-clean-tx-rx-split emits one cloudboot-evidence: nic-driver-userspace-clean-tx-rx-split <token> marker. Task record: cloud-prod-nic-driver-userspace-clean-tx-rx-split-local-proof.
  6. Network-stack process + smoltcp relocation – a second userspace process holding the Nic cap and a bounded time source, running smoltcp, implementing the socket caps while preserving the cap/network.rs contract. Slice 6 is now done, so this slice is decomposed into bounded increments rather than attempted as one step:
    • 7a (first increment, DONE 2026-06-03): network-stack-process skeleton. A userspace process runs a minimal smoltcp Interface over a phy::Device adapter backed by the landed independent Nic.transmit/Nic.receive (slice 6), clocked by a Timer cap, and drives one observable Ethernet exchange through SLIRP: smoltcp – not hand-rolled frame code – ARPs the gateway out through Nic.transmit, consumes the reply in through Nic.receive, and emits the queued IPv4/UDP datagram, so the neighbour cache observably advances. No socket caps, no cap/network.rs relocation, virtio_stub.rs unchanged. Proof make run-cloud-prod-network-stack-process-smoltcp-skeleton. Implementation note: the landed Nic cap is not yet self-sufficient – its transmit/receive ride on the userspace driver shim’s retained vring (the kernel does not own the vring), so the skeleton process performs the slice-1-6 bring-up itself (it also holds the DeviceMmio/DMAPool/Interrupt caps) before running smoltcp. Splitting the bring-up into a separate long-lived NIC-driver service, so the network-stack process holds only Nic + Timer + Console, is folded into the 7c contract-relocation increment; it does not change the proven smoltcp-substrate claim. Task record: cloud-prod-network-stack-process-smoltcp-skeleton-local-proof.
    • 7b (socket layer, DONE 2026-06-03): socket caps over the userspace smoltcp stack – a userspace UdpSocket cap layer (UdpSocketCapLayer) implements the UdpSocket schema’s sendTo/recvFrom semantics over the 7a Interface and proves one bounded UDP request/response: a DNS A query for example.com to SLIRP’s resolver at 10.0.2.3:53 via sendTo, then the decoded response via recvFrom. smoltcp drives every frame through the Nic cap (ARP reply + DNS reply both fetched through Nic.receive, host_physical_user_visible = 0 preserved); Timer clocks the poll. The socket layer is in-process – it does not yet serve the socket interfaces as inter-process transferable capabilities, and it does not touch cap/network.rs (virtio_stub.rs stays fail-closed). Proof make run-cloud-prod-network-stack-smoltcp-socket-caps (one cloudboot-evidence: network-stack-smoltcp-socket-caps <token> marker). Task record: cloud-prod-network-stack-smoltcp-socket-caps-local-proof. Serving the socket interfaces as inter-process caps (a NetworkManager-like broker) and TcpListener/TcpSocket are folded into the 7c contract relocation.
    • 7c (contract relocation): preserve the cap/network.rs contract behind the userspace network-stack process so the production L4 entry points (virtio_stub.rs) stop returning DeviceUnavailable for the armed manifest. This is the body of the whole-slice record (cloud-prod-userspace-network-stack-smoltcp-local-proof), decomposed because it bundles three independently-large pieces:
      • 7c-i (inter-process socket cap, DONE 2026-06-03): serve the slice-7b userspace UdpSocketCapLayer as a real inter-process transferable capability. A network-stack server process brings the device up, builds the userspace smoltcp UdpSocket layer, and serves the UdpSocket schema (sendTo/recvFrom/close) over an exported Endpoint; a separate client process re-interprets the served cap as a UdpSocket and drives one bounded DNS A query/response through the production UdpSocketClient, with smoltcp still moving every frame through the Nic cap (host_physical_user_visible = 0). Proof make run-cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc. This is the prerequisite for the kernel contract relocation. cap/network.rs / virtio_stub.rs are unchanged. Task record: cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc-local-proof.
      • 7c-ii (cap/network.rs relocation, DONE 2026-06-07): route the armed Phase C manifest’s production L4 socket entry points to the userspace network-stack service so the local proof no longer reaches virtio_stub.rs DeviceUnavailable. This is itself decomposed (see “7c-ii Mechanism and Decomposition” below) into 7c-ii(a) – serve TcpListener/TcpSocket as inter-process caps with accept-returns-a-result-cap, an architecture-agnostic prerequisite (cloud-prod-network-stack-smoltcp-tcp-socket-cap-ipc-local-proof) – and 7c-ii(b) – the local serve-from-userspace proof. In 7c-ii(b) (DONE 2026-06-07), the network-stack process boots the non-qemu cloudboot kernel, spawns an application client with only Console plus a userspace-served TcpListenAuthority, returns a userspace-served TcpListener from listen, returns a TcpSocket result cap from accept, and completes one hostfwd TCP request/response through recv/send. The selected architecture remains serve-from-userspace: applications consume the existing typed socket interfaces from a userspace network-stack process instead of extending the legacy kernel-routed socket owner.
      • 7c-iii (TcpListener/TcpSocket, DONE 2026-06-04): the userspace smoltcp stack now serves a real TcpListener/TcpSocket round trip over the Nic cap, using the sustained-receive Nic.receivePoll @4 (slice 7d) for the multi-frame TCP exchange. A single cloudboot service brings the device up from userspace, runs a smoltcp TCP socket listening on port 8080 driven by the non-resetting receivePoll @4 pump, and – against an external host TCP client over a QEMU hostfwd relay – completes one bounded TCP handshake + request/response (asserting the received request equals the expected probe and echoing it back). smoltcp, not hand-rolled frame code, moves every frame; the device stays armed across the SYN/SYN-ACK/ACK/request/response/FIN exchange with no per-frame reset. host_physical_user_visible = 0; queue-address reads refused; the bounce RX pool quiesces + scrubs on teardown. This increment proved the generic TCP socket substrate in-process; 7c-ii(a) later served TcpListener/TcpSocket inter-process with accept-returns-a-result-cap. Neither increment changes cap/network.rs / virtio_stub.rs; that final production-manifest wiring is 7c-ii(b). Proof make run-cloud-prod-network-stack-smoltcp-tcp-listener-roundtrip. Task record: cloud-prod-network-stack-smoltcp-tcp-listener-roundtrip-local-proof. Built on the prerequisite cloud-prod-nic-driver-userspace-sustained-receive-pool-local-proof.
  7. Kernel smoltcp / virtio-net removal (Phase C exit)done 2026-06-08. The kernel no longer depends on smoltcp, and the qemu-only cap/network.rs socket entry points now fail closed instead of reaching an in-kernel TCP/UDP runtime. The retained virtio-net code is a lower-layer QEMU fixture for PCI/MMIO/virtqueue, ARP, ICMP, and descriptor-generation proofs; it is not the production cloud socket owner. Task record: cloud-prod-phase-c-kernel-smoltcp-virtio-net-removal.

Each slice is parented to the appropriate predecessor; slices 7-8 deliver the L4 socket reachability the scope decision sequenced after the cloud milestone.

7c-ii Mechanism and Decomposition

7c-ii is “preserve the cap/network.rs contract behind the userspace network-stack process.” Before the serve-from-userspace path landed, the non-qemu cloud kernel could still grant production L4 methods (NetworkManager.createTcpListener, TcpListenAuthority.listen, TcpListener.accept, TcpSocket.send/recv, the UDP methods) through kernel-side CapObjects in kernel/src/cap/network.rs that call crate::virtio::*. In that build, crate::virtio resolves to kernel/src/virtio_stub.rs: creation entry points return NetworkError::DeviceUnavailable, while existing-handle operations fail closed with invalid-handle errors. Before Phase C exit cleanup, the working smoltcp runtime existed only in the cfg(qemu) kernel/src/virtio.rs; it has now been removed, and the qemu socket entry points match the same fail-closed shape. Relocating the contract means production L4 methods are satisfied by the userspace network-stack service instead of the stub, and non-qemu bootstrap grants for the legacy kernel NetworkManager and TcpListenAuthority sources now fail closed before those CapObjects are minted.

Mechanism constraint. A literal reading – the kernel keeps serving the L4 caps and forwards each call to the userspace service – requires the kernel to originate a capability Call to a userspace-served Endpoint and complete the original caller once the service returns. That kernel-as-client-of-a-userspace- service path does not exist today: the ring/endpoint machinery (kernel/src/cap/ring.rs, kernel/src/cap/endpoint.rs, kernel/src/cap/transfer.rs) only dispatches SQEs from a userspace process’s ring; the kernel never enqueues a Call against a userspace endpoint nor parks waiting for the Return. Building that inversion is a new kernel IPC subsystem, not a drive-by, and it adds kernel surface that the Phase C exit deliberately avoids by retiring the kernel L4 owner instead.

7c-ii(b) Architecture Decision: Serve From Userspace

On 2026-06-07 the operator selected serve-from-userspace for the final production-manifest wiring. The armed Phase C manifest receives a userspace-served NetworkManager or TcpListenAuthority cap from the network-stack service. The legacy kernel cap/network.rs / virtio_stub.rs socket path is now fenced to cfg(qemu) fixture manifests and stale negative paths: non-qemu manifests that request kernel network_manager or tcp_listen_authority are rejected during bootstrap, so missing served authority does not fall back to the old kernel socket owner.

The rejected alternative for this stage is kernel-brokered forwarding: keeping the kernel as the socket cap server while it calls into a registered userspace network-stack service. That route would add a new kernel-originated Endpoint/deferred-completion subsystem. The selected direction keeps the microkernel boundary cleaner by moving behavior to userspace where that does not compromise security.

Grounding for this decision:

  • kernel/src/cap/network.rs implements the current kernel-served NetworkManager, TcpListenAuthority, TcpListener, TcpSocket, and UdpSocket objects, including result-cap transfer for listener and socket creation.
  • kernel/src/virtio_stub.rs remains the non-qemu negative-result endpoint for stale kernel networking call sites, but bootstrap no longer grants its NetworkManager or TcpListenAuthority callers in production manifests.
  • kernel/src/cap/ring.rs, kernel/src/cap/endpoint.rs, and kernel/src/cap/transfer.rs implement userspace-originated Endpoint calls, receive/return, and capability transfer. They do not currently implement a kernel-originated call into a userspace endpoint.
  • The 7c-i and 7c-ii(a) task records prove userspace-served socket caps over the existing Endpoint and RETURN result-cap path: cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc-local-proof and cloud-prod-network-stack-smoltcp-tcp-socket-cap-ipc-local-proof.
  • docs/research/capability-systems-survey.md and docs/research/spritely-captp-ocapn.md reinforce the local capOS rule used here: explicit object references are authority, and forwarding/proxying should make authority flow and lifetime explicit rather than hiding it behind ambient names.

Current constraints that both options must preserve:

  • Existing caller contract. NetworkManager.createTcpListener, TcpListenAuthority.listen, TcpListener.accept, TcpSocket.send/recv, and the UDP methods keep the same typed Cap’n Proto surfaces unless a later implementation task explicitly changes schema and regenerated bindings.
  • Manifest authority. Only the armed Phase C manifest receives the new production L4 path. Missing registration, missing served cap, stale service identity, or an unarmed manifest fails closed instead of falling back to a broad kernel escape hatch; the legacy kernel socket sources are qemu-only fixtures.
  • No new packet authority. The socket service consumes the existing Nic cap and the landed receivePoll @4 sustained-receive path. This decision does not add DMA, MMIO, IRQ, queue-address, host-physical, GCE, public ingress, TLS, or certificate authority.
  • Private/local Web UI critical path. The first Web UI proof remains local and then private GCE reachability. Public ingress, TLS, firewall/DNS exposure, and live-cloud runs stay gated by their separate task records.
AxisKernel-brokered forwarding (rejected for 7c-ii(b))Serve-from-userspace (selected)
Cap server identityThe kernel keeps serving cap/network.rs objects for the armed manifest and forwards each socket operation to a registered userspace network-stack service.The armed manifest receives a userspace-served NetworkManager or TcpListenAuthority cap from the network-stack service, following the 7c-i / 7c-ii(a) serving pattern.
Kernel IPC deltaAdds a kernel-as-client forwarding path: service registration, endpoint identity validation, kernel-originated call construction, transfer handling, cancellation, and deferred completion back to the original caller.Adds no kernel-originated endpoint call path. The existing userspace caller-to-userspace endpoint path carries the socket methods and result caps.
ABI and IPC riskHigher. The existing ring/endpoint code accepts userspace SQEs and endpoint returns; it does not yet encode the kernel as an endpoint caller. The implementation must specify caller-session metadata, transfer rollback, service disappearance, cancellation, and result-cap insertion semantics for forwarded calls.Lower kernel ABI risk. Schema can remain unchanged if the served cap implements the current socket interfaces. Risk shifts to manifest grant wiring, service startup ordering, endpoint lifetime, and making stale or missing service authority fail closed.
Production-surface compatibilityPreserves the literal “kernel-routed cap/network.rs surface” for the armed manifest, which minimizes caller-visible routing change.Makes the production socket cap userspace-served for the armed manifest. Callers still use the same typed socket interfaces, but the authority source is the network-stack service rather than cap/network.rs.
Fail-closed behaviorRequires a bounded registration table and an explicit “no registered service” error path. Forwarding must not silently reach virtio_stub.rs except as a deliberate unarmed-manifest failure mode.Uses manifest-level grant selection and service liveness checks. If the served cap is absent, stale, or not granted, the non-qemu manifest fails closed; virtio_stub.rs remains only a stale-call/fixture negative path.
Validation burdenMust prove forwarding with at least one socket operation, result-cap transfer through the forwarded path, service crash/cancel cleanup, missing-service failure, and no kernel-thread parking. Before Phase C exit, make run-net covered the old qemu-only socket path.Must prove the armed manifest gets the served network cap, completes the socket round trip through the userspace service, rejects an unarmed or missing-service manifest, and preserves the existing inter-process socket-cap proofs. Phase C exit keeps only lower-layer QEMU virtio-net fixture coverage after the old kernel L4 owner is removed.
Phase C exit interactionLeaves a new kernel forwarding subsystem after kernel smoltcp and virtio-net hot-path removal. Slice 8 would have had to decide whether that subsystem was permanent generic IPC infrastructure or Phase C-specific scaffolding.Aligns with the Phase C exit direction that kernel networking becomes routing/grant setup while L4 service behavior lives out of kernel. Slice 8 removed kernel smoltcp/virtio-net L4 ownership and stale socket paths.

Follow-up task-state changes after the selection:

This decision unblocks 7c-ii(b) implementation. It preserves the fact that 7c-ii(a) was sequenced first because serving TcpListener/TcpSocket as inter-process caps is useful under either architecture, and it now becomes the direct serving substrate for the selected path.

Result-cap transfer needs no new ABI. The RETURN transfer-descriptor ABI that lets a served method return a freshly-minted capability already exists: a CAP_OP_RETURN carries xfer_cap_count transfer descriptors, the kernel inserts them into the caller’s table and publishes CAP_CQE_TRANSFER_RESULT_CAPS (kernel/src/cap/ring.rs::dispatch_return -> insert_prepared_transfer_caps -> write_endpoint_return_result), and the client decodes the returned cap with capos-rt’s CompletedCall::result_cap. The 7c-i UDP server already serves caps over an exported Endpoint through exactly this RETURN path, so the userspace-Endpoint-server side is proven; the kernel’s own cap/network.rs::handle_accept (via insert_socket_result_cap) and the telnet-gateway demo (listener.accept_wait; both since retired) proved the complementary accept-returns-a-TcpSocket-cap shape and its client-consume side. What 7c-ii(a) adds is wiring – a userspace Endpoint server returning a TcpSocket result-cap from TcpListener.accept – not a new ABI. Its hard part is the smoltcp-pump-while-serving interleaving for a blocking multi-frame accept, which is why 7c-iii deferred inter-process TCP serving.

Downstream self-hosted Web UI tasks consume slice 7 without making the Phase C exit cleanup a blocker for first GCE operator proof:

Network usability and post-smoltcp follow-ups are decomposed in Network Usability and Post-smoltcp. They do not change the Phase C critical path above: the local DHCP/IPv4 configuration proof is now done for the first GCE Web UI path, while the system DnsResolver cap, POSIX getaddrinfo, ping/ping6 tools, packet tracing, socket readiness policy, and transport tuning/status are follow-on usability or diagnostics lanes.

Slices 7a-7c are smoltcp relocation: they run the selected smoltcp 0.13.0 build in userspace and preserve the socket contract, adding no new transport mechanic. Transport policy/status — read-only transport status, keepalive/ timeout policy inputs, and the deferred congestion-control evaluation — is a distinct control-plane lane decomposed in the backlog under network-transport-policy-status-decomposition, not part of the relocation slices. The userspace relocation track now has landed UDP and TCP substrate proofs, and the TCP build still runs with CongestionControl::None by build configuration; selecting Reno/CUBIC is a build-feature flip, and any custom TCP mechanic requires separate workload evidence and a reviewed task.

The Sustained-Receive Nic ABI (Former Prerequisite For 7c-iii)

7c-iii (TcpListener/TcpSocket) was blocked on the shape of the landed Nic.receive. This section records the precise constraint, the sustained-receive primitive that lifted it while keeping the settled DMA isolation intact, and the ABI decision.

Status: landed. The Nic.receivePoll @4 method and the kernel-owned bounce RX pool primitive designed below shipped in slice 7d (cloud-prod-nic-driver-userspace-sustained-receive-pool-local-proof): cap::virtio_net_userspace_rx_dma_proof::nic_receive_poll arms a pool of NIC_RX_POOL_SIZE manager-owned bounce RX buffers and recycles them individually (copy-out + scrub + slot-generation bump + re-post) with no per-frame device reset; the receivePoll @4 dispatch arm lives in cap::nic_grant_source_prod. The multi-frame proof (make run-cloud-prod-nic-driver-userspace-sustained-receive-pool) drains more than one frame with at least one non-resetting framePresent = false poll and keeps the DMA-isolation assertions green (host_physical_user_visible = 0, queue-address reads refused, quiesce + scrub at teardown). receive @1 is unchanged. The rest of this section is the as-built design record.

The constraint, precisely (cite the real symbols)

The production Nic.receive @1 dispatches (under the frontier slice-6 cloud_virtio_net_userspace_clean_tx_rx_split_proof feature) to virtio_net_userspace_rx_dma_proof::nic_receive_independent (kernel/src/cap/virtio_net_userspace_rx_dma_proof.rs:1437, dispatched from the receive @1 arm in kernel/src/cap/nic_grant_source_prod.rs:309). Each call:

  1. allocates one fresh kernel bounce frame (frame::alloc_frame_zeroed),
  2. authors one RX descriptor + avail entry pointing at it and rings the RX doorbell,
  3. waits on the driver’s owned RX interrupt route (device_interrupt::wait_kernel_injected_dispatch) and retires the deferred LAPIC EOI,
  4. polls the RX used ring for one completion, copies the frame out, and
  5. frees that bounce frame (frame::try_free_frame).

Two properties make this single-frame primitive unable to serve TCP:

  • No pool stays armed between calls. Exactly one RX buffer is posted per call and freed when the call returns; between receive calls the device has no posted RX buffer to master into. A frame that arrives outside a call has nowhere to land.
  • No non-resetting “no frame yet”. On a successful frame the independent path keeps the device live (it does not reset – success arm at virtio_net_userspace_rx_dma_proof.rs:1535), but an empty/timeout poll (RxDmaFailure::UsedRingPollExhausted) takes the error arm and quiesces the device (nic_quiesce_device: reset_device + assert queues cleared + release pins). The predecessor proof path drive_rx_dma (virtio_net_userspace_rx_dma_proof.rs:331) resets on every outcome; the independent path narrowed that to reset-on-empty-poll, but the reset-on-empty remains. So a speculative “is there a frame?” poll with nothing waiting tears the device down.

smoltcp drives the opposite shape: its poll() loop calls the phy::Device RX token speculatively and frequently, expecting “no frame yet” to be a cheap, side-effect-free answer against a device that stays armed with multiple posted buffers, so the asynchronous, multi-frame TCP exchange (SYN-ACK, data segments, ACKs, retransmits arriving whenever the peer chooses, sometimes several in quick succession) can be drained as frames arrive. That is why 7c-iii needed the since-landed sustained-receive ABI before the remaining 7c-ii(b) final-wiring task could stay on hold solely for the operator architecture decision.

The reset-on-empty is not a bug: it is part of the settled DMA-isolation model (docs/dma-isolation-design.md). Frames cross through manager-owned bounce pages; userspace never sees a host-physical or device-usable address; and after a buffer is reclaimed the device must be proven to have stopped mastering it (“if in-flight DMA cannot be proven stopped, revocation escalates to device reset”, docs/dma-isolation-design.md DMAPool Invariants -> Reset). The single-frame path proves that the crude way – reset the whole device. The design problem is to keep the isolation guarantee without resetting the device on every empty poll or every reclaimed buffer.

The sustained-receive primitive

Keep the device armed with a kernel-owned bounce RX pool of N buffers and recycle buffers individually instead of resetting the device:

  • Arm. At first use (or at cap setup) the kernel allocates N manager-owned bounce RX buffers from the driver’s granted DMAPool and posts all N to the RX vring avail ring. The device masters only into these kernel-owned bounce pages; userspace still receives no host-physical or device-usable address (host_physical_user_visible = 0), exactly as the single-frame path proved.
  • Drain one arrived frame (per poll). The kernel reads the RX used ring. If used.idx advanced, a frame landed in bounce slot k; the kernel treats slot k as device-written untrusted input (docs/dma-isolation-design.md “Receive buffers are treated as device-written untrusted input until validated by the driver or stack”), copies the frame bytes out into the inline Data reply bounded by the posted buffer length, then recycles slot k.
  • No frame yet. If used.idx did not advance, the call returns “no frame” with no reset and the device stays armed. This is the cheap speculative poll smoltcp needs.
  • Teardown. on_release (and any unprovable-in-flight-DMA error) still quiesces: reset_device, assert both queues cleared, scrub the whole pool, release the enabled-vring pins, and drop the pool – identical to the existing nic_quiesce discipline. Reset remains the escalation path; it is simply no longer the per-frame path.

The per-buffer invariant that replaces “reset before reclaim”: a bounce slot is re-exposed to the device only after its copy-out completes and its slot ownership generation is bumped, with the slot scrubbed before the re-post. This is the production handle-epoch slot identity (slot + slot_generation, docs/dma-isolation-design.md Production Handle Epoch Invariants and DMAPool Invariants: “Buffer operations additionally check the buffer slot and slot generation before descriptor validation, completion accounting, free, scrub, or reuse”) applied at buffer-recycle granularity instead of device-reset granularity:

  1. the device wrote slot k and signalled completion via the used ring (it is no longer mastering that slot – the per-buffer analogue of “in-flight DMA is proven stopped”);
  2. the kernel copies the bytes out;
  3. the kernel scrubs slot k (residual-state rule, docs/dma-isolation-design.md Residual state) and bumps its slot generation, so a stale descriptor, free, or completion for the prior occupant fails closed;
  4. only then does the kernel re-post slot k to the avail ring (re-arm).

As built, the pool buffers are kernel-private frame::alloc_frame_zeroed pages (never a userspace DMABuffer handle), so the slice does not go through the single-frame device_dma begin_rx_dma_drive_pin drive-pin code path (that pin guards a userspace-submitted DMABuffer’s live-unmapped re-validation, which has no analogue for a manager-private frame). Instead the slice applies the same per-buffer slot-identity discipline modeled on the production handle-epoch slot identity (slot + slot_generation, docs/dma-isolation-design.md) as kernel bookkeeping local to the pool: the device masters only into kernel-owned pages, each slot is scrubbed and its generation bumped before re-exposure, and teardown still quiesces (reset) before reclaim. No new isolation backend, no new IOVA-export rule, no host-physical or device-usable address exported (host_physical_user_visible = 0).

ABI decision: extend Nic with a non-resetting poll receive (option a)

Chosen: option (a) – add a non-resetting poll-receive method to the Nic schema and keep receive @1 as the legacy single-shot:

interface Nic {
    transmit  @0 (frame :Data) -> (result, reason, sideEffect);
    receive   @1 () -> (frame :Data, observedEthertype :UInt16,
                        result, reason, sideEffect);   # legacy single-shot
    macAddress @2 () -> (addr :Data, result, reason, sideEffect);
    linkStatus @3 () -> (up :Bool, result, reason, sideEffect);
    # Sustained, non-resetting receive over the armed kernel-owned bounce RX
    # pool. Returns the next arrived frame, or framePresent = false with no
    # device reset when none has arrived. The device stays armed.
    receivePoll @4 () -> (frame :Data, observedEthertype :UInt16,
                          framePresent :Bool,
                          result :Text, reason :Text, sideEffect :Text);
}

How receivePoll @4 reports its two outcomes without a reset:

OutcomeframePresentframeresult / reason / sideEffect
frame arrivedtrueframe bytes inlineok / frame-received / buffer-recycled (slot copied-out, generation bumped, re-posted)
no frame yetfalseemptyok / no-frame / device-armed (no reset; pool still posted)
fail closedfalseemptyfailed / <reason> / device quiesced (escalation only, not the empty-poll path)

receive @1 semantics are unchanged (it still resets on empty poll), so the 7b UDP request/response proof stays green; only the new method is non-resetting.

Why not option (b) (in-stack sustained-receive loop against a kernel primitive). Option (b) – the network-stack process drives a bounded sustained-receive loop against a kernel primitive directly – was rejected because it still requires the identical new kernel RX-pool machinery (the design above) and drives it outside the typed Nic cap boundary, fragmenting the device-facing authority the whole track funnels through one Nic cap (slices 4b-7). Option (a) keeps the network-stack process holding exactly the Nic cap it already holds, maps the new method one-to-one onto the smoltcp phy::Device RX token (“give me the next arrived frame or nothing”), and is the direct extension of the inline-Data receive ABI the track already accepted. The cost of (a) is bounded and already named in the abi hazard: one new schema method, a make generated-code-check regeneration of the checked-in capnp bindings, and updating every exhaustive Nic method match (receivePoll @4 arm in kernel/src/cap/nic_grant_source_prod.rs; the NicClient in capos-rt). A receiveBatch returning List(Data) was considered and deferred: the smoltcp RX token consumes one frame at a time, so single-frame poll is the clean match and batching is a separately-justified later optimization.

Design Grounding

  • docs/proposals/network-reachable-datapath-scope-decision.md – the parent scope decision (Option A) that opened this track.
  • docs/proposals/networking-proposal.md, Part 3 (Phase C) – the accepted decomposition, its prerequisites table and exit criteria, and the Nic draft this doc adopts (inline Data).
  • docs/dma-isolation-design.md S.11.2 – the DMA-isolation invariants and the driver-transition gate, already satisfied by the landed bounce / IOMMU-IOVA track that slice 2 reuses.
  • kernel/src/cap/device_mmio.rs (notify_doorbell), kernel/src/cap/dma_buffer.rs (export labels), kernel/src/cap/virtio_net_polled_provider.rs (kernel-owned vring) – the primitives whose current limits define the cap-surface gap.