# Phase C: Userspace virtio-net Driver Relocation

This is the L4 track opened by the
[Network-Reachable Datapath Scope Decision](network-reachable-datapath-scope-decision.md)
(Option A): raw-frame TX/RX reachability is the cloud milestone bar, and the L4
socket path -- relocating `smoltcp` and the `cap/network.rs` socket caps out of
the `cfg(qemu)` `kernel/src/virtio.rs` into userspace processes
(networking-proposal **Part 3, Phase C**) -- is a separate future track. This doc
designs that track and sequences its slices.

The first Phase C Web UI path remains IPv4-scoped: userspace L4 plus DHCP/IPv4
configuration, ARP, and the private/public GCE Web UI proofs. IPv6 is tracked as
a separate network-stack capability lane in
[`networking-proposal.md`](networking-proposal.md#ipv6-support-status-and-task-lane)
and the hardware/cloud backlog. Phase C must preserve enough address-family
shape for that lane, but lack of IPv6 does not block the first IPv4 GCE Web UI
proof.

## Cap-Surface Delta: What The Userspace Driver Needs

The current `DeviceMmio` / `DMAPool` / `Interrupt` cap surface does not yet host
the virtio-net driver in userspace as built, but the missing pieces are bounded
extensions of accepted patterns and a reuse of the landed production
DMA-isolation track -- not new isolation built from scratch. Per-primitive
evidence:

- **`DeviceMmio` gives a read-only BAR window with one selected write today.**
  `DeviceMmio.map` returns a read-only BAR page; raw `write32` is refused
  (`register_write = "blocked"`), with exactly **one** selected write permitted
  -- the notify doorbell at `@5` (`notify_doorbell`,
  `kernel/src/cap/device_mmio.rs`). A driver must additionally write the virtio
  common-config window (device status, feature-select/feature, queue-select,
  queue-size, queue-address/queue-enable). The relocation adds these as further
  **selected writes** under the same accepted range-check + read-back discipline
  the notify doorbell already enforces (see "The Common-Config Window" below) --
  not a new write primitive.
- **`DMAPool` does not yet export a device-usable address to this driver, but
  the export discipline is landed.** `DMAPool` gives one bounce page; the
  host-physical / device address is not exported in the bounce posture
  (`host_physical_user_visible = false`, `direct_dma = "blocked"`,
  `iova_export = "disabled-future-only"`, `kernel/src/cap/dma_buffer.rs`). The
  vring is kernel-owned today
  (`kernel/src/cap/virtio_net_polled_provider.rs`), so userspace does not yet
  place its own descriptors. The mechanism to let it do so safely -- a
  manager-owned bounce buffer or a domain-scoped IOMMU IOVA, never a raw
  host-physical address -- is **already landed** (the production DMA-isolation
  track; see "The Userspace-Ownable vring Slice" below); the slice-2 work wires
  it to the driver's vring.
- **`Interrupt` is wait-only over a kernel-latched used ring.** `Interrupt.wait`
  reads a kernel-latched used-ring index, `acknowledge` is a no-op, and
  mask/unmask are refused. A real driver owns its IRQ lifecycle (mask, unmask,
  EOI ordering).

Classifying the virtio-net bring-up steps against what userspace can do today
makes the gap concrete -- almost every step is kernel-only:

| Bring-up step | Userspace-doable today? |
|---|---|
| Device reset (write status = 0) | No -- needs writable status register |
| ACKNOWLEDGE / DRIVER status bits | No -- needs writable status register |
| Feature negotiate (select + read + write + FEATURES_OK) | No -- needs writable feature-select/feature + status |
| Queue program (queue-select, queue-size, queue-address, queue-enable) | No -- needs writable common-config + a device-usable vring address |
| vring allocation (avail/used/descriptor tables) | No -- vring is kernel-owned; no device-usable buffer address export |
| DRIVER_OK (write status) | No -- needs writable status register |
| MSI-X program / vector assignment | No -- kernel-owned |
| Submit + notify (ring doorbell) | Partial -- the one selected doorbell write `@5` exists |
| Poll used ring | Partial -- via the kernel-latched index `Interrupt.wait` reads |
| Teardown (reset, scrub, release) | No -- kernel-owned reset path |

## The `Nic` ABI (Inline `Data`, Per the Proposal Draft)

This track keeps the networking-proposal Part 3 frame ABI: frames cross the cap
boundary as inline `Data` (`transmit @0 (frame :Data)`, `receive @1 () ->
(frame :Data)`, networking-proposal:443). The kernel copies the frame into and
out of the manager-owned bounce buffer the polled provider already established,
so no host-physical address or device-usable buffer handle is exported to
userspace and `host_physical_user_visible=0` is preserved. capOS method
convention adds the `result`/`reason`/`sideEffect` evidence triple and the
observed EtherType to the result:

```capnp
interface Nic {
    transmit @0 (frame :Data)
        -> (result :Text, reason :Text, sideEffect :Text);
    receive  @1 ()
        -> (frame :Data, observedEthertype :UInt16,
            result :Text, reason :Text, sideEffect :Text);
    macAddress @2 () -> (addr :Data);
    linkStatus @3 () -> (up :Bool);
}
```

**Why inline `Data`, not a zero-copy buffer handle.** A `DmaBuffer`-handle
(zero-copy) ABI was considered -- it would avoid the per-frame copy -- but
rejected to keep the change small: it introduces a new buffer-ownership protocol
across the cap boundary (who allocates, who frees, lifetime versus the call) on
top of the security work this track already requires, for a copy cost that does
not matter at research scale. Inline `Data` matches the accepted proposal draft,
keeps the frame staging kernel-owned exactly as the polled provider proved, and
defers any zero-copy optimization to a later, separately justified slice.

## The Common-Config Window (Selected-Write, No New Ruling)

Relocation writes the virtio common-config window from userspace through a
bounded, range-checked, **selected-write** path modeled on the existing single
selected write (`notify_doorbell @5`, `kernel/src/cap/device_mmio.rs`;
`device_manager::provider_notify_doorbell_write_for_cap`). This is **the next
register in the accepted selected-write pattern, not a new security relaxation
requiring a ruling.** `DeviceMmio` already refuses raw `write32`
(`register_write = "blocked"`) and admits exactly the claimed selected write,
range-checked against the decoded BAR and followed by a kernel-asserted
read-back; the handshake registers are added to that same admission list under
the same discipline.

The bounded design:

- A **selected-write** common-config window: only the named virtio common-config
  registers needed for handshake (device status, feature-select, device-feature,
  driver-feature) are writable in slice 1, each range-checked against the claimed
  BAR.
- **Read-back-assertion discipline**: every selected write is followed by a
  read-back the kernel asserts, so a userspace driver cannot leave the device in
  an unverified state.
- **Queue-address registers stay fail-closed in slice 1** and are admitted to the
  same selected-write list only in slice 2, where each programmed value must
  resolve to a device-usable address the writing driver was granted (the
  DMA-isolation discipline below decides), so userspace can never point the
  device at arbitrary physical memory.

No new ruling is pending: the project already decided this posture through the
accepted selected-write discipline and the IOVA-export discipline below. Slice 1
is **ready**.

## The Userspace-Ownable vring Slice (Reuses Landed DMA Isolation)

The expensive-sounding piece -- **slice 2**, a userspace-ownable vring plus a
device-usable buffer-address export -- is **wiring already-landed isolation to
the driver's vring, not building isolation from scratch.** The two backends and
the no-host-physical export discipline are all landed:

- **Bounce-buffer path (production default on no-IOMMU shapes).** The runtime
  DMA-backend probe (`kernel/src/dma_backend.rs`, `select_and_report` /
  `probe_verified_usable_iommu`) selects the labeled bounce-buffer fallback
  fail-closed when no usable guest IOMMU is verified, and the manager-owned
  bounce-buffer `DMAPool` / `DMABuffer` lifecycle (`kernel/src/device_dma.rs`,
  scrub-before-free, owner/slot generations, quiesce-before-release) is the
  landed authority a driver uses for device-visible buffer memory
  (`cloud-prod-dmapool-bounce-buffer-grant-proof`).
- **IOMMU-IOVA path (graduates when the probe verifies usable hardware).** The
  Intel VT-d remapping path (`kernel/src/iommu.rs`, `cfg(qemu)` today) programs
  per-device domains, maps manager-owned `DMAPool` pages, and exports **only** a
  domain-scoped IOVA -- never a host-physical address
  (`ddf-iommu-remapping-production-closeout`,
  `ddf-iommu-production-dmapool-ledger-integration`,
  `ddf-iommu-per-device-domain-granularity`,
  `ddf-iommu-production-revoke-teardown-hostile-smokes`,
  `ddf-real-dma-iommu-direct-path`).
- **No-host-physical export discipline.** The IOVA-export discipline
  (`ddf-iommu-iova-export-discipline`) and the
  `host_physical_user_visible = false` / `iova_export = "disabled-future-only"`
  posture (`kernel/src/cap/dma_buffer.rs`) guarantee a driver receives only a
  device-usable address (bounce handle or domain-scoped IOVA), never a raw host
  physical address.

The accepted contract is `docs/dma-isolation-design.md` ("Cloud DMA Backend"
runtime-selection rule and the IOVA-export-discipline clause), and the S.11.2
hostile-smoke matrix is already enforced for both backends. The remaining slice-2
work is therefore to let the userspace driver allocate its vring through the
granted `DMAPool` (bounce, or IOVA-backed when the probe verifies usable
hardware), learn the device-usable address for each ring, and program those
addresses into the queue-address registers over the slice-1 writable window --
under the landed fail-closed / scrub / quiesce / revoke discipline. The
networking-proposal **Phase C prerequisites** table and S.11.2 are satisfied by
the landed track, not deferred to it.

## Bounded Slice Chain

1. **Userspace status + feature handshake to FEATURES_OK** over a writable
   selected-write common-config window (the next register in the accepted
   notify-doorbell selected-write discipline). **[DONE 2026-06-02.]** The
   `cap::devicemmio_grant_source_prod` source stages the virtio-net
   common-config window as a writable selected-write `DeviceMmio` grant
   (`stage_virtio_net_common_config`); the userspace shim drives the handshake
   over `DeviceMmio.read32`/`write32`, the write admission
   (`device_manager::stub::write_devicemmio_u32`) admits only the four handshake
   registers (range-checked + read-back-asserted) and refuses queue-address
   writes; the unimplemented `Nic` stub is in `schema/capos.capnp`. Proof
   `make run-cloud-prod-nic-driver-userspace-features-ok`. Task record:
   `docs/tasks/done/2026-06-02/cloud-prod-nic-driver-userspace-features-ok-local-proof.md`.
2. **Userspace-ownable vring + device-usable address export** -- reuses the
   landed production DMA isolation (bounce policy + `dma_backend` probe + IOMMU
   IOVA-export, S.11.2 already enforced); the work is wiring it to the driver's
   vring. **[DONE 2026-06-03.]** Under
   `cloud_virtio_net_userspace_ownable_vring_proof` (implies slice 1) the
   userspace shim co-receives the writable common-config `DeviceMmio` grant and a
   bounce-buffer `DMAPool` grant on the same virtio-net function; it allocates
   its descriptor / available / used ring pages, learns each buffer's opaque
   device-usable handle from `DMABuffer.info` (`deviceIova`, scope
   `bounce-handle`), and programs `queue_desc` / `queue_driver` / `queue_device`
   over the slice-1 window. `device_manager::stub::write_devicemmio_u32`
   (`admit_virtio_queue_address_write`) resolves each handle against the live
   `DMAPool` grant ledger (`resolve_virtio_vring_device_address`) to the real
   bounce host-physical address, programs that address (never the handle), and
   read-back-asserts; queue-address reads (`0x20..0x38`) are refused so the
   host-physical address is never exposed, and out-of-grant / host-physical /
   stale-generation writes fail closed. `queue_enable` stays fail-closed. Proof
   `make run-cloud-prod-nic-driver-userspace-ownable-vring`. Task record:
   `docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-ownable-vring-local-proof.md`.
3. **Userspace queue-program + DRIVER_OK** over the (now device-addressable)
   vring. **[DONE 2026-06-03.]** Under
   `cloud_virtio_net_userspace_queue_enable_driver_ok_proof` (implies slice 2)
   the userspace shim completes device bring-up: after slice 2's queue-address
   programming it writes `queue_enable = 1` (`0x1c`) for its programmed TX queue
   and sets `DRIVER_OK` over the already-writable device-status register.
   `device_manager::stub::write_devicemmio_u32` admits the `queue_enable` write
   only when the active queue's vring memory is live and page-fitting
   (`selected_queue_ready_to_enable`): it reads the active
   `queue_desc`/`queue_driver`/`queue_device` back kernel-side and requires each
   to currently hold the host-physical address of a **live** granted `DMABuffer`
   (a freed buffer's stale address cannot arm a use-after-free DMA target), and
   requires the active `queue_size` to fit every split-ring structure inside one
   granted bounce page; an enable of an unprogrammed, freed, or oversized queue
   fails closed, and the enable is read-back-asserted. Once enabled, the queue's
   vring base registers are immutable -- a queue-address repoint is refused so the
   driver cannot mutate the vring under a running device. The `DRIVER_OK`
   device-status write is kernel-asserted: the kernel re-reads device-status and
   fails closed unless the device latched the
   `ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK` byte **exactly** (rejecting
   `FAILED` and `DEVICE_NEEDS_RESET`).
   Queue-address reads stay refused; no host-physical is exposed; no new DMA
   isolation backend. Proof
   `make run-cloud-prod-nic-driver-userspace-queue-enable-driver-ok`. Task
   record:
   `docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-queue-enable-driver-ok-local-proof.md`.
4a. **Userspace RX queue 0 bring-up + buffer-identity binding + ring-buffer
   pinning**, then the **first real RX DMA** from the shim-owned vring. Split into
   two landed/ready sub-slices because the real-DMA hybrid bring-up is large:
   - **4a-i [DONE 2026-06-03].** The shim brings up **RX queue 0** over its own
     vring (slices 1-3 brought up only the TX queue; the `queue_enable` admission
     is queue-agnostic). `device_manager::stub` retains each programmed queue's
     vring physes + originating `DMABuffer` handle **identity** on
     `ProductionDeviceRecord` (`admit_virtio_queue_address_write`), binds
     `queue_enable` to that identity (a freed buffer's stale handle, or a
     freed-then-reallocated frame at the same host-physical address, fails closed
     with `devicemmio-queue-enable-identity-mismatch`), and **pins** the ring
     buffers against `freeBuffer` / process-teardown release while the queue is
     enabled (`dmabuffer-pinned-enabled-vring`), releasing only on disable/reset
     with quiesce. This completes the vring buffer-lifetime binding slice 3 left
     point-in-time at the bring-up boundary.
     No device DMA. Proof `make run-cloud-prod-nic-driver-userspace-rx-bringup`.
     Task record:
     `docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-bringup-and-buffer-pinning-local-proof.md`.
   - **4a-ii [DONE 2026-06-03].** The **first real RX DMA** from the shim vring:
     the shim also brings up **TX queue 1** over its own vring, posts one
     device-writable RX receive buffer on queue 0 (`DMABuffer.submitDescriptor`),
     and rings the production `DeviceMmio.notifyDoorbell @5` (the previously
     `Err(stale_handle)` `provider_notify_doorbell_write_for_cap`, now live;
     `cap::devicemmio_grant_source_prod` maps the notify region kernel-side and
     captures the per-queue notify slot offsets). The kernel
     (`cap::virtio_net_userspace_rx_dma_proof`) drives the RX publish over the
     shim's retained RX physes + a kernel-half SLIRP TX ARP stimulus over the
     shim's retained TX physes + one real device->host RX DMA (`used_len > 0`,
     observed EtherType `0x0806`), latches the used-ring index (`int_injected =
     0`, no `Interrupt` cap), and resets the device -- quiescing the queues and
     releasing the ring-buffer pins -- WITHOUT the `Nic` cap. Proof
     `make run-cloud-prod-nic-driver-userspace-rx-bringup` (extended). The
     deterministic freed-then-reallocated-frame identity negative is **split to a
     follow-up**: the next-fit frame allocator (`capos-lib` `FrameBitmap`,
     `free_frame` does not rewind `next_hint`) never returns a just-freed frame on
     the next allocation, so a deterministic same-phys realloc -- needed to reach
     the slice-4a identity gate rather than the slice-3 phys gate -- requires an
     allocator reuse seam. The data path is cooperative-shim-safe (kernel authors
     the descriptor + avail inside the drive window, re-validates the posted buffer
     live + unmapped before publishing, resets on every post-publish path); the
     HOSTILE-shim residuals (kernel-exclusive/unmappable enabled vring rings +
     reset-failure payload quarantine) are closed in 4a-iv; the identity negative
     is closed in 4a-iii. Task records:
     `docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-local-proof.md`;
     follow-ups
     `docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-identity-realloc-negative-local-proof.md`,
     `docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-hostile-shim-hardening-local-proof.md`.
   - **4a-iv [DONE 2026-06-03].** HOSTILE-shim hardening over the 4a-ii data
     path. The shim owns its vring ring buffers (granted bounce DMABuffers it can
     `DMABuffer.map`), so the kernel cannot trust ring contents it did not author
     and lock. The closed gaps: (1) a programmed/**enabled** vring ring buffer is
     made **kernel-exclusive** -- `validate_dmabuffer_map_admission` refuses a NEW
     `DMABuffer.map` of it while the queue is enabled, `selected_queue_identity_bound`
     refuses `queue_enable` while a ring buffer still carries a live user mapping
     (a kept pre-enable VMA) and refuses arming a queue whose ring buffers are not
     pairwise-distinct pages -- both within the queue and across the device's other
     enabled queues (an aliased desc/driver/device, or `rx.desc == tx.desc`, would
     let the kernel-authored ring writes corrupt a descriptor into a non-bounce DMA
     target), and `record_dmabuffer_user_mapping` refuses to record a mapping on a
     programmed/enabled ring (an in-flight SMP map), and the RX-DMA submit/drive
     admission refuses to publish an enabled ring buffer as the RX DMA payload (the
     shim cannot point the device at its own ring page); (2) at
     `queue_enable` the kernel WIPES the queue's descriptor table slot 0, available
     ring, AND used ring (`virtio_net_userspace_rx_dma_proof::sanitize_enabled_queue_rings`),
     so a shim that pre-wrote `avail.idx` / a tampered descriptor / a spoofed
     `used.idx` while the queue was disabled cannot pre-publish it into the enabled
     window -- the device sees an empty queue with no pre-staged completion until
     the kernel-authored drive publishes and a real device DMA advances `used.idx`;
     (3) a per-bdf RX-DMA
     payload drive pin / reset-failure quarantine (`device_manager::stub`
     `begin_rx_dma_drive_pin` / `clear_rx_dma_drive_pin` /
     `mark_rx_dma_payload_quarantine_permanent`, consulted by the `map` admission +
     the record path + the `freeBuffer`/teardown detach) is set atomically with the
     live+unmapped re-validation under the device-table lock for the drive duration,
     cleared after a confirmed reset, and promoted to a permanent quarantine on the
     catastrophic reset-failure path (never downgraded by a later drive). The smoke
     (`make run-cloud-prod-nic-driver-userspace-rx-bringup`, extended) proves a
     still-mapped ring blocks `queue_enable`, the post-enable map refusal on the
     descriptor + driver rings, and that a hostile pre-enable descriptor/`avail.idx`
     tamper does NOT survive (RX DMA still completes with `used_id=0`, ARP
     EtherType). The drive pin's SMP map/free race and the reset-failure quarantine
     are structural fail-closed hardening: the single-CPU cloudboot proof cannot
     reach the race and QEMU virtio reset always succeeds, so they are not
     separately QEMU-observable. **Residual [CLOSED 2026-06-03 in 4a-v]:** two
     SMP-only, bounce-confined races between `DMABuffer.map` and a device-authority
     state transition -- the cap-side `map_page_into_user` installed the user PTE
     before the manager-side record (a manager-rejected map left a transient
     provisional PTE), and the `queue_enable` no-live-mapping check was not atomic
     with the retained `enabled` flag flip (a concurrent map could record a mapping
     in the enable window) -- were not reachable by the single-CPU proof and are
     closed by slice 4a-v below. Predecessor task record:
     `docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-hostile-shim-hardening-local-proof.md`.
   - **4a-iii [DONE 2026-06-03].** The deterministic
     freed-then-reallocated-frame `queue_enable` identity negative carved out of
     4a-ii. A **proof-only** one-shot bounce-frame reuse seam
     (`mem::frame::proof_try_alloc_specific_frame_zeroed` consumed by a
     `device_dma` reuse hint armed on each bounce free, both gated behind
     `cloud_virtio_net_userspace_rx_bringup_proof`, never compiled into
     production) makes a same-host-physical `DMABuffer` realloc reachable from the
     userspace harness despite the production next-fit `FrameBitmap`. The smoke
     programs a transient ring buffer into rx `queue_desc`, frees it, relands a
     fresh-handle buffer on the same frame, and proves `queue_enable` fails closed
     on the slice-4a identity gate (`authority_result =
     devicemmio-queue-enable-identity-mismatch`) -- the recorded phys is live again
     so the slice-3 phys gate passes, and the marker flips
     `identity_realloc_negative=enforced`. The seam does not relax the next-fit
     policy or the identity gate. Proof
     `make run-cloud-prod-nic-driver-userspace-rx-bringup` (extended). Task record:
     `docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-identity-realloc-negative-local-proof.md`.
   - **4a-v [DONE 2026-06-03].** Closes the two SMP-only, bounce-confined
     `DMABuffer.map`-vs-device-authority races split out of 4a-iv. (1)
     `kernel/src/cap/dma_buffer.rs::map_page_into_user` now takes the manager-side
     mapping record (`record_dmabuffer_user_mapping`, which acquires the
     device-table lock) BEFORE installing the user PTE, with the caller
     address-space lock held across the manager call (lock order address-space ->
     device-table -> quarantine, no reverse nesting): a manager-rejected map returns
     before any PTE is installed, so a concurrent SMP thread can never touch a
     transient provisional mapping at the deterministic auto-picked base. (2)
     `kernel/src/device_manager/stub.rs::try_enable_selected_queue` folds the
     `queue_enable` no-live-mapping identity check and a retained `enable_pinned`
     commit pin into one device-table critical section, set before the MMIO arm
     write and rolled back on a readback mismatch, so a concurrent map cannot record
     a mapping in the enable window. The commit pin is kept distinct from the
     device-armed `enabled` bit (set after the MMIO write): map admission /
     `freeBuffer` pinning consult `enabled || enable_pinned`, while the RX-DMA drive
     gates on `enabled` alone, so a concurrent drive cannot run -- and reset-clear
     the retained state -- inside the arm window. A per-queue `enable_in_progress`
     claim serializes concurrent `queue_enable` transitions (enable vs enable, enable
     vs disable) so a racing `queue_enable = 0` can never clear an in-flight enable's
     commit pin, and the drive's reset cleanup skips clearing while a transition is in
     flight so a stale drive cannot clear a newer enable's pin. The two transitions
     now serialize on the device-table lock: one fails closed. The interleavings are not single-CPU-reachable and the
     live kernel statics are not host-testable, so the proof is an exhaustive Loom
     interleaving model (`capos-config/tests/dmabuffer_map_enable_loom.rs`, run via
     `cargo test-dmabuffer-map-enable-loom`) that asserts no schedule arms a queue
     with a live user PTE on its ring buffer or installs a PTE without an accepted
     record; the single-CPU `make run-cloud-prod-nic-driver-userspace-rx-bringup`
     regression continues to pass unchanged. **Residual [CLOSED 2026-06-03]:** the
     follow-up fenced the drive's post-reset cleanup on a per-queue transition
     **generation**. `RetainedVringQueue::generation` is bumped on each completed
     transition (`mark_selected_queue_armed` enable arm + `finish_selected_queue_disable`
     disable finish); the drive reads it (`retained_queue_generation`) IMMEDIATELY
     before `reset_device` -- the reset boundary, not the far-earlier `enabled`
     sample -- and the cleanup (`mark_retained_vring_queue_disabled_if_epoch`) clears
     the pins only while the queue is still in the epoch the reset quiesced. A
     disable + re-enable that fully completes AFTER the reset advances the
     generation, so the cleanup becomes a no-op instead of clearing the freshly
     armed epoch's pins; one that completed BEFORE the reset is included in the
     captured generation, so its pins clear (no stale over-pin). The only residual
     is a transition completing in the tiny read->reset-MMIO gap whose arm loses the
     race: a fail-closed over-pin, never an under-pin. The Loom model gained
     `fenced_stale_drive_cannot_clear_a_completed_re_enables_pins`, which fails with
     the fence removed (`GENERATION_FENCE = false`). Task record:
     `docs/tasks/done/2026-06-03/cloud-prod-nic-driver-queue-enable-drive-reset-epoch-fence-local-proof.md`.
     Predecessor (4a-v):
     `docs/tasks/done/2026-06-03/dmabuffer-map-record-before-pte-install-ordering-local-proof.md`.
4b. **`Nic`-cap driver process, coupled TX/RX round-trip [DONE 2026-06-03].**
   Implements the slice-1 `Nic` interface stub as a live `CapObject`
   (`kernel/src/cap/nic_grant_source_prod.rs`, granted via the new `nic`
   `KernelCapSource` registered in `capos-config`; client `NicClient` in
   `capos-rt`). Over the SAME shim-brought-up device (RX queue 0 + TX queue 1
   enabled by slices 1-3 + 4a), the cap drives the shim's retained vring physes
   through `virtio_net_userspace_rx_dma_proof::{nic_transmit, nic_receive,
   nic_quiesce}`: `receive()` internally drives the coupled ARP-TX-stimulus +
   RX-poll and returns the received frame inline plus the observed EtherType;
   `transmit()` stages a frame into a manager-owned TX bounce page over the
   retained TX vring and rings the doorbell; `macAddress()`/`linkStatus()` read
   the kernel-mapped virtio-net device-config region. Frames cross the cap
   boundary as inline `Data` copied through **manager-owned** kernel bounce pages
   (`host_physical_user_visible = 0`; no host-physical / device-handle exposure);
   the device is left live for the cap's lifetime and quiesced once on cap
   release (`nic_quiesce`: reset + queues-cleared + release the enabled-vring
   pins). Completion stays kernel-latched used-ring polled (`int_injected = 0`,
   no `Interrupt` cap). The clean independent TX/RX split is deferred to slice 6;
   userspace IRQ ownership to slice 5. Proof
   `make run-cloud-prod-nic-driver-userspace-nic-cap-roundtrip` boots the device
   from userspace, round-trips two sequential frames through the typed `Nic` cap
   (observed EtherType `0x0806` over QEMU SLIRP), and emits one
   `cloudboot-evidence: nic-driver-userspace-nic-cap-roundtrip <token>` marker
   with `roundtrips=2`. The same proof releases the parent `DMAPool` cap and one
   pinned ring `DMABuffer` cap before `Nic` release, then shows `Nic` quiesce
   replaying the blocked buffer detach and the pending parent pool detach
   completing after the remaining ring buffers are freed.
   Depends on 4a-ii (the shim-owned-vring data path). Task record:
   `docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-nic-cap-roundtrip-local-proof.md`.
5. **Userspace IRQ ownership [DONE 2026-06-03].** The userspace NIC driver's
   `Interrupt` cap for its device RX route becomes **real**, replacing slice-4b's
   kernel-latched used-ring polled completion (`int_injected = 0`). Under
   `cloud_virtio_net_userspace_irq_ownership_proof` (implies slice 4b) a new
   `Interrupt` grant source (`kernel/src/cap/virtio_net_userspace_irq_ownership_proof.rs`,
   replacing the admission-only `interrupt_grant_source_prod` source via the
   `KernelCapSource::Interrupt` arm) programs the staged virtio-net function's RX
   MSI-X route (entry 0) mask-first through the landed always-built
   `cap::interrupt_programmed::program_attach_arm_unmask` and issues a cap whose:
   `wait` blocks on a real interrupt dispatch through the route's MSI-X / LAPIC
   dispatch slot (`device_interrupt::wait_kernel_injected_dispatch`;
   `delivery_count` advances, so `int_injected` flips from 0 -- slice 4b had no
   `Interrupt` cap on the data path at all). The wake is a bounded kernel-injected
   dispatch through the route's real deferred-EOI machinery, **not yet** a
   device-autonomous MSI-X raise causally tied to a specific frame; `Nic.receive`
   still reads the frame bytes from the used ring, so this slice delivers
   IRQ-lifecycle *ownership* (the driver drives the real `wait`/`acknowledge`/
   `mask`/`unmask`), not interrupt-coalesced RX completion. `acknowledge` retires
   exactly one deferred LAPIC EOI through
   `device_interrupt::acknowledge_deferred_lapic_eoi_for_route`
   (`hardwareDispatchAckDelta = 1`); and `mask`/`unmask` toggle the route's own
   MSI-X vector-control bit (mask-first per PCI 3.0 §6.8.2) plus the
   manager-attached route state, through `pci::set_msix_table_entry_mask` +
   `device_interrupt::{mask,unmask}_device_manager_attached_route`. The route is
   torn down (`interrupt_programmed::teardown`) on cap release. The driver holds
   this `Interrupt` cap alongside the slice-4b `Nic`/`DeviceMmio`/`DMAPool` caps on
   the same function; it brings the device up from userspace, then drives the owned
   RX-interrupt lifecycle and reads the completed frame back through `Nic.receive`.
   The PCI function-level MSI-X enable bit is **not** toggled and no
   device-autonomous raise is attempted (`device_autonomous_raise=not-attempted`,
   `waiter_wake=kernel-injected-dispatch`); the landed DMA isolation, the slice-2/3 vring
   grants, and the buffer-identity / ring-buffer pinning are reused unchanged
   (`host_physical_user_visible = 0`, queue-address reads still refused). No new
   `Interrupt` interface or method (the existing `wait`/`acknowledge`/`mask`/
   `unmask` become real for this route). Proof
   `make run-cloud-prod-nic-driver-userspace-irq-ownership` emits one
   `cloudboot-evidence: nic-driver-userspace-irq-ownership <token>` marker. Task
   record:
   `docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-irq-ownership-local-proof.md`.
6. **Clean TX/RX split [DONE 2026-06-03].** Decouples the coupled receive path
   into independent TX and RX submission. Under
   `cloud_virtio_net_userspace_clean_tx_rx_split_proof` (implies slice 5) the
   `Nic` cap's `receive @1` dispatches to the new
   `virtio_net_userspace_rx_dma_proof::nic_receive_independent` instead of the
   coupled `nic_receive`: it posts a manager-owned RX receive buffer on the
   retained RX vring, **waits on the driver's OWNED RX interrupt route** (the
   slice-5 `device_interrupt::wait_kernel_injected_dispatch` dispatch slot,
   resolved through `virtio_net_userspace_irq_ownership_proof::owned_rx_route`),
   retires the deferred LAPIC EOI, and reads the completed frame from the RX used
   ring -- with **no internal ARP-TX self-stimulus** (it never submits to the TX
   vring). The RX frame is driven by an **external** stimulus: the consumer's
   preceding independent `Nic.transmit` of a real broadcast ARP request, which
   QEMU SLIRP answers; the inbound reply is held in the host net queue until the
   RX buffer is posted. `Nic.transmit` stays independent (submits the caller's
   frame and rings the TX doorbell with no RX poll; surfaces `rx_polls=0`). The
   wake stays the bounded kernel-injected dispatch slice 5 owns
   (`waiter_wake=kernel-injected-dispatch`, `device_autonomous_raise=not-attempted`).
   Reuses the landed owned-vring / owned-IRQ / DMA-isolation unchanged: no new
   selected-write register, no new MSI-X surface, no new `Nic`/`Interrupt` method
   (`make generated-code-check` green), no host-physical / handle exposure
   (`host_physical_user_visible = 0`, queue-address reads refused). The driver
   does an independent `transmit` then a separate independent `receive`, neither
   performing the other's submission (`tx_independent=ok`, `rx_independent=ok`,
   `receive_self_stimulus=removed`). Proof
   `make run-cloud-prod-nic-driver-userspace-clean-tx-rx-split` emits one
   `cloudboot-evidence: nic-driver-userspace-clean-tx-rx-split <token>` marker.
   Task record:
   [`cloud-prod-nic-driver-userspace-clean-tx-rx-split-local-proof`](../tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-clean-tx-rx-split-local-proof.md).
7. **Network-stack process + smoltcp relocation** -- a second userspace process
   holding the `Nic` cap and a bounded time source, running `smoltcp`,
   implementing the socket caps while preserving the `cap/network.rs` contract.
   Slice 6 is now done, so this slice is **decomposed into bounded increments**
   rather than attempted as one step:
   - **7a (first increment, DONE 2026-06-03): network-stack-process skeleton.** A
     userspace process runs a minimal `smoltcp` `Interface` over a `phy::Device`
     adapter backed by the landed independent `Nic.transmit`/`Nic.receive`
     (slice 6), clocked by a `Timer` cap, and drives one observable Ethernet
     exchange through SLIRP: `smoltcp` -- not hand-rolled frame code -- ARPs the
     gateway out through `Nic.transmit`, consumes the reply in through
     `Nic.receive`, and emits the queued IPv4/UDP datagram, so the neighbour cache
     observably advances. No socket caps, no `cap/network.rs` relocation,
     `virtio_stub.rs` unchanged. Proof
     `make run-cloud-prod-network-stack-process-smoltcp-skeleton`.
     **Implementation note:** the landed `Nic` cap is not yet self-sufficient --
     its `transmit`/`receive` ride on the userspace driver shim's *retained vring*
     (the kernel does **not** own the vring), so the skeleton process performs the
     slice-1-6 bring-up itself (it also holds the `DeviceMmio`/`DMAPool`/`Interrupt`
     caps) before running `smoltcp`. Splitting the bring-up into a separate
     long-lived NIC-driver service, so the network-stack process holds only `Nic` +
     `Timer` + `Console`, is folded into the 7c contract-relocation increment; it
     does not change the proven smoltcp-substrate claim. Task record:
     [`cloud-prod-network-stack-process-smoltcp-skeleton-local-proof`](../tasks/done/2026-06-03/cloud-prod-network-stack-process-smoltcp-skeleton-local-proof.md).
   - **7b (socket layer, DONE 2026-06-03): socket caps over the userspace
     smoltcp stack** -- a userspace `UdpSocket` cap layer (`UdpSocketCapLayer`)
     implements the `UdpSocket` schema's `sendTo`/`recvFrom` semantics over the
     7a `Interface` and proves one bounded UDP request/response: a DNS A query
     for `example.com` to SLIRP's resolver at `10.0.2.3:53` via `sendTo`, then
     the decoded response via `recvFrom`. smoltcp drives every frame through the
     `Nic` cap (ARP reply + DNS reply both fetched through `Nic.receive`,
     `host_physical_user_visible = 0` preserved); `Timer` clocks the poll. The
     socket layer is **in-process** -- it does not yet serve the socket
     interfaces as inter-process transferable capabilities, and it does not
     touch `cap/network.rs` (`virtio_stub.rs` stays fail-closed). Proof
     `make run-cloud-prod-network-stack-smoltcp-socket-caps` (one
     `cloudboot-evidence: network-stack-smoltcp-socket-caps <token>` marker).
     Task record:
     [`cloud-prod-network-stack-smoltcp-socket-caps-local-proof`](../tasks/done/2026-06-03/cloud-prod-network-stack-smoltcp-socket-caps-local-proof.md).
     Serving the socket interfaces as inter-process caps (a `NetworkManager`-like
     broker) and `TcpListener`/`TcpSocket` are folded into the 7c contract
     relocation.
   - **7c (contract relocation): preserve the `cap/network.rs` contract** behind
     the userspace network-stack process so the production L4 entry points
     (`virtio_stub.rs`) stop returning `DeviceUnavailable` for the armed
     manifest. This is the body of the whole-slice record
     ([`cloud-prod-userspace-network-stack-smoltcp-local-proof`](../tasks/done/2026-06-07/cloud-prod-userspace-network-stack-smoltcp-local-proof.md)),
     **decomposed** because it bundles three independently-large pieces:
     - **7c-i (inter-process socket cap, DONE 2026-06-03):** serve the slice-7b
       userspace `UdpSocketCapLayer` as a real inter-process transferable
       capability. A network-stack **server** process brings the device up,
       builds the userspace `smoltcp` `UdpSocket` layer, and serves the
       `UdpSocket` schema (`sendTo`/`recvFrom`/`close`) over an exported
       `Endpoint`; a separate **client** process re-interprets the served cap as a
       `UdpSocket` and drives one bounded DNS A query/response through the
       production `UdpSocketClient`, with `smoltcp` still moving every frame
       through the `Nic` cap (`host_physical_user_visible = 0`). Proof
       `make run-cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc`. This is the
       prerequisite for the kernel contract relocation. `cap/network.rs` /
       `virtio_stub.rs` are unchanged. Task record:
       [`cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc-local-proof`](../tasks/done/2026-06-03/cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc-local-proof.md).
     - **7c-ii (`cap/network.rs` relocation, DONE 2026-06-07):** route the armed Phase C
       manifest's production L4 socket entry points to the userspace network-stack
       service so the local proof no longer reaches `virtio_stub.rs`
       `DeviceUnavailable`. This is itself decomposed (see
       "7c-ii Mechanism and Decomposition" below) into
       **7c-ii(a)** -- serve `TcpListener`/`TcpSocket` as inter-process caps with
       accept-returns-a-result-cap, an architecture-agnostic prerequisite
       ([`cloud-prod-network-stack-smoltcp-tcp-socket-cap-ipc-local-proof`](../tasks/done/2026-06-04/cloud-prod-network-stack-smoltcp-tcp-socket-cap-ipc-local-proof.md))
       -- and **7c-ii(b)** -- the local serve-from-userspace proof. In 7c-ii(b)
       (DONE 2026-06-07), the network-stack process boots the non-`qemu`
       cloudboot kernel, spawns an application client with only `Console` plus a
       userspace-served `TcpListenAuthority`, returns a userspace-served
       `TcpListener` from `listen`, returns a `TcpSocket` result cap from
       `accept`, and completes one hostfwd TCP request/response through
       `recv`/`send`. The selected architecture remains **serve-from-userspace**:
       applications consume the existing typed socket interfaces from a
       userspace network-stack process instead of extending the legacy
       kernel-routed socket owner.
     - **7c-iii (`TcpListener`/`TcpSocket`, DONE 2026-06-04):** the userspace
       `smoltcp` stack now serves a real `TcpListener`/`TcpSocket` round trip over
       the `Nic` cap, using the sustained-receive `Nic.receivePoll @4` (slice 7d)
       for the multi-frame TCP exchange. A single cloudboot service brings the
       device up from userspace, runs a `smoltcp` TCP socket listening on port
       8080 driven by the non-resetting `receivePoll @4` pump, and -- against an
       external host TCP client over a QEMU `hostfwd` relay -- completes one
       bounded TCP handshake + request/response (asserting the received request
       equals the expected probe and echoing it back). `smoltcp`, not hand-rolled
       frame code, moves every frame; the device stays armed across the
       SYN/SYN-ACK/ACK/request/response/FIN exchange with no per-frame reset.
       `host_physical_user_visible = 0`; queue-address reads refused; the bounce
       RX pool quiesces + scrubs on teardown. This increment proved the generic
       TCP socket substrate **in-process**; 7c-ii(a) later served
       `TcpListener`/`TcpSocket` inter-process with accept-returns-a-result-cap.
       Neither increment changes `cap/network.rs` / `virtio_stub.rs`; that final
       production-manifest wiring is 7c-ii(b). Proof
       `make run-cloud-prod-network-stack-smoltcp-tcp-listener-roundtrip`. Task
       record:
       [`cloud-prod-network-stack-smoltcp-tcp-listener-roundtrip-local-proof`](../tasks/done/2026-06-04/cloud-prod-network-stack-smoltcp-tcp-listener-roundtrip-local-proof.md).
       Built on the prerequisite
       [`cloud-prod-nic-driver-userspace-sustained-receive-pool-local-proof`](../tasks/done/2026-06-04/cloud-prod-nic-driver-userspace-sustained-receive-pool-local-proof.md).
8. **Kernel `smoltcp` / virtio-net removal (Phase C exit)** -- **done
   2026-06-08.** The kernel no longer depends on `smoltcp`, and the qemu-only
   `cap/network.rs` socket entry points now fail closed instead of reaching an
   in-kernel TCP/UDP runtime. The retained virtio-net code is a lower-layer QEMU
   fixture for PCI/MMIO/virtqueue, ARP, ICMP, and descriptor-generation proofs;
   it is not the production cloud socket owner. Task record:
   [`cloud-prod-phase-c-kernel-smoltcp-virtio-net-removal`](../tasks/done/2026-06-08/cloud-prod-phase-c-kernel-smoltcp-virtio-net-removal.md).

Each slice is parented to the appropriate predecessor; slices 7-8 deliver the L4
socket reachability the scope decision sequenced after the cloud milestone.

### 7c-ii Mechanism and Decomposition

7c-ii is "preserve the `cap/network.rs` contract behind the userspace
network-stack process." Before the serve-from-userspace path landed, the
non-`qemu` cloud kernel could still grant production L4 methods
(`NetworkManager.createTcpListener`, `TcpListenAuthority.listen`,
`TcpListener.accept`, `TcpSocket.send`/`recv`, the UDP methods) through
kernel-side `CapObject`s in `kernel/src/cap/network.rs` that call
`crate::virtio::*`. In that build, `crate::virtio` resolves to
`kernel/src/virtio_stub.rs`: creation entry points return
`NetworkError::DeviceUnavailable`, while existing-handle operations fail closed
with invalid-handle errors. Before Phase C exit cleanup, the working `smoltcp`
runtime existed only in the `cfg(qemu)` `kernel/src/virtio.rs`; it has now been
removed, and the qemu socket entry points match the same fail-closed shape.
Relocating the contract means production L4 methods are satisfied by the
userspace network-stack service instead of the stub, and non-`qemu` bootstrap
grants for the legacy kernel `NetworkManager` and `TcpListenAuthority` sources
now fail closed before those `CapObject`s are minted.

**Mechanism constraint.** A literal reading -- the kernel keeps serving the L4
caps and *forwards* each call to the userspace service -- requires the kernel to
originate a capability `Call` to a userspace-served `Endpoint` and complete the
original caller once the service returns. That kernel-as-client-of-a-userspace-
service path does **not** exist today: the ring/endpoint machinery
(`kernel/src/cap/ring.rs`, `kernel/src/cap/endpoint.rs`,
`kernel/src/cap/transfer.rs`) only dispatches SQEs *from* a userspace process's
ring; the kernel never enqueues a `Call` against a userspace endpoint nor parks
waiting for the `Return`. Building that inversion is a new kernel IPC subsystem,
not a drive-by, and it adds kernel surface that the Phase C exit deliberately
avoids by retiring the kernel L4 owner instead.

#### 7c-ii(b) Architecture Decision: Serve From Userspace

On 2026-06-07 the operator selected **serve-from-userspace** for the final
production-manifest wiring. The armed Phase C manifest receives a
userspace-served `NetworkManager` or `TcpListenAuthority` cap from the
network-stack service. The legacy kernel `cap/network.rs` / `virtio_stub.rs`
socket path is now fenced to `cfg(qemu)` fixture manifests and stale negative
paths: non-`qemu` manifests that request kernel `network_manager` or
`tcp_listen_authority` are rejected during bootstrap, so missing served
authority does not fall back to the old kernel socket owner.

The rejected alternative for this stage is **kernel-brokered forwarding**:
keeping the kernel as the socket cap server while it calls into a registered
userspace network-stack service. That route would add a new kernel-originated
Endpoint/deferred-completion subsystem. The selected direction keeps the
microkernel boundary cleaner by moving behavior to userspace where that does not
compromise security.

Grounding for this decision:

- `kernel/src/cap/network.rs` implements the current kernel-served
  `NetworkManager`, `TcpListenAuthority`, `TcpListener`, `TcpSocket`, and
  `UdpSocket` objects, including result-cap transfer for listener and socket
  creation.
- `kernel/src/virtio_stub.rs` remains the non-`qemu` negative-result endpoint
  for stale kernel networking call sites, but bootstrap no longer grants its
  `NetworkManager` or `TcpListenAuthority` callers in production manifests.
- `kernel/src/cap/ring.rs`, `kernel/src/cap/endpoint.rs`, and
  `kernel/src/cap/transfer.rs` implement userspace-originated `Endpoint` calls,
  receive/return, and capability transfer. They do not currently implement a
  kernel-originated call into a userspace endpoint.
- The 7c-i and 7c-ii(a) task records prove userspace-served socket caps over
  the existing `Endpoint` and RETURN result-cap path:
  [`cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc-local-proof`](../tasks/done/2026-06-03/cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc-local-proof.md)
  and
  [`cloud-prod-network-stack-smoltcp-tcp-socket-cap-ipc-local-proof`](../tasks/done/2026-06-04/cloud-prod-network-stack-smoltcp-tcp-socket-cap-ipc-local-proof.md).
- `docs/research/capability-systems-survey.md` and
  `docs/research/spritely-captp-ocapn.md` reinforce the local capOS rule used
  here: explicit object references are authority, and forwarding/proxying should
  make authority flow and lifetime explicit rather than hiding it behind ambient
  names.

Current constraints that both options must preserve:

- **Existing caller contract.** `NetworkManager.createTcpListener`,
  `TcpListenAuthority.listen`, `TcpListener.accept`, `TcpSocket.send`/`recv`,
  and the UDP methods keep the same typed Cap'n Proto surfaces unless a later
  implementation task explicitly changes schema and regenerated bindings.
- **Manifest authority.** Only the armed Phase C manifest receives the new
  production L4 path. Missing registration, missing served cap, stale service
  identity, or an unarmed manifest fails closed instead of falling back to a
  broad kernel escape hatch; the legacy kernel socket sources are qemu-only
  fixtures.
- **No new packet authority.** The socket service consumes the existing `Nic`
  cap and the landed `receivePoll @4` sustained-receive path. This decision does
  not add DMA, MMIO, IRQ, queue-address, host-physical, GCE, public ingress, TLS,
  or certificate authority.
- **Private/local Web UI critical path.** The first Web UI proof remains local
  and then private GCE reachability. Public ingress, TLS, firewall/DNS exposure,
  and live-cloud runs stay gated by their separate task records.

| Axis | Kernel-brokered forwarding (rejected for 7c-ii(b)) | Serve-from-userspace (selected) |
|---|---|---|
| Cap server identity | The kernel keeps serving `cap/network.rs` objects for the armed manifest and forwards each socket operation to a registered userspace network-stack service. | The armed manifest receives a userspace-served `NetworkManager` or `TcpListenAuthority` cap from the network-stack service, following the 7c-i / 7c-ii(a) serving pattern. |
| Kernel IPC delta | Adds a kernel-as-client forwarding path: service registration, endpoint identity validation, kernel-originated call construction, transfer handling, cancellation, and deferred completion back to the original caller. | Adds no kernel-originated endpoint call path. The existing userspace caller-to-userspace endpoint path carries the socket methods and result caps. |
| ABI and IPC risk | Higher. The existing ring/endpoint code accepts userspace SQEs and endpoint returns; it does not yet encode the kernel as an endpoint caller. The implementation must specify caller-session metadata, transfer rollback, service disappearance, cancellation, and result-cap insertion semantics for forwarded calls. | Lower kernel ABI risk. Schema can remain unchanged if the served cap implements the current socket interfaces. Risk shifts to manifest grant wiring, service startup ordering, endpoint lifetime, and making stale or missing service authority fail closed. |
| Production-surface compatibility | Preserves the literal "kernel-routed `cap/network.rs` surface" for the armed manifest, which minimizes caller-visible routing change. | Makes the production socket cap userspace-served for the armed manifest. Callers still use the same typed socket interfaces, but the authority source is the network-stack service rather than `cap/network.rs`. |
| Fail-closed behavior | Requires a bounded registration table and an explicit "no registered service" error path. Forwarding must not silently reach `virtio_stub.rs` except as a deliberate unarmed-manifest failure mode. | Uses manifest-level grant selection and service liveness checks. If the served cap is absent, stale, or not granted, the non-`qemu` manifest fails closed; `virtio_stub.rs` remains only a stale-call/fixture negative path. |
| Validation burden | Must prove forwarding with at least one socket operation, result-cap transfer through the forwarded path, service crash/cancel cleanup, missing-service failure, and no kernel-thread parking. Before Phase C exit, `make run-net` covered the old qemu-only socket path. | Must prove the armed manifest gets the served network cap, completes the socket round trip through the userspace service, rejects an unarmed or missing-service manifest, and preserves the existing inter-process socket-cap proofs. Phase C exit keeps only lower-layer QEMU virtio-net fixture coverage after the old kernel L4 owner is removed. |
| Phase C exit interaction | Leaves a new kernel forwarding subsystem after kernel `smoltcp` and virtio-net hot-path removal. Slice 8 would have had to decide whether that subsystem was permanent generic IPC infrastructure or Phase C-specific scaffolding. | Aligns with the Phase C exit direction that kernel networking becomes routing/grant setup while L4 service behavior lives out of kernel. Slice 8 removed kernel `smoltcp`/virtio-net L4 ownership and stale socket paths. |

Follow-up task-state changes after the selection:

- [`cloud-prod-userspace-network-stack-smoltcp-local-proof`](../tasks/done/2026-06-07/cloud-prod-userspace-network-stack-smoltcp-local-proof.md)
  landed the local serve-from-userspace proof: manifest grant wiring,
  network-stack service startup/liveness, served `TcpListenAuthority` authority,
  and the final socket round-trip proof. No kernel-originated `Endpoint` call
  task is implied by this choice.
- [`cloud-prod-legacy-kernel-network-socket-path-retirement`](../tasks/done/2026-06-08/cloud-prod-legacy-kernel-network-socket-path-retirement.md)
  retires the armed-manifest route through the legacy kernel socket owner by
  rejecting non-`qemu` kernel `network_manager` / `tcp_listen_authority` grants,
  preserving userspace-served authority as the production path.
- [`cloud-prod-phase-c-kernel-smoltcp-virtio-net-removal`](../tasks/done/2026-06-08/cloud-prod-phase-c-kernel-smoltcp-virtio-net-removal.md)
  is done. It removes the kernel `smoltcp` dependency, retires the qemu-only
  kernel TCP/UDP runtime behind fail-closed socket entry points, and documents
  the remaining virtio-net coverage as lower-layer QEMU fixture evidence rather
  than production cloud socket ownership.
- [`network-socket-terminal-session-userspace-migration`](../tasks/done/2026-06-10/network-socket-terminal-session-userspace-migration.md)
  is the existing follow-up for moving kernel-owned socket terminal byte
  handling out of `kernel/src/cap/network.rs`.
- Keep
  [`cloud-prod-network-stack-dhcp-ipv4-config-local-proof`](../tasks/done/2026-06-08/cloud-prod-network-stack-dhcp-ipv4-config-local-proof.md)
  as the done local DHCP/IPv4 config proof feeding
  [`cloud-prod-remote-session-web-ui-l4-local-proof`](../tasks/done/2026-06-09/cloud-prod-remote-session-web-ui-l4-local-proof.md),
  and
  [`cloud-gce-private-self-hosted-webui-proof`](../tasks/on-hold/cloud-gce-private-self-hosted-webui-proof.md)
  while their remaining existing prerequisites land. Keep public ingress and
  TLS under
  [`cloud-gce-public-self-hosted-webui-ingress-tls`](../tasks/on-hold/cloud-gce-public-self-hosted-webui-ingress-tls.md)
  until separately authorized.

This decision unblocks 7c-ii(b) implementation. It preserves the fact that
7c-ii(a) was sequenced first because serving `TcpListener`/`TcpSocket` as
inter-process caps is useful under either architecture, and it now becomes the
direct serving substrate for the selected path.

**Result-cap transfer needs no new ABI.** The RETURN transfer-descriptor ABI
that lets a served method return a freshly-minted capability already exists: a
`CAP_OP_RETURN` carries `xfer_cap_count` transfer descriptors, the kernel inserts
them into the caller's table and publishes `CAP_CQE_TRANSFER_RESULT_CAPS`
(`kernel/src/cap/ring.rs::dispatch_return` -> `insert_prepared_transfer_caps` ->
`write_endpoint_return_result`), and the client decodes the returned cap with
`capos-rt`'s `CompletedCall::result_cap`. The 7c-i UDP server already serves caps
over an exported `Endpoint` through exactly this RETURN path, so the
userspace-`Endpoint`-server side is proven; the kernel's own
`cap/network.rs::handle_accept` (via `insert_socket_result_cap`) and the
`telnet-gateway` demo (`listener.accept_wait`; both since retired) proved the complementary
accept-returns-a-`TcpSocket`-cap *shape* and its client-consume side. What
7c-ii(a) adds is wiring -- a userspace `Endpoint` server returning a `TcpSocket`
result-cap from `TcpListener.accept` -- not a new ABI. Its hard part is the
smoltcp-pump-while-serving interleaving for a blocking multi-frame `accept`,
which is why 7c-iii deferred inter-process TCP serving.

Downstream self-hosted Web UI tasks consume slice 7 without making the Phase C
exit cleanup a blocker for first GCE operator proof:

- [`remote-session-self-served-full-ui-bundle`](../tasks/done/2026-06-04/remote-session-self-served-full-ui-bundle.md)
  is done and provides the reviewed boot-resource UI bundle.
- [`cloud-prod-remote-session-web-ui-l4-local-proof`](../tasks/done/2026-06-09/cloud-prod-remote-session-web-ui-l4-local-proof.md)
  proves `remote-session-web-ui` over the Phase C L4 path locally after the
  done DHCP/IPv4 configuration proof.
- [`cloud-gce-private-self-hosted-webui-proof`](../tasks/on-hold/cloud-gce-private-self-hosted-webui-proof.md)
  proves the Web UI privately on GCE over the live NIC.
- [`cloud-gce-public-self-hosted-webui-ingress-tls`](../tasks/on-hold/cloud-gce-public-self-hosted-webui-ingress-tls.md)
  is the separate public ingress, TLS, and reviewed-auth posture step.

Network usability and post-smoltcp follow-ups are decomposed in
[`network-usability-post-smoltcp.md`](../backlog/network-usability-post-smoltcp.md).
They do not change the Phase C critical path above: the local DHCP/IPv4
configuration proof is now done for the first GCE Web UI path, while the system
`DnsResolver` cap, POSIX `getaddrinfo`, ping/ping6 tools, packet tracing,
socket readiness policy, and transport tuning/status are follow-on usability or
diagnostics lanes.

Slices 7a-7c are **smoltcp relocation**: they run the selected `smoltcp 0.13.0`
build in userspace and preserve the socket contract, adding no new transport
mechanic. **Transport policy/status** — read-only transport status, keepalive/
timeout policy inputs, and the deferred congestion-control evaluation — is a
distinct control-plane lane decomposed in the backlog under
[`network-transport-policy-status-decomposition`](../tasks/done/2026-06-03/network-transport-policy-status-decomposition.md),
not part of the relocation slices. The userspace relocation track now has
landed UDP and TCP substrate proofs, and the TCP build still runs with
`CongestionControl::None` by build configuration; selecting Reno/CUBIC is a
build-feature flip, and any custom TCP mechanic requires separate workload
evidence and a reviewed task.

## The Sustained-Receive `Nic` ABI (Former Prerequisite For 7c-iii)

7c-iii (`TcpListener`/`TcpSocket`) was blocked on the shape of the landed
`Nic.receive`. This section records the precise constraint, the
sustained-receive primitive that lifted it while keeping the settled DMA
isolation intact, and the ABI decision.

**Status: landed.** The `Nic.receivePoll @4` method and the kernel-owned bounce
RX pool primitive designed below shipped in slice 7d
(`cloud-prod-nic-driver-userspace-sustained-receive-pool-local-proof`):
`cap::virtio_net_userspace_rx_dma_proof::nic_receive_poll` arms a pool of
`NIC_RX_POOL_SIZE` manager-owned bounce RX buffers and recycles them
individually (copy-out + scrub + slot-generation bump + re-post) with no
per-frame device reset; the `receivePoll @4` dispatch arm lives in
`cap::nic_grant_source_prod`. The multi-frame proof
(`make run-cloud-prod-nic-driver-userspace-sustained-receive-pool`) drains more
than one frame with at least one non-resetting `framePresent = false` poll and
keeps the DMA-isolation assertions green (`host_physical_user_visible = 0`,
queue-address reads refused, quiesce + scrub at teardown). `receive @1` is
unchanged. The rest of this section is the as-built design record.

### The constraint, precisely (cite the real symbols)

The production `Nic.receive @1` dispatches (under the frontier slice-6
`cloud_virtio_net_userspace_clean_tx_rx_split_proof` feature) to
`virtio_net_userspace_rx_dma_proof::nic_receive_independent`
(`kernel/src/cap/virtio_net_userspace_rx_dma_proof.rs:1437`, dispatched from the
`receive @1` arm in `kernel/src/cap/nic_grant_source_prod.rs:309`). Each call:

1. allocates **one** fresh kernel bounce frame (`frame::alloc_frame_zeroed`),
2. authors one RX descriptor + avail entry pointing at it and rings the RX
   doorbell,
3. waits on the driver's owned RX interrupt route
   (`device_interrupt::wait_kernel_injected_dispatch`) and retires the deferred
   LAPIC EOI,
4. polls the RX used ring for **one** completion, copies the frame out, and
5. **frees** that bounce frame (`frame::try_free_frame`).

Two properties make this single-frame primitive unable to serve TCP:

- **No pool stays armed between calls.** Exactly one RX buffer is posted per
  call and freed when the call returns; between `receive` calls the device has
  no posted RX buffer to master into. A frame that arrives outside a call has
  nowhere to land.
- **No non-resetting "no frame yet".** On a successful frame the independent
  path keeps the device live (it does *not* reset -- success arm at
  `virtio_net_userspace_rx_dma_proof.rs:1535`), but an empty/timeout poll
  (`RxDmaFailure::UsedRingPollExhausted`) takes the error arm and **quiesces the
  device** (`nic_quiesce_device`: `reset_device` + assert queues cleared +
  release pins). The predecessor proof path `drive_rx_dma`
  (`virtio_net_userspace_rx_dma_proof.rs:331`) resets on *every* outcome; the
  independent path narrowed that to reset-on-empty-poll, but the reset-on-empty
  remains. So a speculative "is there a frame?" poll with nothing waiting tears
  the device down.

`smoltcp` drives the opposite shape: its `poll()` loop calls the `phy::Device`
RX token speculatively and frequently, expecting **"no frame yet" to be a cheap,
side-effect-free answer** against a device that **stays armed** with multiple
posted buffers, so the asynchronous, multi-frame TCP exchange (SYN-ACK, data
segments, ACKs, retransmits arriving whenever the peer chooses, sometimes
several in quick succession) can be drained as frames arrive. That is why
7c-iii needed the since-landed sustained-receive ABI before the remaining
7c-ii(b) final-wiring task could stay on hold solely for the operator
architecture decision.

The reset-on-empty is **not a bug**: it is part of the settled DMA-isolation
model (`docs/dma-isolation-design.md`). Frames cross through manager-owned
bounce pages; userspace never sees a host-physical or device-usable address;
and after a buffer is reclaimed the device must be proven to have stopped
mastering it ("if in-flight DMA cannot be proven stopped, revocation escalates
to device reset", `docs/dma-isolation-design.md` `DMAPool` Invariants ->
**Reset**). The single-frame path proves that the crude way -- reset the whole
device. The design problem is to keep the isolation guarantee **without**
resetting the device on every empty poll or every reclaimed buffer.

### The sustained-receive primitive

Keep the device armed with a **kernel-owned bounce RX pool** of `N` buffers and
recycle buffers individually instead of resetting the device:

- **Arm.** At first use (or at cap setup) the kernel allocates `N` manager-owned
  bounce RX buffers from the driver's granted `DMAPool` and posts all `N` to the
  RX vring avail ring. The device masters **only** into these kernel-owned bounce
  pages; userspace still receives no host-physical or device-usable address
  (`host_physical_user_visible = 0`), exactly as the single-frame path proved.
- **Drain one arrived frame (per poll).** The kernel reads the RX used ring. If
  `used.idx` advanced, a frame landed in bounce slot `k`; the kernel treats slot
  `k` as device-written untrusted input (`docs/dma-isolation-design.md`
  "Receive buffers are treated as device-written untrusted input until validated
  by the driver or stack"), copies the frame bytes out into the inline `Data`
  reply bounded by the posted buffer length, then **recycles** slot `k`.
- **No frame yet.** If `used.idx` did not advance, the call returns "no frame"
  with **no reset** and the device stays armed. This is the cheap speculative
  poll `smoltcp` needs.
- **Teardown.** `on_release` (and any unprovable-in-flight-DMA error) still
  quiesces: `reset_device`, assert both queues cleared, scrub the whole pool,
  release the enabled-vring pins, and drop the pool -- identical to the existing
  `nic_quiesce` discipline. Reset remains the escalation path; it is simply no
  longer the per-frame path.

**The per-buffer invariant that replaces "reset before reclaim":** *a bounce
slot is re-exposed to the device only after its copy-out completes and its slot
ownership generation is bumped, with the slot scrubbed before the re-post.* This
is the production handle-epoch slot identity (`slot` + `slot_generation`,
`docs/dma-isolation-design.md` **Production Handle Epoch Invariants** and
`DMAPool` Invariants: "Buffer operations additionally check the buffer slot and
slot generation before descriptor validation, completion accounting, free,
scrub, or reuse") applied at **buffer-recycle granularity** instead of
device-reset granularity:

1. the device wrote slot `k` and signalled completion via the used ring (it is
   no longer mastering that slot -- the per-buffer analogue of "in-flight DMA is
   proven stopped");
2. the kernel copies the bytes out;
3. the kernel scrubs slot `k` (residual-state rule, `docs/dma-isolation-design.md`
   **Residual state**) and bumps its slot generation, so a stale descriptor,
   free, or completion for the prior occupant fails closed;
4. only then does the kernel re-post slot `k` to the avail ring (re-arm).

As built, the pool buffers are kernel-private `frame::alloc_frame_zeroed` pages
(never a userspace `DMABuffer` handle), so the slice does not go through the
single-frame `device_dma` `begin_rx_dma_drive_pin` drive-pin code path (that pin
guards a *userspace-submitted* `DMABuffer`'s live-unmapped re-validation, which
has no analogue for a manager-private frame). Instead the slice applies the
**same per-buffer slot-identity discipline modeled on** the production
handle-epoch slot identity (`slot` + `slot_generation`,
`docs/dma-isolation-design.md`) as kernel bookkeeping local to the pool: the
device masters only into kernel-owned pages, each slot is scrubbed and its
generation bumped before re-exposure, and teardown still quiesces (reset) before
reclaim. No new isolation backend, no new IOVA-export rule, no host-physical or
device-usable address exported (`host_physical_user_visible = 0`).

### ABI decision: extend `Nic` with a non-resetting poll receive (option a)

**Chosen: option (a)** -- add a non-resetting poll-receive method to the `Nic`
schema and keep `receive @1` as the legacy single-shot:

```capnp
interface Nic {
    transmit  @0 (frame :Data) -> (result, reason, sideEffect);
    receive   @1 () -> (frame :Data, observedEthertype :UInt16,
                        result, reason, sideEffect);   # legacy single-shot
    macAddress @2 () -> (addr :Data, result, reason, sideEffect);
    linkStatus @3 () -> (up :Bool, result, reason, sideEffect);
    # Sustained, non-resetting receive over the armed kernel-owned bounce RX
    # pool. Returns the next arrived frame, or framePresent = false with no
    # device reset when none has arrived. The device stays armed.
    receivePoll @4 () -> (frame :Data, observedEthertype :UInt16,
                          framePresent :Bool,
                          result :Text, reason :Text, sideEffect :Text);
}
```

How `receivePoll @4` reports its two outcomes without a reset:

| Outcome | `framePresent` | `frame` | `result` / `reason` / `sideEffect` |
|---|---|---|---|
| frame arrived | `true` | frame bytes inline | `ok` / `frame-received` / `buffer-recycled` (slot copied-out, generation bumped, re-posted) |
| no frame yet | `false` | empty | `ok` / `no-frame` / `device-armed` (no reset; pool still posted) |
| fail closed | `false` | empty | `failed` / `<reason>` / device quiesced (escalation only, not the empty-poll path) |

`receive @1` semantics are **unchanged** (it still resets on empty poll), so the
7b UDP request/response proof stays green; only the new method is non-resetting.

**Why not option (b) (in-stack sustained-receive loop against a kernel
primitive).** Option (b) -- the network-stack process drives a bounded
sustained-receive loop against a kernel primitive directly -- was rejected
because it still requires the identical new kernel RX-pool machinery (the design
above) *and* drives it outside the typed `Nic` cap boundary, fragmenting the
device-facing authority the whole track funnels through one `Nic` cap (slices
4b-7). Option (a) keeps the network-stack process holding exactly the `Nic`
cap it already holds, maps the new method one-to-one onto the `smoltcp`
`phy::Device` RX token ("give me the next arrived frame or nothing"), and is the
direct extension of the inline-`Data` receive ABI the track already accepted.
The cost of (a) is bounded and already named in the abi hazard: one new schema
method, a `make generated-code-check` regeneration of the checked-in capnp
bindings, and updating every exhaustive `Nic` method match (`receivePoll @4`
arm in `kernel/src/cap/nic_grant_source_prod.rs`; the `NicClient` in
`capos-rt`). A `receiveBatch` returning `List(Data)` was considered and deferred:
the `smoltcp` RX token consumes one frame at a time, so single-frame poll is the
clean match and batching is a separately-justified later optimization.

## Design Grounding

- `docs/proposals/network-reachable-datapath-scope-decision.md` -- the parent
  scope decision (Option A) that opened this track.
- `docs/proposals/networking-proposal.md`, Part 3 (Phase C) -- the accepted
  decomposition, its prerequisites table and exit criteria, and the `Nic` draft
  this doc adopts (inline `Data`).
- `docs/dma-isolation-design.md` S.11.2 -- the DMA-isolation invariants and the
  driver-transition gate, already satisfied by the landed bounce / IOMMU-IOVA
  track that slice 2 reuses.
- `kernel/src/cap/device_mmio.rs` (`notify_doorbell`),
  `kernel/src/cap/dma_buffer.rs` (export labels),
  `kernel/src/cap/virtio_net_polled_provider.rs` (kernel-owned vring) -- the
  primitives whose current limits define the cap-surface gap.
