# Proposal: Userspace TCP/IP Networking

How capOS gets from "kernel boots" to "userspace process opens a TCP connection."

The host-local Telnet flow on `127.0.0.1:2323` described in Part 2 was a
plaintext, loopback-only **research demo**, not a shippable Telnet service. It
exercised the
`TerminalSession`/`SessionManager`/`AuthorityBroker`/`RestrictedShellLauncher`
boundary over a real TCP socket on the path toward the SSH Shell Gateway
(see [ssh-shell-proposal.md](ssh-shell-proposal.md)). The Telnet demo target is
now retired because it depended on the removed qemu-only kernel TCP listener.
Non-loopback exposure, production credential handling, and any treatment of
Telnet as a long-lived service remain out of scope.

**Historical trust-boundary debt:** Phase A/B kept the smoltcp stack, per-port
TCP listener and accepted-socket capability state, UDP socket cap state, line
discipline byte handler, and Telnet IAC filter inside the kernel. Phase C has
now retired that kernel owner: `kernel` no longer depends on `smoltcp`, the
qemu-only TCP/UDP socket entry points fail closed, and the
`run-network-client`, `run-tcp-listen-authority`, `run-telnet`, and
`run-posix-dns-smoke` fixtures exit with retirement diagnostics. The forward
path is the userspace network stack over `DeviceMmio`/`DMAPool`/`Interrupt`
authority and typed NIC/socket capabilities. New protocol logic belongs in
that Phase C userspace stack.

The Device Driver Foundation now has a bounded provider-consumer proof for one
selected virtio-net TX route: a manifest-granted service can compose
`DMAPool`, `DeviceMmio`, and `Interrupt` authority, validate the selected
bounce-buffer descriptor path, publish a bounded provider-owned queue entry,
ring the selected notify doorbell after policy gates, and consume the matching
used-ring completion through a route-scoped `tx_interrupt.wait` event. That is
proof coverage for a selected manager-owned route, not Phase C completion. It
does not grant full NIC ownership, arbitrary MMIO doorbells, hardware
ack/mask/unmask ownership, direct DMA, IOMMU programming, broader completion
queue ownership, provider storage/NIC drivers, cloud NIC support, or
production networking readiness.

This document has four parts:

- a historical **kernel-internal smoke test** that proved virtio-net and smoltcp,
- historical **in-kernel capability interfaces** for TCP sockets and the Telnet
  Shell Demo,
- **userspace decomposition** after driver authority capabilities exist, and
- cross-cutting TLS and open design questions.

---

## Part 1: Kernel-Internal Networking (Phase A)

Prove that capOS can send and receive TCP/IP traffic. Everything runs in-kernel
— no IPC, no capability syscalls, no multiple processes needed.

### What's Needed

1. **PCI enumeration** — scan config space, find virtio-net device. Uses the
   standalone PCI/PCIe subsystem described in
   [cloud-deployment-proposal.md](cloud-deployment-proposal.md) Phase 4 (~200 lines
   of glue code on top of the shared PCI infrastructure)
2. **virtio-net driver** — init virtqueues, send/receive raw Ethernet frames.
   Use `virtio-drivers` crate or implement manually (~600-800 lines)
3. **Timer** — PIT or LAPIC timer for `smoltcp`'s poll loop (retransmit
   timeouts, `Instant::now()` support). Not a full scheduler — just a
   monotonic clock (~50-100 lines)
4. **smoltcp integration** — implement `phy::Device` trait over the in-kernel
   driver, create an `Interface` with static IP, ICMP ping, then TCP
5. **QEMU flags** — add `-netdev user,id=n0 -device virtio-net-pci,netdev=n0`
   to the Makefile

Current implementation status: PCI enumeration, `make run-net`, modern virtio
PCI transport capability discovery, feature negotiation, RX/TX split-virtqueue
initialization, descriptor-accounting guard evidence, ARP resolution, and ICMP
echo validation are implemented as lower-layer QEMU fixture evidence. The QEMU
default device currently appears as transitional `1af4:1000` but exposes
standard modern vendor capabilities; capOS accepts it only after finding
bounded MMIO common, notify, ISR, and device-specific config regions. The
kernel negotiates `VIRTIO_F_VERSION_1`, `VIRTIO_NET_F_MRG_RXBUF`, and `VIRTIO_NET_F_MAC` when
safe, allocates kernel-owned DMA pages for the RX/TX queue metadata plus packet
buffers, sets `DRIVER_OK`, submits device-valid TX descriptors, posts RX
descriptors, resolves the QEMU user-mode gateway `10.0.2.2` with ARP from
static guest address `10.0.2.15`, then validates an IPv4 ICMP echo reply from
the gateway, including the reply checksums. The former kernel smoltcp adapter,
TCP HTTP smoke, and scheduler-polled socket runtime are retired; the
`make qemu-net-harness` path now asserts the lower-layer QEMU fixture evidence
instead of a host-backed kernel TCP proof. Current TCP/UDP socket proof lives in
the Phase C userspace network-stack gates, including
`make run-cloud-prod-userspace-network-stack-smoltcp`.
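
The feature-negotiation step above can be sketched as follows. This is a minimal illustration, not the capOS driver: the bit positions come from the virtio specification, while `negotiate_features` and `NegotiateError` are hypothetical names, and the "when safe" policy is reduced to "accept an optional feature only if the device offered it".

```rust
/// Feature bit positions as defined by the virtio specification.
const VIRTIO_NET_F_MAC: u64 = 1 << 5;
const VIRTIO_NET_F_MRG_RXBUF: u64 = 1 << 15;
const VIRTIO_F_VERSION_1: u64 = 1 << 32;

#[derive(Debug, PartialEq, Eq)]
pub enum NegotiateError {
    /// The device does not offer the modern (version 1) interface.
    MissingVersion1,
}

/// Hypothetical helper: pick the feature set the driver writes back.
/// VERSION_1 is mandatory; MRG_RXBUF and MAC are taken only if offered.
/// Any other offered bit is dropped, so the driver never acknowledges a
/// feature it does not implement.
pub fn negotiate_features(offered: u64) -> Result<u64, NegotiateError> {
    if (offered & VIRTIO_F_VERSION_1) == 0 {
        return Err(NegotiateError::MissingVersion1);
    }
    let optional = VIRTIO_NET_F_MRG_RXBUF | VIRTIO_NET_F_MAC;
    Ok(VIRTIO_F_VERSION_1 | (offered & optional))
}
```

Masking against an explicit allowlist keeps the accepted set a subset of what the device offered, which is the property the device checks before the driver sets `DRIVER_OK`.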

### Milestones

- [x] **Ping**: ICMP echo to QEMU gateway (10.0.2.2 with default user-mode
  net). Achieved by commit `b56a5c1` at `2026-04-24 15:37 UTC`.
- [x] **HTTP**: TCP connection to a host-side server, send GET, receive
  response. Achieved by commit `a4f1722` at `2026-04-24 16:47 UTC`.

### Estimated Scope

~1000-1500 lines of new kernel code. ~200 more for TCP on top of ping.

### Crate Dependencies

| Crate | Purpose | no_std |
|---|---|---|
| `smoltcp` | TCP/IP stack | yes (features: `medium-ethernet`, `proto-ipv4`, `socket-tcp`) |
| `virtio-drivers` | virtio device abstraction | yes (optional — can implement manually) |

### Timer Source Decision

**Historical Phase B resolution:** the scheduler timer advanced the monotonic
`TICK_COUNT` (AtomicU64 in `kernel/src/arch/x86_64/context.rs`), and the
retained kernel smoltcp runtime used that clock instead of a bounded synthetic
10 ms-per-poll clock. Phase C cleanup removed that retained runtime; scheduler
ticks no longer poll kernel smoltcp.

### Intermediate Tickless Bridge

The retained smoltcp runtime described below is retired. The bridge rules are
archival context for why scheduler-polled kernel networking was not acceptable
as a long-term tickless/nohz design. Future socket progress belongs in the
userspace stack or an IRQ/deadline-driven device path, not in scheduler polling.

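```rust
// Archival sketch; `PollResult` is left undefined here.
trait NetworkPollClock {
    /// Earliest time (ns) at which the stack next needs a poll, if any.
    fn next_poll_deadline_ns(&self, now_ns: u64) -> Option<u64>;
    /// Run network progress for at most `budget_ns`, starting at `now_ns`.
    fn poll_until_budget(&mut self, now_ns: u64, budget_ns: u64) -> PollResult;
}
```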

Historical bridge rules:

- a retained smoltcp runtime would have needed to expose `NetworkPollClock`
  before active networking could coexist with tickless idle;
- the scheduler would have included `next_poll_deadline_ns` in
  `earliest_global_deadline()`;
- `poll_until_budget` would have been the only scheduler/idle-exit network
  progress path;
- the budget would have bounded work done outside ordinary process execution;
- absent this bridge, active networking would have forced periodic tick;
- SQPOLL/nohz isolated CPUs would not have run retained network scheduler
  polling.
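
As an archival illustration of the two central bridge rules, the sketch below merges a network poll deadline into a scheduler deadline and runs a budget-bounded work loop. The function names and the `(cost_ns, more_work)` step shape are assumptions made for this example; they were never capOS kernel APIs.

```rust
/// Earliest of the scheduler's own deadline and the network stack's next
/// poll deadline. `None` means neither side needs a wakeup, so idle can
/// stay tickless.
pub fn earliest_deadline(sched_ns: Option<u64>, net_ns: Option<u64>) -> Option<u64> {
    match (sched_ns, net_ns) {
        (Some(a), Some(b)) => Some(a.min(b)),
        (a, b) => a.or(b),
    }
}

/// Run `step` (one unit of stack work, returning its cost in ns and
/// whether more work remains) until the stack is idle or the budget is
/// spent. Returns the ns consumed, which can exceed the budget by at
/// most the cost of the final step.
pub fn poll_with_budget(budget_ns: u64, mut step: impl FnMut() -> (u64, bool)) -> u64 {
    let mut spent = 0u64;
    while spent < budget_ns {
        let (cost_ns, more_work) = step();
        spent = spent.saturating_add(cost_ns);
        if !more_work {
            break;
        }
    }
    spent
}
```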

### QEMU Network Config

| Config | Use case |
|---|---|
| `-netdev user,id=n0 -device virtio-net-pci,netdev=n0` | Default: NAT, guest reaches host |
| `-netdev user,id=n0,hostfwd=tcp:127.0.0.1:2323-:23 -device virtio-net-pci,netdev=n0` | Historical host-local TCP forwarding for the retired Telnet Shell Demo |

---

## Part 2: Capability Interfaces — In-Kernel (Phase B)

Phase B turns the Phase A smoke path into first-class TCP capabilities without
moving any code out of the kernel. The `NetworkManager`, `TcpListener`, and
`TcpSocket` objects become kernel-side `CapObject`s that user processes invoke
through the existing capability ring. The in-kernel smoltcp stack stays where
it is; what changes is that it is reached over capability dispatch instead of
a hard-coded boot-time call. UDP and raw `Nic` exposure are not part of this
milestone.

Phase B is the first point where a userspace process — the native shell, a
boot-package demo, a language runtime — can open a TCP socket. It is also the
first point where a visible networking milestone exists at the capability
level.

**Visible Phase B milestone: Telnet Shell Demo (historical; delivered and later
retired with the kernel socket owner).** Boot capOS in QEMU with
`-netdev user,id=n0,hostfwd=tcp:127.0.0.1:2323-:23 -device virtio-net-pci,netdev=n0`.
Init starts a dedicated `telnet-gateway` service with scoped port-23 listen
authority and restricted shell-launch authority, then gives the child shell
only the exact grants described below. On accept, the gateway refuses a bounded
initial Telnet option negotiation burst and acts as the terminal host for that
connection. It exposes a socket-backed `TerminalSession` to `capos-shell`, not
a raw `TcpSocket`, `ByteStream`, or `StdIO` replacement for the shell's
existing terminal boundary. From the host:

```
$ telnet 127.0.0.1 2323
capos login: <anon>
capos$ help
capos$ exit
Connection closed by foreign host.
```

The same boot proves the shell does not know or care whether its interactive
terminal is UART, framebuffer, or TCP-backed Telnet — the `TerminalSession`
provider is interchangeable while the shell-facing authority stays the same.
It also exercises the full TCP listener/accept path, not just the outbound
connect path used by the Phase A HTTP smoke.

`telnet` (RFC 854) is deliberate demo wiring: plaintext, no crypto, no
authentication of its own. The QEMU target binds the host forward to
`127.0.0.1:2323` only and forwards to guest port 23, so the proof is a
host-local development demo rather than a remote-access feature. It is not a
production access path and will be replaced by the SSH gateway described in
[ssh-shell-proposal.md](ssh-shell-proposal.md) once host-key, user-key,
account, audit, and persistence prerequisites are implementable. The value is
that Telnet is the cheapest forcing function for a server-side TCP capability
and for a socket-backed terminal host. The shell still requires credential
verification through the existing login flow
([boot-to-shell-proposal.md](boot-to-shell-proposal.md)); the Telnet transport
only replaces the physical UART, not the login policy.
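
To make "refuses Telnet option negotiation" concrete, here is a minimal sketch of an IAC filter in the RFC 854 style: every `DO` is answered with `WONT`, every `WILL` with `DONT`, and payload bytes pass through. `filter_telnet` is a hypothetical name, not the retired gateway's code, and the sketch omits subnegotiation (`SB`/`SE`) and the burst bound a real terminal host would enforce.

```rust
// Telnet command bytes from RFC 854.
const IAC: u8 = 255;
const DONT: u8 = 254;
const DO: u8 = 253;
const WONT: u8 = 252;
const WILL: u8 = 251;

/// Split raw socket bytes into (terminal payload, refusal replies).
/// Hypothetical helper for illustration only.
pub fn filter_telnet(input: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let mut payload = Vec::new();
    let mut replies = Vec::new();
    let mut i = 0;
    while i < input.len() {
        if input[i] != IAC {
            payload.push(input[i]);
            i += 1;
            continue;
        }
        match input.get(i + 1).copied() {
            // An escaped 0xFF data byte.
            Some(IAC) => {
                payload.push(IAC);
                i += 2;
            }
            // Peer asks us to enable (DO) or offers to enable (WILL) an
            // option: refuse both.
            Some(cmd @ (DO | WILL)) => {
                if let Some(&opt) = input.get(i + 2) {
                    let refusal = if cmd == DO { WONT } else { DONT };
                    replies.extend_from_slice(&[IAC, refusal, opt]);
                }
                i += 3;
            }
            // Peer already declines an option: nothing to answer.
            Some(DONT | WONT) => i += 3,
            // Other two-byte commands are dropped.
            Some(_) => i += 2,
            // Truncated trailing IAC.
            None => i += 1,
        }
    }
    (payload, replies)
}
```

Because every option is refused, the connection stays in the default network virtual terminal mode, which is enough for a line-oriented login and shell session.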

### Phase B prerequisites

| Prerequisite | State | Why |
|---|---|---|
| Capability syscalls | Stage 4 done (sync) | All Nic/socket access goes through the ring |
| Scheduling + preemption | Stage 5 core done | Socket ops block/wake via the scheduler |
| IPC + capability transfer | Stage 6 3.6 done | Listener hands socket caps to the accepting process |
| `Timer` capability | 7.0.0 done | Historical smoltcp poll clock and socket timeouts; the kernel smoltcp runtime is now retired |
| Scheduler-driven smoltcp poll | retired | The retained smoltcp runtime was polled from scheduler ticks on real `TICK_COUNT`; Phase C cleanup removed it |
| TCP kernel `CapObject`s | retired | `NetworkManager`, `TcpListener`, and `TcpSocket` previously wrapped the retained smoltcp runtime; qemu-only kernel socket entry points now fail closed |
| Socket-backed `TerminalSession` handoff | retired | `TcpSocket.intoTerminalSession` previously consumed a connected socket and returned a move-only `TerminalSession` cap; rebuild this proof on the userspace network stack before using it as validation |
| Shell launch bundle handoff | retired | `telnet-gateway` previously consumed an accepted `TcpSocket` into a move-only `TerminalSession`; the gateway demos are removed and remote-shell coverage lives in the in-guest login smokes (`run-login`, `run-default-web-ui`) |

Phase B does not depend on `DeviceMmio`, `Interrupt`, or `DMAPool` — the NIC
driver stays in the kernel. Security Verification Track S.11.2 is a Phase C
prerequisite, not a Phase B one.

### Phase B schema (kernel `CapObject`s)

These interfaces are now defined in the canonical shared schema
(`schema/capos.capnp`). The current build pipeline watches and generates
bindings for `schema/capos.capnp`; additional networking schema files remain
unnecessary for Phase B.

```capnp
interface NetworkManager {
    getConfig         @0 () -> (addr :Data, netmask :Data, gateway :Data);
    createTcpListener @1 (port :UInt16) -> (listenerIndex :UInt16);
    connectTcp        @2 (addr :Data, port :UInt16) -> (socketIndex :UInt16);
    # POSIX adapter Phase P1.2 Phase A: bind a UDP socket; the created
    # cap is delivered as a transferred result cap.
    createUdpSocket   @3 (localAddr :Data, localPort :UInt16) -> (socketIndex :UInt16);
}

interface TcpListener {
    accept @0 () -> (socketIndex :UInt16, peerAddr :Data, peerPort :UInt16);
    close  @1 () -> ();
}

interface TcpSocket {
    send                @0 (data :Data) -> (bytesSent :UInt32);
    recv                @1 (maxLen :UInt32) -> (data :Data);
    close               @2 () -> ();
    intoTerminalSession @3 () -> (terminalIndex :UInt16);  # retired; fails closed
}

interface UdpSocket {
    sendTo   @0 (addr :Data, port :UInt16, data :Data) -> (bytesSent :UInt32);
    recvFrom @1 (maxLen :UInt32) -> (addr :Data, port :UInt16, data :Data);
    close    @2 () -> ();
}
```

`Nic` stays a separate lower-layer cap (schema shown below) and remains
kernel-internal in Phase B. `UdpSocket` landed for the POSIX adapter Phase
P1.2 Phase A DNS path: the kernel implements it on top of the same retained
smoltcp runtime, and userspace acquires it through `NetworkManager.createUdpSocket`.
It is not part of the Telnet Shell Demo contract.

The ring transport cannot return direct Cap'n Proto capability fields, so
capability-producing methods return result-cap indices in the serialized result
and append `CapTransferResult` records after the message bytes. Runtime clients
adopt those result caps by index.
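
A minimal sketch of index-based adoption, under stated assumptions: the `CapTransferResult` fields below (`cap_id`, `interface_id`) are a hypothetical stand-in, since the real record layout is not shown in this document, and `adopt_result_cap` is an illustrative helper rather than the `capos-rt` API.

```rust
/// Hypothetical stand-in for one transferred-cap record appended after
/// the serialized result message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CapTransferResult {
    pub cap_id: u32,
    pub interface_id: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AdoptError {
    /// The result named an index with no matching appended record.
    IndexOutOfRange,
    /// The record exists but carries a different interface than expected.
    WrongInterface,
}

/// Resolve a result-cap index (for example `socketIndex`) against the
/// records appended to the completion, checking the expected interface
/// before handing the cap to the caller.
pub fn adopt_result_cap(
    records: &[CapTransferResult],
    index: u16,
    expected_interface: u64,
) -> Result<u32, AdoptError> {
    let record = records
        .get(usize::from(index))
        .ok_or(AdoptError::IndexOutOfRange)?;
    if record.interface_id != expected_interface {
        return Err(AdoptError::WrongInterface);
    }
    Ok(record.cap_id)
}
```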

`accept` and `recv` are blocking capability calls for the Phase B demo: they
complete when a connection or received bytes are available, when the socket is
closed, or when the caller's `cap_enter` timeout/cancellation path fires.
`recv(maxLen)` clamps to the kernel/ring result-buffer limits, and `send` may
return a partial byte count. A readiness/poll interface can be added later
without being required for the first remote shell proof.
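
Because `send` may return a partial byte count, callers need a loop to write a whole buffer. The sketch below shows one way to do that. `SocketSend` is a hypothetical trait standing in for the `TcpSocket.send` call, `ChunkedSink` is an in-memory test double, and treating a zero-byte result as a closed socket is an assumption of this example rather than a documented capOS rule.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SendError {
    Closed,
}

/// Hypothetical stand-in for the `TcpSocket.send` capability call.
pub trait SocketSend {
    /// Returns how many bytes were accepted, possibly fewer than offered.
    fn send(&mut self, data: &[u8]) -> Result<u32, SendError>;
}

/// Keep calling `send` until every byte has been accepted.
pub fn send_all<S: SocketSend>(sock: &mut S, mut data: &[u8]) -> Result<(), SendError> {
    while !data.is_empty() {
        let sent = sock.send(data)? as usize;
        if sent == 0 {
            // Assumption: no progress means the peer or socket is gone.
            return Err(SendError::Closed);
        }
        // Never trust the count beyond what was actually offered.
        data = &data[sent.min(data.len())..];
    }
    Ok(())
}

/// In-memory double that accepts at most `max_per_call` bytes per call.
pub struct ChunkedSink {
    pub max_per_call: usize,
    pub received: Vec<u8>,
    pub calls: usize,
}

impl SocketSend for ChunkedSink {
    fn send(&mut self, data: &[u8]) -> Result<u32, SendError> {
        let n = data.len().min(self.max_per_call);
        self.received.extend_from_slice(&data[..n]);
        self.calls += 1;
        Ok(n as u32)
    }
}
```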

### Telnet gateway launch contract

This contract is historical: the `telnet-gateway` demo is removed with the
kernel socket owner and the kernel `SocketTerminalSession`. It is retained as
the authority-model reference for any future userspace terminal host.
`telnet-gateway` was the terminal host for the remote connection. Its minimum
authority was:

- Manifest-forwarded `TcpListenAuthority` badge 23, held by init and forwarded
  to the gateway as the only listener-creation authority for the demo path.
- Manifest-forwarded `RestrictedShellLauncher`, held by init and forwarded to
  the gateway as the only shell process launch authority.
- Pass-through grants for the caps the current shell requires at startup:
  `creds`, `sessions`, `audit`, `broker`, and `system_info`.
- An anonymous `UserSession` minted through `SessionManager` and checked
  through `AuthorityBroker.shellBundle("anonymous")` before launch. The shell
  still performs password login inside `capos-shell` and upgrades the session
  after credential verification.
- A way to provide the child shell a cap named `terminal` whose interface id is
  `TerminalSession`, backed by the accepted TCP socket.

The gateway must not grant the child raw `NetworkManager`, `TcpListener`,
`TcpListenAuthority`, `TcpSocket`, broad `ProcessSpawner`, or
`RestrictedShellLauncher` authority. The retired implementation used the
kernel socket wrapper (`TcpSocket.intoTerminalSession`, now failing closed) to
produce an actual `TerminalSession` `CapObject`; the shell-facing contract
stays `TerminalSession` for any future userspace terminal host.
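
The launch contract can be read as an allowlist check on the child's grant bundle. The sketch below is a hypothetical validator, not gateway code: the grant names and forbidden interface names come from the contract above, while `Grant`, `GrantError`, and `check_child_grants` are invented for illustration.

```rust
pub struct Grant<'a> {
    pub name: &'a str,
    pub interface: &'a str,
}

#[derive(Debug, PartialEq, Eq)]
pub enum GrantError {
    UnexpectedName,
    ForbiddenInterface,
    TerminalNotTerminalSession,
    MissingTerminal,
}

/// Grant names the child shell may receive.
const ALLOWED_NAMES: [&str; 6] =
    ["creds", "sessions", "audit", "broker", "system_info", "terminal"];

/// Interfaces the child shell must never hold, under any grant name.
const FORBIDDEN_INTERFACES: [&str; 6] = [
    "NetworkManager",
    "TcpListener",
    "TcpListenAuthority",
    "TcpSocket",
    "ProcessSpawner",
    "RestrictedShellLauncher",
];

/// Reject any bundle that widens the child's authority past the contract.
pub fn check_child_grants(grants: &[Grant<'_>]) -> Result<(), GrantError> {
    for g in grants {
        if !ALLOWED_NAMES.contains(&g.name) {
            return Err(GrantError::UnexpectedName);
        }
        if FORBIDDEN_INTERFACES.contains(&g.interface) {
            return Err(GrantError::ForbiddenInterface);
        }
        if g.name == "terminal" && g.interface != "TerminalSession" {
            return Err(GrantError::TerminalNotTerminalSession);
        }
    }
    if !grants.iter().any(|g| g.name == "terminal") {
        return Err(GrantError::MissingTerminal);
    }
    Ok(())
}
```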

### Phase B exit criteria

- `schema/capos.capnp` defined the TCP types above; the kernel implemented them
  as `CapObject`s on top of the existing smoltcp interface. Initial implementation
  landed in commit `7446e04` at `2026-04-25 14:48 UTC`; review follow-up added
  timer-safe deferred completion cleanup and `make qemu-network-client-harness`
  userspace coverage for outbound sockets and listener accept. This is
  historical Phase B evidence; qemu-only kernel socket entry points now fail
  closed.
- smoltcp polling was driven from the scheduler, not a synthetic clock, so
  sockets could survive longer than a single early-boot burst. That runtime is
  retired.
- A trusted `telnet-gateway` boot service used `TcpListener`/`TcpSocket`,
  refused the bounded initial Telnet negotiation needed by normal host clients,
  and launched `capos-shell` for the accepted connection with a socket-backed
  `TerminalSession` plus the shell's existing login/session caps. The child
  shell did not receive raw network, TCP listener/socket, broad spawn,
  scoped-listener, or restricted-shell-launcher authority. This target is
  retired.
- A dedicated CUE manifest (`system-telnet.cue`) and a `make run-telnet`
  target historically booted the above and ran a scripted host-side smoke that
  completed a login + one command + clean exit over `telnet 127.0.0.1 2323`.
  `make run-telnet` now exits with a retirement diagnostic.

## Part 3: Userspace Decomposition (Phase C)

Phase C moves the NIC driver and the TCP/IP stack out of the kernel into
separate userspace processes, so the kernel is left with only
`DeviceMmio` / `Interrupt` / `DMAPool` dispatch and the cap-ring transport.
Phase B must be complete first — Phase C is about relocating the code that
Phase B already wrapped in capabilities, not about adding new interfaces at
the socket layer.

> **Sequencing relative to the cloud usable-instance milestone.** The
> [Network-Reachable Datapath Scope Decision](network-reachable-datapath-scope-decision.md)
> (2026-06-02) records that the real-GCE-boot milestone's "reachable network
> stack" requirement means **raw-frame TX/RX** over the live NIC (the polled
> production provider), which the billable cloudboot gate already checks. The
> L4 socket reachability that Phase C delivers is therefore a **separate future
> track sequenced after that milestone**, not a milestone blocker.

### IPv6 Support Status And Task Lane

Current capOS L4 socket behavior has one production forward path: the Phase C
userspace service-object stack. The old qemu-only retained smoltcp runtime that
configured `10.0.2.15/24`, installed a default IPv4 route through `10.0.2.2`,
resolved the gateway with ARP, and proved outbound ICMPv4 plus TCP HTTP is
retired. Non-`qemu` production manifests no longer grant the legacy
kernel-owned socket caps; requests for kernel `network_manager` or
`tcp_listen_authority` fail at bootstrap instead of falling through to
`virtio_stub.rs`, and qemu-only kernel TCP/UDP socket entry points fail closed.
The userspace IPv6 lane now has local link-local / Neighbor Discovery, Router
Advertisement / SLAAC, GCE-style DHCPv6 address configuration, ICMPv6 Echo
Reply, and IPv6 TCP listener/connect proofs.

The socket-address ABI is now explicit about address family rather than
overloading a raw four-byte assumption. `schema/capos.capnp` defines
`IpAddressFamily` (`unspecified` / `ipv4` / `ipv6`) and documents a length
contract on every address `Data` field: empty is `unspecified` (only where the
method allows it), 4 bytes is `ipv4`, and 16 bytes is `ipv6`. `getConfig`
reports the configured `addressFamily` and an `ipv6Supported` flag, so an
all-zero IPv4 config is never misread as an IPv6 state.
`kernel/src/cap/network.rs` decodes addresses through a family-typed
`read_ip_address`, accepts IPv4 on the legacy stack, and fails closed on IPv6
there with a distinct `ipv6Unsupported`-class error and on any other length
with a `malformedAddress` class -- so legacy IPv4-only callers reject IPv6
explicitly instead of treating every non-four-byte value as a generic error.
`capos-rt` surfaces the family and IPv6-support flag on `NetworkConfig`. The
wire format stays source-compatible for existing 4-byte IPv4 callers. The
behavior behind the userspace-service ABI now has bounded local IPv6 routing,
diagnostics, and TCP L4 proofs; private GCE reachability and public IPv6
ingress remain unproved.
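
The length contract can be sketched as a small decoder. This is an illustration of the rule described above, not the code in `kernel/src/cap/network.rs`: `decode_ip_address`, `require_ipv4`, and the enum names are hypothetical, and the error variants only mirror the `ipv6Unsupported` and `malformedAddress` classes by name.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IpAddress {
    Unspecified,
    V4([u8; 4]),
    V6([u8; 16]),
}

#[derive(Debug, PartialEq, Eq)]
pub enum AddressError {
    UnspecifiedNotAllowed,
    Ipv6Unsupported,
    MalformedAddress,
}

/// Decode an address `Data` field by length: empty is unspecified (only
/// where the method allows it), 4 bytes is IPv4, 16 bytes is IPv6.
pub fn decode_ip_address(data: &[u8], allow_unspecified: bool) -> Result<IpAddress, AddressError> {
    match data.len() {
        0 if allow_unspecified => Ok(IpAddress::Unspecified),
        0 => Err(AddressError::UnspecifiedNotAllowed),
        4 => {
            let mut v4 = [0u8; 4];
            v4.copy_from_slice(data);
            Ok(IpAddress::V4(v4))
        }
        16 => {
            let mut v6 = [0u8; 16];
            v6.copy_from_slice(data);
            Ok(IpAddress::V6(v6))
        }
        _ => Err(AddressError::MalformedAddress),
    }
}

/// An IPv4-only stack fails closed on IPv6 with a distinct error class
/// instead of a generic malformed-address error.
pub fn require_ipv4(addr: IpAddress) -> Result<[u8; 4], AddressError> {
    match addr {
        IpAddress::V4(v4) => Ok(v4),
        IpAddress::V6(_) => Err(AddressError::Ipv6Unsupported),
        IpAddress::Unspecified => Err(AddressError::UnspecifiedNotAllowed),
    }
}
```

Keeping the family decision in one decoder means a well-formed IPv6 address and a truncated IPv4 address produce different errors, which is the distinction the ABI change was meant to expose.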

The pinned userspace `smoltcp` dependency is version 0.13.0 in the networking
demo crates, not in `kernel/Cargo.toml`. capOS enables only the features each
userspace proof needs. The crate has IPv6, SLAAC, and ICMP socket features
available, but it does not provide a `socket-dhcpv6` feature matching its
DHCPv4 socket. With the address-family ABI landed, remaining IPv6 work is
explicit userspace stack behavior and GCE reachability rather than kernel
feature enablement.

The protocol gap is larger than "turn on IPv6": with the local
link-local/Neighbor Discovery, Router Advertisement / SLAAC, GCE-style DHCPv6,
ICMPv6 Echo Reply, and IPv6 TCP listener/connect proofs done, capOS still has
no private GCE IPv6 reachability proof or GCE IPv6 firewall proof. The
standards and cloud grounding are:

- [RFC 4861](https://datatracker.ietf.org/doc/html/rfc4861): Neighbor
  Discovery, Router Solicitation/Advertisement, address resolution, and router
  defaults.
- [RFC 4862](https://datatracker.ietf.org/doc/html/rfc4862): stateless address
  autoconfiguration, link-local address generation, and Duplicate Address
  Detection.
- [RFC 4443](https://datatracker.ietf.org/doc/html/rfc4443): ICMPv6 including
  Echo Request / Echo Reply behavior.
- [RFC 8415](https://datatracker.ietf.org/doc/html/rfc8415): DHCPv6 client and
  server exchanges on UDP 546/547.
- [Compute Engine IPv6 configuration](https://docs.cloud.google.com/compute/docs/ip-addresses/configure-ipv6-address):
  dual-stack or IPv6-only subnet requirement, one `/96` per interface, first
  `/128` configured by DHCPv6 from the metadata server, default route via route
  advertisement, and link-local addresses used for Neighbor Discovery.
- [Google Cloud VPC firewall rules](https://docs.cloud.google.com/firewall/docs/firewalls):
  IPv6 rules are supported, each firewall rule uses either IPv4 or IPv6 ranges,
  and IPv6 ingress needs an explicit allow rule before public access is
  reachable.

The resulting task lane is linked from
[`hardware-boot-storage.md`](../backlog/hardware-boot-storage.md#ipv6-support-lane-non-blocking-for-first-public-web-ui).
The
[`cloud-prod-ipv6-architecture-status-grounding`](../tasks/done/2026-06-03/cloud-prod-ipv6-architecture-status-grounding.md)
scope decision is **done** (2026-06-03), and the address-family ABI entry point
[`cloud-prod-network-address-abi-ipv6`](../tasks/done/2026-06-03/cloud-prod-network-address-abi-ipv6.md)
is **done** (2026-06-03) as historical qemu-only kernel socket evidence. That
target is now retired after kernel socket-owner removal; current
address-family/socket behavior is covered by the Phase C userspace IPv4 and
IPv6 gates below.

The following local proofs are all **done** (2026-06-08). The lane then
sequences private GCE IPv6 and public IPv6 ingress/TLS policy tasks on top of
that userspace-stack substrate.

- Link-local/Neighbor Discovery:
  [`cloud-prod-ipv6-link-local-nd-local-proof`](../tasks/done/2026-06-08/cloud-prod-ipv6-link-local-nd-local-proof.md),
  proved by `make run-cloud-prod-ipv6-link-local-nd`.
- Router Advertisement / SLAAC:
  [`cloud-prod-ipv6-ra-slaac-local-proof`](../tasks/done/2026-06-08/cloud-prod-ipv6-ra-slaac-local-proof.md),
  proved by `make run-cloud-prod-ipv6-ra-slaac`.
- GCE-style DHCPv6 address configuration:
  [`cloud-prod-ipv6-dhcpv6-gce-config-local-proof`](../tasks/done/2026-06-08/cloud-prod-ipv6-dhcpv6-gce-config-local-proof.md),
  proved by `make run-cloud-prod-ipv6-dhcpv6-gce-config`.
- ICMPv6 Echo Reply:
  [`cloud-prod-icmpv6-echo-reply-local-proof`](../tasks/done/2026-06-08/cloud-prod-icmpv6-echo-reply-local-proof.md),
  proved by `make run-cloud-prod-icmpv6-echo-reply`.
- IPv6 TCP L4:
  [`cloud-prod-ipv6-tcp-l4-local-proof`](../tasks/done/2026-06-08/cloud-prod-ipv6-tcp-l4-local-proof.md),
  proved by `make run-cloud-prod-ipv6-tcp-l4`.

IPv6 does not block the first public GCE Web UI proof while that proof remains
scoped to IPv4 DHCP, ARP, Phase C L4, private GCE reachability, and reviewed
public HTTPS ingress. It becomes relevant for a later dual-stack or IPv6-only
cloud proof and for public IPv6 ingress policy.

### Network Usability, Resolver, And Post-smoltcp Lane

The network usability backlog is
[`network-usability-post-smoltcp.md`](../backlog/network-usability-post-smoltcp.md).
It records the user-facing work that starts after raw frames and the first
userspace L4 proof: operator status tooling, DHCPv4 lease lifecycle, a typed
system `DnsResolver` cap, POSIX `getaddrinfo` bridging, ping/ping6 diagnostics,
socket readiness/cancel/backpressure semantics, packet trace authority, and
transport policy/status.

Current boundaries are explicit there: the first local DHCP/IPv4 configuration
proof is now done by
[`cloud-prod-network-stack-dhcp-ipv4-config-local-proof`](../tasks/done/2026-06-08/cloud-prod-network-stack-dhcp-ipv4-config-local-proof.md)
and is on the first GCE Web UI critical path, while DHCP renewal/rebind/expiry,
DNS option publication, and operator-visible lease status remain follow-up
work. The local bounded ICMPv4 Echo Reply proof is also done by
[`cloud-prod-icmp-echo-reply-local-proof`](../tasks/done/2026-06-08/cloud-prod-icmp-echo-reply-local-proof.md),
proved by `make run-cloud-prod-icmp-echo-reply`; it answers a bounded local
same-subnet ping and rejects malformed or oversized requests, but it exercises
ICMP *protocol logic* over an in-process `QueuePhyDevice`, not the real bound
NIC. The real-NIC inbound path is now also done by
[`cloud-prod-icmp-echo-reply-real-nic-datapath-local-proof`](../tasks/done/2026-06-08/cloud-prod-icmp-echo-reply-real-nic-datapath-local-proof.md),
proved by `make run-cloud-prod-icmp-echo-reply-real-nic-datapath`: a kernel-owned
responder on the legacy virtio 0.9 datapath acquires a DHCP lease over the real
NIC, then receives an inbound Echo Request over the real RX vring and transmits
an RFC 792 Echo Reply over the same NIC's TX vring (a host peer over a QEMU
`socket` netdev drives the inbound stimulus, since SLIRP drops inbound
host->guest ICMP Echo). Both remain diagnostics rather than Web UI readiness;
the real-NIC proof is the local pre-spend prerequisite for the billable private
GCE ICMP proof, and the same responder serves that live run.

The POSIX DNS smoke is a hand-rolled A-query over `UdpSocket`, not a system
resolver service or typed resolver capability. DNS, operator ping tools, IPv6,
packet tracing, and advanced transport policy are usability/completeness lanes,
not first public Web UI blockers unless a later deployment policy explicitly
promotes one.
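
The bounded ICMPv4 Echo Reply behavior described above (answer a well-formed request, reject malformed or oversized ones) can be sketched at the ICMP-message level. This is a minimal illustration, not the capOS responder: it uses the standard Internet checksum (RFC 1071) and the RFC 792 type codes, handles only the ICMP message without the IPv4 header, and `MAX_ECHO_LEN` is a hypothetical bound chosen for the example.

```rust
/// Hypothetical upper bound on an accepted ICMP message, in bytes.
pub const MAX_ECHO_LEN: usize = 1472;

/// RFC 1071 Internet checksum: one's-complement sum of big-endian 16-bit
/// words, with an odd trailing byte padded with a zero low byte.
pub fn internet_checksum(data: &[u8]) -> u16 {
    let mut sum: u32 = 0;
    let mut words = data.chunks_exact(2);
    for w in &mut words {
        sum += u32::from(u16::from_be_bytes([w[0], w[1]]));
    }
    if let [last] = words.remainder() {
        sum += u32::from(*last) << 8;
    }
    // Fold carries back into the low 16 bits.
    while (sum >> 16) != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}

/// Build an Echo Reply (type 0) for a valid Echo Request (type 8, code 0).
/// Returns `None` for short, oversized, wrong-type, or bad-checksum input.
pub fn build_echo_reply(request: &[u8]) -> Option<Vec<u8>> {
    if request.len() < 8 || request.len() > MAX_ECHO_LEN {
        return None;
    }
    if request[0] != 8 || request[1] != 0 {
        return None;
    }
    // A message carrying a correct checksum sums to zero.
    if internet_checksum(request) != 0 {
        return None;
    }
    let mut reply = request.to_vec();
    reply[0] = 0; // Echo Reply
    reply[2] = 0;
    reply[3] = 0;
    let checksum = internet_checksum(&reply);
    reply[2..4].copy_from_slice(&checksum.to_be_bytes());
    Some(reply)
}
```

The identifier, sequence number, and payload are copied unchanged, so the sender can match the reply to its request.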

The backlog keeps **smoltcp relocation** (Phase C slices 7a-7c: run the selected
`smoltcp` build in userspace, preserve the socket contract) distinct from
**transport policy/status** (the capOS control plane around it). The selected
userspace stack is `smoltcp 0.13.0` and now has bounded local UDP socket-cap,
TCP listener/socket-cap, sustained receive, and serve-from-userspace production
socket-cap proofs. DHCPv4, DHCPv6, IPv6 L4, and ICMPv6 are explicit protocol
proof lanes rather than ambient production readiness claims; retained qemu-only
fixtures remain separate from the production cloudboot path. The done IPv6
protocol proofs (`cloud-prod-ipv6-dhcpv6-gce-config`, `cloud-prod-ipv6-tcp-l4`)
build their smoltcp interface on an in-process `HarnessPhyDevice` and self-declare
`metadata_only=true`; the IPv6 *datapath* over the real bound NIC is now done by
[`cloud-prod-ipv6-real-nic-datapath-local-proof`](../tasks/done/2026-06-09/cloud-prod-ipv6-real-nic-datapath-local-proof.md),
proved by `make run-cloud-prod-ipv6-real-nic-datapath`: a userspace smoltcp service
on a real-`Nic`-backed phy (the IPv4 DHCP datapath `NicPhyDevice` pattern) learns
the default route from a Router Advertisement, configures the GCE-shaped `/128`
via DHCPv6 Solicit/Advertise/Request/Reply, and completes one ICMPv6 Echo probe --
every frame over `Nic.transmit`/`Nic.receivePoll` against a host peer on a QEMU
`socket` netdev (SLIRP has no stateful DHCPv6 server). That proof records the
real-NIC provenance with no `metadata_only`/in-process disclaimer and is the local
pre-spend prerequisite for the billable private GCE IPv6 reachability proof.

No current capOS build enables `socket-tcp-reno`/`socket-tcp-cubic`, so capOS
runs with `CongestionControl::None` by build configuration, not as a reviewed
policy choice. The
[`network-transport-policy-status-decomposition`](../tasks/done/2026-06-03/network-transport-policy-status-decomposition.md)
task records that audit and decomposes read-only transport status, keepalive/
timeout policy inputs, and a deferred congestion-control evaluation gated on
workload evidence.

### Architecture

```
+--------------------------------------------------+
|  Application Process                             |
|    holds: TcpSocket cap, UdpSocket cap, ...      |
|    calls: connect(), send(), recv() via capnp    |
+---------------------------+----------------------+
                            | IPC (capnp messages)
+---------------------------v----------------------+
|  Network Stack Process (userspace)               |
|    smoltcp TCP/IP stack                          |
|    holds: NIC cap (from driver), Timer cap       |
|    implements: TcpSocket, UdpSocket, Dns caps    |
+---------------------------+----------------------+
                            | IPC (capnp messages)
+---------------------------v----------------------+
|  NIC Driver Process (userspace)                  |
|    virtio-net driver                             |
|    holds: DeviceMmio cap, Interrupt cap, DMAPool |
|    implements: Nic cap                           |
+---------------------------+----------------------+
                            | capability syscalls
+---------------------------v----------------------+
|  Kernel                                          |
|    DeviceMmio cap: maps BAR into driver process  |
|    Interrupt cap: routes virtio IRQ to driver    |
|    DMAPool cap: DMA-eligible frames w/o raw PAs  |
|    Timer cap: provides monotonic clock           |
+--------------------------------------------------+
```

Three separate processes, each with minimal authority:

1. **NIC driver** — only has access to the specific virtio-net device
   registers, its interrupt line, and DMA-eligible frames. Implements the
   `Nic` interface.
2. **Network stack** — holds the `Nic` capability from the driver. Runs
   smoltcp. Implements higher-level socket interfaces.
3. **Application** — holds socket capabilities from the network stack. Cannot
   touch the NIC or raw packets directly.

### Phase C prerequisites (beyond Phase B)

| Prerequisite | Owning gate | Why |
|---|---|---|
| `Interrupt` capability | [DDF Task 5](../backlog/hardware-boot-storage.md) + S.11.2 driver-transition gate | NIC driver receives IRQs without ambient authority |
| `DeviceMmio` capability | [DDF Task 5](../backlog/hardware-boot-storage.md) + S.11.2 driver-transition gate | NIC driver accesses device registers under bounded ownership |
| `DMAPool` capability | [DDF Task 5](../backlog/hardware-boot-storage.md) + S.11.1 invariants + S.11.2 gate | DMA-eligible frames without raw physical grants |
| Provider NIC smoke | [DDF Task 6](../backlog/hardware-boot-storage.md) | First end-to-end provider-driver path through reviewed userspace authority instead of the in-kernel ledger |

See [`docs/dma-isolation-design.md`](../dma-isolation-design.md) for the
concrete invariants the three capabilities must satisfy and the Security
Verification Track S.11.2 gate that unblocks moving the NIC driver out of
the kernel. DDF Task 5 expands those invariants into a reviewable cap-table
and ProcessSpawner manifest surface; DDF Task 6 is the first provider NIC
smoke that consumes them end-to-end.

Current Phase C evidence includes:

- the userspace virtio-net driver slices through the clean independent
  `Nic.transmit`/`Nic.receive` split;
- the 7a local userspace `smoltcp` substrate over that `Nic` cap;
- the 7b userspace UDP socket-cap layer;
- the 7c-i inter-process `UdpSocket` proof;
- the 7c-ii(a) inter-process `TcpListener`/`TcpSocket` proof;
- the sustained-receive TCP substrate;
- the 7c-ii(b) local serve-from-userspace production socket-cap proof;
- retirement of the non-`qemu` legacy kernel socket grant path.

The 7c-ii(b) proof starts the userspace network-stack process as the
non-`qemu` cloudboot init process, spawns an application client with only
`Console` plus a userspace-served `TcpListenAuthority`, and completes one local
hostfwd TCP request/response through served `TcpListener`/`TcpSocket` caps. It
is still narrower than the exit criteria below:

- the proof process keeps the existing `DeviceMmio`/`DMAPool`/`Interrupt`
  bring-up caps in-process until the future driver-service split;
- the long-lived service shape is still future work;
- the selected GCE Web UI milestone now consumes the completed DHCP/IPv4
  configuration proof but still needs the local remote-session Web UI L4 proof,
  private GCE reachability, and the tracked Web UI hardening gates.

The legacy kernel `cap/network.rs` / `virtio_stub.rs` socket route is
fixture/negative-path cleanup territory, not the architecture to extend.

### Phase C exit criteria

- NIC driver runs in its own userspace process, holding only `DeviceMmio`,
  `Interrupt`, and `DMAPool` caps.
- Network stack runs in a second userspace process, holding only the `Nic`
  cap from the driver and a `Timer` cap.
- A successor socket-backed terminal or Web UI proof is rebuilt on the
  userspace network stack; the Phase B Telnet fixture is retired after kernel
  socket-owner removal.
- The kernel contains no `smoltcp` dependency and no virtio-net code on the
  hot path.
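
The first two criteria are "holds only" statements, which amounts to a subset check on each process's cap set. A minimal sketch of that check follows; the `Cap` and `Role` enums and the `excess_authority` helper are hypothetical and do not correspond to an existing capOS audit tool.

```rust
use std::collections::BTreeSet;

/// Capability kinds named in the exit criteria (illustrative enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Cap {
    DeviceMmio,
    Interrupt,
    DmaPool,
    Nic,
    Timer,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    NicDriver,
    NetworkStack,
}

/// The cap set each role may hold under the exit criteria.
pub fn permitted(role: Role) -> BTreeSet<Cap> {
    match role {
        Role::NicDriver => BTreeSet::from([Cap::DeviceMmio, Cap::Interrupt, Cap::DmaPool]),
        Role::NetworkStack => BTreeSet::from([Cap::Nic, Cap::Timer]),
    }
}

/// Caps held beyond the permitted set; empty means the criterion holds.
pub fn excess_authority(role: Role, held: &[Cap]) -> Vec<Cap> {
    let permitted = permitted(role);
    let held: BTreeSet<Cap> = held.iter().copied().collect();
    held.difference(&permitted).copied().collect()
}
```

A network-stack process that still holds the `DeviceMmio`/`DMAPool`/`Interrupt` bring-up caps in-process, as the 7c-ii(b) proof does, would report those three caps as excess under this check.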

### Lower-layer capability schema (drafts — used by Phase C)

Phase B does not expose these to userspace; Phase C does. `Timer` is already
implemented (see `schema/capos.capnp`).

> **Phase C track opened (2026-06-02).** The
> [Phase C Userspace NIC Driver Relocation](phase-c-userspace-nic-driver-relocation.md)
> design adopts this inline-`Data` frame ABI as-is (a `DmaBuffer`-handle
> zero-copy variant was considered and rejected to keep the change small; the
> frame stays in a kernel-owned bounce buffer the polled provider already
> proved). The methods carry the capOS `result`/`reason`/`sideEffect` evidence
> triple, and `receive` also reports the observed EtherType. See that doc for the
> cap-surface gap (no pending security ruling -- the writable common-config
> window extends the accepted notify-doorbell selected-write discipline) and the
> bounded slice chain.
>
> **Slice 1 landed (2026-06-02).** The unimplemented `Nic` interface below is now
> in `schema/capos.capnp` so the later coupled-TX/RX slices (3-4) extend it
> rather than introduce it; no `CapObject` implements it yet. Slice 1
> (`cloud-prod-nic-driver-userspace-features-ok-local-proof`) also relocated the
> virtio device handshake to FEATURES_OK into a userspace driver shim over a
> writable selected-write common-config `DeviceMmio` window (the four handshake
> registers admitted on `DeviceMmio.write32`, queue-address writes fail closed);
> proof `make run-cloud-prod-nic-driver-userspace-features-ok`.

The landed `Nic` schema (inline `Data` + the capOS evidence triple):

```capnp
interface Nic {
    transmit @0 (frame :Data)
        -> (result :Text, reason :Text, sideEffect :Text);
    receive  @1 ()
        -> (frame :Data, observedEthertype :UInt16,
            result :Text, reason :Text, sideEffect :Text);
    macAddress @2 () -> (addr :Data, result :Text, reason :Text, sideEffect :Text);
    linkStatus @3 () -> (up :Bool, result :Text, reason :Text, sideEffect :Text);
}
```

The driver relocation reuses the production `DeviceMmio` cap (a read-only BAR
window with selected writes) and `Interrupt` cap (`schema/capos.capnp`) rather
than the simplified `map`/`wait` sketches earlier drafts of this section used.
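
To make the shape of the landed schema concrete, here is a hedged Rust mirror of it with an in-memory loopback implementation. The Rust trait, the `Evidence` and `Received` structs, `LoopbackNic`, and the specific `result`/`reason`/`sideEffect` strings are all assumptions made for illustration; the schema above fixes only the field names and types, not these values or any generated binding.

```rust
use std::collections::VecDeque;

/// The result/reason/sideEffect triple carried by every `Nic` method.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Evidence {
    pub result: String,
    pub reason: String,
    pub side_effect: String,
}

impl Evidence {
    pub fn new(result: &str, reason: &str, side_effect: &str) -> Self {
        Evidence {
            result: result.to_string(),
            reason: reason.to_string(),
            side_effect: side_effect.to_string(),
        }
    }
}

/// Return value of `receive`: the frame plus the observed EtherType.
pub struct Received {
    pub frame: Vec<u8>,
    pub observed_ethertype: u16,
    pub evidence: Evidence,
}

pub trait Nic {
    fn transmit(&mut self, frame: &[u8]) -> Evidence;
    fn receive(&mut self) -> Received;
    fn mac_address(&self) -> ([u8; 6], Evidence);
    fn link_status(&self) -> (bool, Evidence);
}

/// EtherType sits at bytes 12..14 of an untagged Ethernet II frame, big-endian.
pub fn ethertype(frame: &[u8]) -> Option<u16> {
    let bytes = frame.get(12..14)?;
    Some(u16::from_be_bytes([bytes[0], bytes[1]]))
}

/// In-memory stand-in: transmitted frames come back out of `receive`.
pub struct LoopbackNic {
    mac: [u8; 6],
    queue: VecDeque<Vec<u8>>,
}

impl LoopbackNic {
    pub fn new(mac: [u8; 6]) -> Self {
        LoopbackNic { mac, queue: VecDeque::new() }
    }
}

impl Nic for LoopbackNic {
    fn transmit(&mut self, frame: &[u8]) -> Evidence {
        // Placeholder evidence strings; real values are defined by capOS.
        if frame.len() < 14 {
            return Evidence::new("rejected", "frame-shorter-than-header", "none");
        }
        self.queue.push_back(frame.to_vec());
        Evidence::new("ok", "accepted", "frame-queued")
    }

    fn receive(&mut self) -> Received {
        match self.queue.pop_front() {
            Some(frame) => Received {
                observed_ethertype: ethertype(&frame).unwrap_or(0),
                frame,
                evidence: Evidence::new("ok", "frame-available", "frame-dequeued"),
            },
            None => Received {
                frame: Vec::new(),
                observed_ethertype: 0,
                evidence: Evidence::new("empty", "no-frame-pending", "none"),
            },
        }
    }

    fn mac_address(&self) -> ([u8; 6], Evidence) {
        (self.mac, Evidence::new("ok", "static-mac", "none"))
    }

    fn link_status(&self) -> (bool, Evidence) {
        (true, Evidence::new("ok", "loopback-always-up", "none"))
    }
}
```

Because the evidence triple travels with every return value rather than replacing it, a caller can distinguish "no frame pending" from a failure without a separate error channel.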

## Part 4: Cross-cutting

### Userspace language runtimes that need sockets

Userspace language runtimes that map their stdlib socket APIs onto capOS
capabilities consume the same `TcpSocket`/`UdpSocket` surface this proposal
defines. Their socket imports currently fail closed against the retired
Phase A-B kernel-resident socket state described above:

- The POSIX adapter (`libcapos-posix/`) already maps
  `socket(AF_INET, SOCK_DGRAM, 0)`/`sendto`/`recvfrom`/`close` onto the
  Phase B `UdpSocket` cap for the Phase P1.2 Phase B DNS resolver smoke;
  see [userspace-binaries-proposal.md](userspace-binaries-proposal.md)
  and [posix-adapter-proposal.md](posix-adapter-proposal.md).
- WASI Preview 1 `sock_send` / `sock_recv` route through the WASI host
  adapter on top of the same caps. Phase W.6 (sockets) remains blocked on
  socket authority surfacing through the wasm-host CapSet; the W.2
  `ERRNO_NOSYS` refusal harness in
  [`docs/programming-languages.md`](../programming-languages.md) (WASI / WebAssembly
  row) is the current evidence that no socket authority leaks before that
  gate.

Neither track changes the trust boundary. The kernel-resident smoltcp stack
these runtimes originally depended on has been retired, so working sockets for
them must come from the Phase C userspace stack.
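
As a rough picture of what "map onto a cap and fail closed" means for an adapter, the sketch below models a file-descriptor table in front of a UDP socket capability. Everything here is hypothetical: `PosixAdapter`, `UdpAuthority`, the `UdpSocket` trait shape, the `SockError` variants, and the test doubles are illustrative and are not the `libcapos-posix` implementation or its error codes.

```rust
use std::collections::HashMap;

/// Stand-in for the `UdpSocket` capability (method shape is an assumption).
pub trait UdpSocket {
    fn send_to(&mut self, data: &[u8], dest: ([u8; 4], u16)) -> usize;
}

/// Stand-in for whatever authority grants new UDP sockets to the process.
pub trait UdpAuthority {
    fn open(&mut self) -> Box<dyn UdpSocket>;
}

#[derive(Debug, PartialEq, Eq)]
pub enum SockError {
    NoAuthority, // no socket authority in the process's cap set
    BadFd,       // descriptor not backed by a socket cap
}

pub struct PosixAdapter {
    authority: Option<Box<dyn UdpAuthority>>,
    fds: HashMap<i32, Box<dyn UdpSocket>>,
    next_fd: i32,
}

impl PosixAdapter {
    pub fn new(authority: Option<Box<dyn UdpAuthority>>) -> Self {
        PosixAdapter { authority, fds: HashMap::new(), next_fd: 3 }
    }

    /// Models `socket(AF_INET, SOCK_DGRAM, 0)`: refuses without authority.
    pub fn socket_udp(&mut self) -> Result<i32, SockError> {
        let authority = self.authority.as_mut().ok_or(SockError::NoAuthority)?;
        let sock = authority.open();
        let fd = self.next_fd;
        self.next_fd += 1;
        self.fds.insert(fd, sock);
        Ok(fd)
    }

    /// Models `sendto`: forwards to the cap behind the descriptor.
    pub fn sendto(&mut self, fd: i32, data: &[u8], dest: ([u8; 4], u16)) -> Result<usize, SockError> {
        let sock = self.fds.get_mut(&fd).ok_or(SockError::BadFd)?;
        Ok(sock.send_to(data, dest))
    }

    /// Models `close`: dropping the table entry releases the cap.
    pub fn close(&mut self, fd: i32) -> Result<(), SockError> {
        self.fds.remove(&fd).map(|_| ()).ok_or(SockError::BadFd)
    }
}

/// Test double: reports bytes "sent" without touching a network.
pub struct CountingSocket;

impl UdpSocket for CountingSocket {
    fn send_to(&mut self, data: &[u8], _dest: ([u8; 4], u16)) -> usize {
        data.len()
    }
}

/// Test double: an authority that always grants a `CountingSocket`.
pub struct GrantAll;

impl UdpAuthority for GrantAll {
    fn open(&mut self) -> Box<dyn UdpSocket> {
        Box::new(CountingSocket)
    }
}
```

An adapter constructed with `None` refuses every `socket_udp` call, which is the fail-closed behavior described above; one constructed with an authority hands out descriptors that are only as powerful as the caps behind them.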

### TLS Layering

TLS does not live in this proposal: the `TcpSocket` here is the
bottom of the transport stack; a `TlsSocket` wraps it and is
configured from the certificate, trust-store, OCSP, and verifier caps
defined in
[certificates-and-tls-proposal.md](certificates-and-tls-proposal.md).
Keys consumed by TLS come from
[cryptography-and-key-management-proposal.md](cryptography-and-key-management-proposal.md).

Draft shape (tracked in the certificates proposal):

```capnp
interface TlsSocket {
    # Client handshake: wrap an outbound TCP socket with a client config.
    connect @0 (tcp :TcpSocket, config :TlsClientConfig) -> ();
    # Server handshake: accept on a TCP socket with a server config.
    accept  @1 (tcp :TcpSocket, config :TlsServerConfig) -> ();
    send    @2 (data :Data) -> (bytesSent :UInt32);
    recv    @3 (maxLen :UInt32) -> (data :Data);
    close   @4 () -> ();
    peerCertificate @5 () -> (chain :CertificateChain);
    alpnSelected    @6 () -> (protocol :Text);
}
```

### Open Questions

1. **DMA memory management.** Dedicated `DmaAllocator` capability vs extending
   `FrameAllocator` with `allocDma`?
2. **Socket readiness model.** Phase B uses blocking `accept`/`recv` calls
   for the demo. The long-term interface still needs a readiness/poll or
   cancellation shape for multiplexed services.
3. **Buffer ownership.** Copy into IPC message vs shared memory vs capability
   lending?

---

## References

### Crates

- [smoltcp](https://github.com/smoltcp-rs/smoltcp) — `no_std` TCP/IP stack
- [virtio-drivers](https://github.com/rcore-os/virtio-drivers) — `no_std`
  virtio drivers (rCore project)

### Specs

- [virtio 1.2 spec](https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html)
  — Section 5.1 covers network device
- [OSDev Wiki: PCI](https://wiki.osdev.org/PCI),
  [Virtio](https://wiki.osdev.org/Virtio)

### Prior Art

- [rCore](https://github.com/rcore-os/rCore) — virtio-drivers + smoltcp
- [Redox smolnetd](https://gitlab.redox-os.org/redox-os/drivers/-/tree/master/smolnetd)
  — microkernel userspace net stack
- [Fuchsia Netstack3](https://fuchsia.dev/fuchsia-src/concepts/networking/netstack3)
  — capability-oriented, userspace, Rust
- [Hermit](https://github.com/hermit-os/kernel) — unikernel with smoltcp +
  virtio-net

### QEMU

- [QEMU Networking](https://www.qemu.org/docs/master/system/net.html)
- [QEMU virtio-net](https://www.qemu.org/docs/master/system/devices/virtio-net.html)
