Proposal: Userspace TCP/IP Networking
How capOS gets from “kernel boots” to “userspace process opens a TCP connection.”
The host-local Telnet flow on 127.0.0.1:2323 described in Part 2 was a
plaintext, loopback-only research demo, not a shippable Telnet service. It
exercised the
TerminalSession/SessionManager/AuthorityBroker/RestrictedShellLauncher
boundary over a real TCP socket on the path toward the SSH Shell Gateway
(see SSH Shell Gateway). That target is now
retired because it depended on the removed qemu-only kernel TCP listener.
Non-loopback exposure, production credential handling, and any treatment of
Telnet as a long-lived service remain out of scope.
Historical trust-boundary debt: Phase A/B kept the smoltcp stack, per-port
TCP listener and accepted-socket capability state, UDP socket cap state, line
discipline byte handler, and Telnet IAC filter inside the kernel. Phase C has
now retired that kernel owner: the kernel no longer depends on smoltcp, the
qemu-only TCP/UDP socket entry points fail closed, and the
run-network-client, run-tcp-listen-authority, run-telnet, and
run-posix-dns-smoke fixtures exit with retirement diagnostics. The forward
path is the userspace network stack over DeviceMmio/DMAPool/Interrupt
authority and typed NIC/socket capabilities. New protocol logic belongs in
that Phase C userspace stack.
The Device Driver Foundation now has a bounded provider-consumer proof for one
selected virtio-net TX route: a manifest-granted service can compose
DMAPool, DeviceMmio, and Interrupt authority, validate the selected
bounce-buffer descriptor path, publish a bounded provider-owned queue entry,
ring the selected notify doorbell after policy gates, and consume the matching
used-ring completion through a route-scoped tx_interrupt.wait event. That is
proof coverage for a selected manager-owned route, not Phase C completion. It
does not grant full NIC ownership, arbitrary MMIO doorbells, hardware
ack/mask/unmask ownership, direct DMA, IOMMU programming, broader completion
queue ownership, provider storage/NIC drivers, cloud NIC support, or
production networking readiness.
This document has four parts:
- a historical kernel-internal smoke test that proved virtio-net and smoltcp,
- historical in-kernel capability interfaces for TCP sockets and the Telnet Shell Demo,
- userspace decomposition after driver authority capabilities exist, and
- cross-cutting TLS and open design questions.
Part 1: Kernel-Internal Networking (Phase A)
Prove that capOS can send and receive TCP/IP traffic. Everything runs in-kernel — no IPC, no capability syscalls, no multiple processes needed.
What’s Needed
- PCI enumeration: scan config space, find virtio-net device. Uses the standalone PCI/PCIe subsystem described in Cloud Deployment Phase 4 (~200 lines of glue code on top of the shared PCI infrastructure)
- virtio-net driver: init virtqueues, send/receive raw Ethernet frames. Use the virtio-drivers crate or implement manually (~600-800 lines)
- Timer: PIT or LAPIC timer for smoltcp’s poll loop (retransmit timeouts, Instant::now() support). Not a full scheduler, just a monotonic clock (~50-100 lines)
- smoltcp integration: implement the phy::Device trait over the in-kernel driver, create an Interface with static IP, ICMP ping, then TCP
- QEMU flags: add -netdev user,id=n0 -device virtio-net-pci,netdev=n0 to the Makefile
Current implementation status: PCI enumeration, make run-net, modern virtio
PCI transport capability discovery, feature negotiation, RX/TX split-virtqueue
initialization, descriptor-accounting guard evidence, ARP resolution, and ICMP
echo validation are implemented as lower-layer QEMU fixture evidence. The QEMU
default device currently appears as transitional 1af4:1000 but exposes
standard modern vendor capabilities; capOS accepts it only after finding
bounded MMIO common, notify, ISR, and device-specific config regions. The
kernel negotiates VIRTIO_F_VERSION_1, VIRTIO_NET_F_MRG_RXBUF, and MAC when
safe, allocates kernel-owned DMA pages for the RX/TX queue metadata plus packet
buffers, sets DRIVER_OK, submits device-valid TX descriptors, posts RX
descriptors, resolves the QEMU user-mode gateway 10.0.2.2 with ARP from
static guest address 10.0.2.15, then validates an IPv4 ICMP echo reply from
the gateway, including the reply checksums. The former kernel smoltcp adapter,
TCP HTTP smoke, and scheduler-polled socket runtime are retired; the
make qemu-net-harness path now asserts the lower-layer QEMU fixture evidence
instead of a host-backed kernel TCP proof. Current TCP/UDP socket proof lives in
the Phase C userspace network-stack gates, including
make run-cloud-prod-userspace-network-stack-smoltcp.
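The reply-checksum validation mentioned above can be illustrated with the standard RFC 1071 Internet checksum, which both the IPv4 header and ICMPv4 use. This is a minimal sketch, not the capOS fixture code; the function names are hypothetical, and the `u32` accumulator assumes frame-sized inputs.

```rust
/// RFC 1071 Internet checksum: one's-complement sum of 16-bit big-endian
/// words, followed by a final complement. Hypothetical helper name.
pub fn internet_checksum(data: &[u8]) -> u16 {
    let mut sum: u32 = 0;
    let mut chunks = data.chunks_exact(2);
    for pair in &mut chunks {
        sum += u32::from(u16::from_be_bytes([pair[0], pair[1]]));
    }
    // An odd trailing byte is treated as the high byte of a zero-padded word.
    if let [last] = chunks.remainder() {
        sum += u32::from(*last) << 8;
    }
    // Fold carries back into the low 16 bits.
    while (sum >> 16) != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}

/// A received message verifies when the checksum computed over the whole
/// message, including its checksum field, comes out as zero.
pub fn checksum_is_valid(message: &[u8]) -> bool {
    internet_checksum(message) == 0
}
```

For an ICMP Echo Reply the input is the full ICMP message (header plus payload); for the IPv4 header it is the header bytes only.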
Milestones
- Ping: ICMP echo to QEMU gateway (10.0.2.2 with default user-mode net). Achieved by commit b56a5c1 at 2026-04-24 15:37 UTC.
- HTTP: TCP connection to a host-side server, send GET, receive response. Achieved by commit a4f1722 at 2026-04-24 16:47 UTC.
Estimated Scope
~1000-1500 lines of new kernel code. ~200 more for TCP on top of ping.
Crate Dependencies
| Crate | Purpose | no_std |
|---|---|---|
| smoltcp | TCP/IP stack | yes (features: medium-ethernet, proto-ipv4, socket-tcp) |
| virtio-drivers | virtio device abstraction | yes (optional; can implement manually) |
Timer Source Decision
Historical Phase B resolution: the scheduler timer advanced the monotonic
TICK_COUNT (AtomicU64 in kernel/src/arch/x86_64/context.rs), and the
retained kernel smoltcp runtime used that clock instead of a bounded synthetic
10 ms-per-poll clock. Phase C cleanup removed that retained runtime; scheduler
ticks no longer poll kernel smoltcp.
Intermediate Tickless Bridge
The retained smoltcp runtime described below is retired. The bridge rules are archival context for why scheduler-polled kernel networking was not acceptable as a long-term tickless/nohz design. Future socket progress belongs in the userspace stack or an IRQ/deadline-driven device path, not in scheduler polling.
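/// Archival sketch of the retired scheduler/network bridge interface.
trait NetworkPollClock {
/// Earliest time at which the stack needs another poll, or None when idle.
fn next_poll_deadline_ns(now_ns: u64) -> Option<u64>;
/// Run network progress, bounded by budget_ns of work.
fn poll_until_budget(now_ns: u64, budget_ns: u64) -> PollResult;
}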
Historical bridge rules:
- a retained smoltcp runtime would have needed to expose NetworkPollClock before active networking could coexist with tickless idle;
- the scheduler would have included next_poll_deadline_ns in earliest_global_deadline();
- poll_until_budget would have been the only scheduler/idle-exit network progress path;
- the budget would have bounded work done outside ordinary process execution;
- absent this bridge, active networking would have forced periodic tick;
- SQPOLL/nohz isolated CPUs would not have run retained network scheduler polling.
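The deadline-merging rule in the historical bridge can be sketched as follows. This is an illustration only: the source names `earliest_global_deadline()` but does not show its signature, so the parameters here and the `idle_sleep_ns` helper are assumptions.

```rust
/// Hypothetical shape for the scheduler's deadline merge: combine the
/// scheduler's own timer deadline with the network stack's next poll
/// deadline. `None` means no deadline is pending from that source.
pub fn earliest_global_deadline(
    timer_deadline_ns: Option<u64>,
    network_deadline_ns: Option<u64>,
) -> Option<u64> {
    match (timer_deadline_ns, network_deadline_ns) {
        (Some(t), Some(n)) => Some(t.min(n)),
        (Some(d), None) | (None, Some(d)) => Some(d),
        (None, None) => None,
    }
}

/// How long an idle CPU may sleep. `None` means it can stay tickless until
/// an interrupt arrives; `Some(0)` means the deadline has already passed.
pub fn idle_sleep_ns(now_ns: u64, deadline_ns: Option<u64>) -> Option<u64> {
    deadline_ns.map(|d| d.saturating_sub(now_ns))
}
```

With this shape, an idle network stack (no pending retransmit or timeout) contributes `None` and never forces a wakeup, which is the property the periodic-tick fallback could not provide.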
QEMU Network Config
| Config | Use case |
|---|---|
| -netdev user,id=n0 -device virtio-net-pci,netdev=n0 | Default: NAT, guest reaches host |
| -netdev user,id=n0,hostfwd=tcp:127.0.0.1:2323-:23 -device virtio-net-pci,netdev=n0 | Historical host-local TCP forwarding for the retired Telnet Shell Demo |
Part 2: Capability Interfaces — In-Kernel (Phase B)
Phase B turns the Phase A smoke path into first-class TCP capabilities without
moving any code out of the kernel. The NetworkManager, TcpListener, and
TcpSocket objects become kernel-side CapObjects that user processes invoke
through the existing capability ring. The in-kernel smoltcp stack stays where
it is; what changes is that it is reached over capability dispatch instead of
a hard-coded boot-time call. UDP and raw Nic exposure are not part of this
milestone.
Phase B is the first point where a userspace process — the native shell, a boot-package demo, a language runtime — can open a TCP socket. It is also the first point where a visible networking milestone exists at the capability level.
Visible Phase B milestone — Telnet Shell Demo (historical; delivered and later retired with the kernel socket owner). Boot capOS in QEMU with
-netdev user,id=n0,hostfwd=tcp:127.0.0.1:2323-:23 -device virtio-net-pci,netdev=n0.
Init starts a dedicated telnet-gateway service with scoped port-23 listen
authority and restricted shell-launch authority, then gives the child shell
only the exact grants described below.
On accept, the gateway refuses a bounded initial Telnet option negotiation
burst and acts as the terminal host for that connection. It exposes a
socket-backed TerminalSession to capos-shell, not a raw TcpSocket,
ByteStream, or StdIO replacement for the shell’s existing terminal
boundary.
From the host:
$ telnet 127.0.0.1 2323
capos login: <anon>
capos$ help
capos$ exit
Connection closed by foreign host.
The same boot proves the shell does not know or care whether its interactive
terminal is UART, framebuffer, or TCP-backed Telnet — the TerminalSession
provider is interchangeable while the shell-facing authority stays the same.
It also exercises the full TCP listener/accept path, not just the outbound
connect path used by the Phase A HTTP smoke.
telnet (RFC 854) is deliberate demo wiring: plaintext, no crypto, no
authentication of its own. The QEMU target binds the host forward to
127.0.0.1:2323 only and forwards to guest port 23, so the proof is a
host-local development demo rather than a remote-access feature. It is not a
production access path and will be replaced by the SSH gateway described in
SSH Shell Gateway once host-key, user-key,
account, audit, and persistence prerequisites are implementable. The value is
that Telnet is the cheapest forcing function for a server-side TCP capability
and for a socket-backed terminal host. The shell still requires credential
verification through the existing login flow
(Boot to Shell); the Telnet transport
only replaces the physical UART, not the login policy.
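The bounded option refusal the gateway performed on accept can be sketched with plain RFC 854 rules: answer every DO with WONT and every WILL with DONT, and give up once a command budget is exceeded. This is a hypothetical illustration, not the retired kernel IAC filter; `refuse_options`, `Filtered`, and `max_commands` are invented names, the sketch assumes whole commands arrive in one buffer, and it does not handle SB/SE subnegotiation.

```rust
// Telnet command bytes from RFC 854.
const IAC: u8 = 255;
const DONT: u8 = 254;
const DO: u8 = 253;
const WONT: u8 = 252;
const WILL: u8 = 251;

/// Output of filtering one received buffer.
#[derive(Debug, Default, PartialEq)]
pub struct Filtered {
    /// Payload bytes to hand to the terminal session.
    pub data: Vec<u8>,
    /// Refusal bytes to send back to the client.
    pub replies: Vec<u8>,
}

/// Strips Telnet commands from `input` and refuses every option request.
/// Returns `None` once more than `max_commands` option commands are seen,
/// so a client cannot keep the gateway negotiating indefinitely.
pub fn refuse_options(input: &[u8], max_commands: usize) -> Option<Filtered> {
    let mut out = Filtered::default();
    let mut commands = 0usize;
    let mut i = 0usize;
    while i < input.len() {
        if input[i] != IAC {
            out.data.push(input[i]);
            i += 1;
            continue;
        }
        let Some(&cmd) = input.get(i + 1) else { break }; // truncated command
        match cmd {
            IAC => {
                // An escaped 0xFF is a literal data byte.
                out.data.push(IAC);
                i += 2;
            }
            DO | DONT | WILL | WONT => {
                let Some(&opt) = input.get(i + 2) else { break };
                commands += 1;
                if commands > max_commands {
                    return None;
                }
                // DO is answered with WONT and WILL with DONT; DONT and WONT
                // already describe the disabled state and need no reply.
                match cmd {
                    DO => out.replies.extend_from_slice(&[IAC, WONT, opt]),
                    WILL => out.replies.extend_from_slice(&[IAC, DONT, opt]),
                    _ => {}
                }
                i += 3;
            }
            // Other two-byte commands (NOP, AYT, ...) are dropped.
            _ => i += 2,
        }
    }
    Some(out)
}
```

A real filter would keep parser state across reads so that a command split over two TCP segments is not lost.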
Phase B prerequisites
| Prerequisite | State | Why |
|---|---|---|
| Capability syscalls | Stage 4 done (sync) | All Nic/socket access goes through the ring |
| Scheduling + preemption | Stage 5 core done | Socket ops block/wake via the scheduler |
| IPC + capability transfer | Stage 6 3.6 done | Listener hands socket caps to the accepting process |
| Timer capability | 7.0.0 done | Historical smoltcp poll clock and socket timeouts; the kernel smoltcp runtime is now retired |
| Scheduler-driven smoltcp poll | retired | The retained smoltcp runtime was polled from scheduler ticks on real TICK_COUNT; Phase C cleanup removed it |
| TCP kernel CapObjects | retired | NetworkManager, TcpListener, and TcpSocket previously wrapped the retained smoltcp runtime; qemu-only kernel socket entry points now fail closed |
| Socket-backed TerminalSession handoff | retired | TcpSocket.intoTerminalSession previously consumed a connected socket and returned a move-only TerminalSession cap; rebuild this proof on the userspace network stack before using it as validation |
| Shell launch bundle handoff | retired | telnet-gateway previously consumed an accepted TcpSocket into a move-only TerminalSession; the gateway demos are removed and remote-shell coverage lives in the in-guest login smokes (run-login, run-default-web-ui) |
Phase B does not depend on DeviceMmio, Interrupt, or DMAPool — the NIC
driver stays in the kernel. Security Verification Track S.11.2 is a Phase C
prerequisite, not a Phase B one.
Phase B schema (kernel CapObjects)
These interfaces are now defined in the canonical shared schema
(schema/capos.capnp). The current build pipeline watches and generates
bindings for schema/capos.capnp; additional networking schema files remain
unnecessary for Phase B.
interface NetworkManager {
getConfig @0 () -> (addr :Data, netmask :Data, gateway :Data);
createTcpListener @1 (port :UInt16) -> (listenerIndex :UInt16);
connectTcp @2 (addr :Data, port :UInt16) -> (socketIndex :UInt16);
# POSIX adapter Phase P1.2 Phase A: bind a UDP socket; the created
# cap is delivered as a transferred result cap.
createUdpSocket @3 (localAddr :Data, localPort :UInt16) -> (socketIndex :UInt16);
}
interface TcpListener {
accept @0 () -> (socketIndex :UInt16, peerAddr :Data, peerPort :UInt16);
close @1 () -> ();
}
interface TcpSocket {
send @0 (data :Data) -> (bytesSent :UInt32);
recv @1 (maxLen :UInt32) -> (data :Data);
close @2 () -> ();
intoTerminalSession @3 () -> (terminalIndex :UInt16); # retired; fails closed
}
interface UdpSocket {
sendTo @0 (addr :Data, port :UInt16, data :Data) -> (bytesSent :UInt32);
recvFrom @1 (maxLen :UInt32) -> (addr :Data, port :UInt16, data :Data);
close @2 () -> ();
}
Nic stays a separate lower-layer cap (schema shown below) and remains
kernel-internal in Phase B. UdpSocket landed for the POSIX adapter Phase
P1.2 Phase A DNS path: the kernel implements it on top of the same retained
smoltcp runtime, and userspace acquires it through NetworkManager.createUdpSocket.
It is not part of the Telnet Shell Demo contract.
The ring transport cannot return direct Cap’n Proto capability fields, so
capability-producing methods return result-cap indices in the serialized result
and append CapTransferResult records after the message bytes. Runtime clients
adopt those result caps by index.
accept and recv are blocking capability calls for the Phase B demo: they
complete when a connection or received bytes are available, when the socket is
closed, or when the caller’s cap_enter timeout/cancellation path fires.
recv(maxLen) clamps to the kernel/ring result-buffer limits, and send may
return a partial byte count. A readiness/poll interface can be added later
without being required for the first remote shell proof.
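Because send may return a partial byte count, callers need a loop to deliver a whole buffer. A minimal sketch of that loop follows, under stated assumptions: `PartialSend` is a hypothetical stand-in for a client-side TcpSocket handle (the capos-rt API is not shown here), and treating a zero-byte result as a closed socket is a choice made for this example.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SendError {
    Closed,
}

/// Hypothetical stand-in for a client-side TcpSocket handle.
pub trait PartialSend {
    /// Accepts some prefix of `data` and returns its length.
    fn send(&mut self, data: &[u8]) -> Result<usize, SendError>;
}

/// Calls `send` repeatedly until every byte of `data` has been accepted.
pub fn send_all<S: PartialSend>(sock: &mut S, mut data: &[u8]) -> Result<(), SendError> {
    while !data.is_empty() {
        let sent = sock.send(data)?;
        if sent == 0 {
            // Assumption for this sketch: no progress means the peer is gone.
            return Err(SendError::Closed);
        }
        // Clamp defensively in case an implementation over-reports.
        data = &data[sent.min(data.len())..];
    }
    Ok(())
}

/// Test double that accepts at most `chunk` bytes per call.
pub struct ChunkedSink {
    pub chunk: usize,
    pub received: Vec<u8>,
}

impl PartialSend for ChunkedSink {
    fn send(&mut self, data: &[u8]) -> Result<usize, SendError> {
        let n = data.len().min(self.chunk);
        self.received.extend_from_slice(&data[..n]);
        Ok(n)
    }
}
```

The same pattern applies in reverse to recv: since maxLen is clamped, a caller that needs a fixed number of bytes has to loop until it has them or the socket closes.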
Telnet gateway launch contract
This contract is historical: the telnet-gateway demo is removed with the
kernel socket owner and the kernel SocketTerminalSession. It is retained as
the authority-model reference for any future userspace terminal host.
telnet-gateway was the terminal host for the remote connection. Its minimum
authority was:
- Manifest-forwarded TcpListenAuthority badge 23, held by init and forwarded to the gateway as the only listener-creation authority for the demo path.
- Manifest-forwarded RestrictedShellLauncher, held by init and forwarded to the gateway as the only shell process launch authority.
- Pass-through grants for the caps the current shell requires at startup: creds, sessions, audit, broker, and system_info.
- An anonymous UserSession minted through SessionManager and checked through AuthorityBroker.shellBundle("anonymous") before launch. The shell still performs password login inside capos-shell and upgrades the session after credential verification.
- A way to provide the child shell a cap named terminal whose interface id is TerminalSession, backed by the accepted TCP socket.
The gateway must not grant the child raw NetworkManager, TcpListener,
TcpListenAuthority, TcpSocket, broad ProcessSpawner, or
RestrictedShellLauncher authority. The retired implementation used the
kernel socket wrapper (TcpSocket.intoTerminalSession, now failing closed) to
produce an actual TerminalSession CapObject; the shell-facing contract
stays TerminalSession for any future userspace terminal host.
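The launch contract above amounts to a deny-list plus one required cap, which a future userspace terminal host could check before spawning the shell. The sketch below models grants as plain (cap name, interface name) string pairs; that representation, the `validate_shell_grants` name, and the decision to reject every ProcessSpawner (the text only excludes a broad one) are assumptions for illustration.

```rust
/// Interface names the gateway must never pass to the child shell.
const FORBIDDEN_INTERFACES: [&str; 6] = [
    "NetworkManager",
    "TcpListener",
    "TcpListenAuthority",
    "TcpSocket",
    "ProcessSpawner", // sketch simplification: any spawner counts as broad
    "RestrictedShellLauncher",
];

/// Checks a proposed child grant list, given as (cap name, interface name)
/// pairs. Hypothetical helper; real grants are capability table entries.
pub fn validate_shell_grants(grants: &[(&str, &str)]) -> Result<(), String> {
    for (name, interface) in grants {
        if FORBIDDEN_INTERFACES.contains(interface) {
            return Err(format!(
                "grant `{name}` exposes forbidden interface {interface}"
            ));
        }
    }
    // The shell-facing contract requires a cap named `terminal` whose
    // interface is TerminalSession.
    let has_terminal = grants
        .iter()
        .any(|(name, interface)| *name == "terminal" && *interface == "TerminalSession");
    if !has_terminal {
        return Err("missing `terminal` cap of interface TerminalSession".to_string());
    }
    Ok(())
}
```

A name-based check like this is only a lint; the actual guarantee comes from the gateway never holding or forwarding the forbidden caps in the first place.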
Phase B exit criteria
- schema/capos.capnp defined the TCP types above; the kernel implemented them as CapObjects on top of the existing smoltcp interface. Initial implementation landed in commit 7446e04 at 2026-04-25 14:48 UTC; review follow-up added timer-safe deferred completion cleanup and make qemu-network-client-harness userspace coverage for outbound sockets and listener accept. This is historical Phase B evidence; qemu-only kernel socket entry points now fail closed.
- smoltcp polling was driven from the scheduler, not a synthetic clock, so sockets could survive longer than a single early-boot burst. That runtime is retired.
- A trusted telnet-gateway boot service used TcpListener/TcpSocket, refused the bounded initial Telnet negotiation needed by normal host clients, and launched capos-shell for the accepted connection with a socket-backed TerminalSession plus the shell’s existing login/session caps. The child shell did not receive raw network, TCP listener/socket, broad spawn, scoped-listener, or restricted-shell-launcher authority. This target is retired.
- A dedicated CUE manifest (system-telnet.cue) and a make run-telnet target historically booted the above and ran a scripted host-side smoke that completed a login + one command + clean exit over telnet 127.0.0.1 2323. make run-telnet now exits with a retirement diagnostic.
Part 3: Userspace Decomposition (Phase C)
Phase C moves the NIC driver and the TCP/IP stack out of the kernel into
separate userspace processes, so the kernel is left with only
DeviceMmio / Interrupt / DMAPool dispatch and the cap-ring transport.
Phase B must be complete first — Phase C is about relocating the code that
Phase B already wrapped in capabilities, not about adding new interfaces at
the socket layer.
Sequencing relative to the cloud usable-instance milestone. The Network-Reachable Datapath Scope Decision (2026-06-02) records that the real-GCE-boot milestone’s “reachable network stack” requirement means raw-frame TX/RX over the live NIC (the polled production provider), which the billable cloudboot gate already checks. The L4 socket reachability that Phase C delivers is therefore a separate future track sequenced after that milestone, not a milestone blocker.
IPv6 Support Status And Task Lane
Current capOS L4 socket behavior has one production forward path: the Phase C
userspace service-object stack. The old qemu-only retained smoltcp runtime that
configured 10.0.2.15/24, installed a default IPv4 route through 10.0.2.2,
resolved the gateway with ARP, and proved outbound ICMPv4 plus TCP HTTP is
retired. Non-qemu production manifests no longer grant the legacy
kernel-owned socket caps; requests for kernel network_manager or
tcp_listen_authority fail at bootstrap instead of falling through to
virtio_stub.rs, and qemu-only kernel TCP/UDP socket entry points fail closed.
The userspace IPv6 lane now has local link-local / Neighbor Discovery, Router
Advertisement / SLAAC, GCE-style DHCPv6 address configuration, ICMPv6 Echo
Reply, and IPv6 TCP listener/connect proofs.
The socket-address ABI is now explicit about address family rather than
overloading a raw four-byte assumption. schema/capos.capnp defines
IpAddressFamily (unspecified / ipv4 / ipv6) and documents a length
contract on every address Data field: empty is unspecified (only where the
method allows it), 4 bytes is ipv4, and 16 bytes is ipv6. getConfig
reports the configured addressFamily and an ipv6Supported flag, so an
all-zero IPv4 config is never misread as an IPv6 state.
kernel/src/cap/network.rs decodes addresses through a family-typed
read_ip_address, accepts IPv4 on the legacy stack, and fails closed on IPv6
there with a distinct ipv6Unsupported-class error and on any other length
with a malformedAddress class – so legacy IPv4-only callers reject IPv6
explicitly instead of treating every non-four-byte value as a generic error.
capos-rt surfaces the family and IPv6-support flag on NetworkConfig. The
wire format stays source-compatible for existing 4-byte IPv4 callers. The
behavior behind the userspace-service ABI now has bounded local IPv6 routing,
diagnostics, and TCP L4 proofs; private GCE reachability and public IPv6
ingress remain unproved.
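The length contract on address Data fields can be modelled in a few lines. This is a sketch of the rule as described, not the code in kernel/src/cap/network.rs: the enum and function names are hypothetical, and treating an empty address as malformed for an IPv4-only caller is an assumption about one particular method.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IpAddressFamily {
    Unspecified,
    Ipv4,
    Ipv6,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AddressError {
    MalformedAddress,
    Ipv6Unsupported,
}

/// Length contract: empty is unspecified, 4 bytes is IPv4, 16 bytes is
/// IPv6, and any other length is malformed.
pub fn family_of(addr: &[u8]) -> Result<IpAddressFamily, AddressError> {
    match addr.len() {
        0 => Ok(IpAddressFamily::Unspecified),
        4 => Ok(IpAddressFamily::Ipv4),
        16 => Ok(IpAddressFamily::Ipv6),
        _ => Err(AddressError::MalformedAddress),
    }
}

/// An IPv4-only consumer: rejects IPv6 with a distinct error class instead
/// of folding it into the generic malformed case.
pub fn read_ipv4_only(addr: &[u8]) -> Result<[u8; 4], AddressError> {
    match family_of(addr)? {
        IpAddressFamily::Ipv4 => Ok([addr[0], addr[1], addr[2], addr[3]]),
        IpAddressFamily::Ipv6 => Err(AddressError::Ipv6Unsupported),
        // Assumption: this method does not allow an unspecified address.
        IpAddressFamily::Unspecified => Err(AddressError::MalformedAddress),
    }
}
```

Keeping the two error classes separate lets a caller distinguish "this stack cannot do IPv6" from "the bytes are not an address at all".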
The pinned userspace smoltcp dependency is version 0.13.0 in the networking
demo crates, not in kernel/Cargo.toml. capOS enables only the features each
userspace proof needs. The crate has IPv6, SLAAC, and ICMP socket features
available, and it does not provide a socket-dhcpv6 feature matching its
DHCPv4 socket. With the address-family ABI landed, remaining IPv6 work is
explicit userspace stack behavior and GCE reachability rather than kernel
feature enablement.
The protocol gap is larger than “turn on IPv6”: with the local link-local/Neighbor Discovery, Router Advertisement / SLAAC, GCE-style DHCPv6, ICMPv6 Echo Reply, and IPv6 TCP listener/connect proofs done, capOS still has no private GCE IPv6 reachability proof or GCE IPv6 firewall proof. The standards and cloud grounding are:
- RFC 4861: Neighbor Discovery, Router Solicitation/Advertisement, address resolution, and router defaults.
- RFC 4862: stateless address autoconfiguration, link-local address generation, and Duplicate Address Detection.
- RFC 4443: ICMPv6 including Echo Request / Echo Reply behavior.
- RFC 8415: DHCPv6 client and server exchanges on UDP 546/547.
- Compute Engine IPv6 configuration: dual-stack or IPv6-only subnet requirement, one /96 per interface, first /128 configured by DHCPv6 from the metadata server, default route via Router Advertisement, and link-local addresses used for Neighbor Discovery.
- Google Cloud VPC firewall rules: IPv6 rules are supported, each firewall rule uses either IPv4 or IPv6 ranges, and IPv6 ingress needs an explicit allow rule before public access is reachable.
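For readers unfamiliar with link-local address generation, one common scheme derives the interface identifier from the MAC address using the modified EUI-64 format (RFC 4291, Appendix A). Whether the capOS userspace stack uses EUI-64 or another identifier scheme is not stated here, so the sketch below is an assumption-labelled illustration with hypothetical function names.

```rust
use std::net::Ipv6Addr;

/// Builds fe80::/64 plus a modified EUI-64 interface identifier: flip the
/// universal/local bit of the first MAC octet and insert ff:fe in the middle.
pub fn link_local_from_mac(mac: [u8; 6]) -> Ipv6Addr {
    let mut octets = [0u8; 16];
    octets[0] = 0xfe;
    octets[1] = 0x80;
    octets[8] = mac[0] ^ 0x02;
    octets[9] = mac[1];
    octets[10] = mac[2];
    octets[11] = 0xff;
    octets[12] = 0xfe;
    octets[13] = mac[3];
    octets[14] = mac[4];
    octets[15] = mac[5];
    Ipv6Addr::from(octets)
}

/// Solicited-node multicast address (ff02::1:ffXX:XXXX) for `addr`, the
/// destination used by Neighbor Solicitations, including the Duplicate
/// Address Detection probe.
pub fn solicited_node_multicast(addr: &Ipv6Addr) -> Ipv6Addr {
    let a = addr.octets();
    Ipv6Addr::from([
        0xff, 0x02, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x01, 0xff, a[13], a[14], a[15],
    ])
}
```

For the MAC 52:54:00:12:34:56 this yields fe80::5054:ff:fe12:3456, with solicited-node group ff02::1:ff12:3456.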
The resulting task lane is linked from
Hardware, Boot, and Storage.
The
cloud-prod-ipv6-architecture-status-grounding
scope decision is done (2026-06-03), and the address-family ABI entry point
cloud-prod-network-address-abi-ipv6
is done (2026-06-03) as historical qemu-only kernel socket evidence. That
target is now retired after kernel socket-owner removal; current
address-family/socket behavior is covered by the Phase C userspace IPv4 and
IPv6 gates below.
The local link-local/Neighbor Discovery proof
cloud-prod-ipv6-link-local-nd-local-proof
is done (2026-06-08), proved by make run-cloud-prod-ipv6-link-local-nd.
The local Router Advertisement / SLAAC proof
cloud-prod-ipv6-ra-slaac-local-proof
is done (2026-06-08), proved by make run-cloud-prod-ipv6-ra-slaac.
The local GCE-style DHCPv6 address configuration proof
cloud-prod-ipv6-dhcpv6-gce-config-local-proof
is done (2026-06-08), proved by
make run-cloud-prod-ipv6-dhcpv6-gce-config.
The local ICMPv6 Echo Reply proof
cloud-prod-icmpv6-echo-reply-local-proof
is done (2026-06-08), proved by make run-cloud-prod-icmpv6-echo-reply.
The local IPv6 TCP L4 proof
cloud-prod-ipv6-tcp-l4-local-proof
is done (2026-06-08), proved by make run-cloud-prod-ipv6-tcp-l4.
The lane then sequences private GCE IPv6 and public IPv6 ingress/TLS policy
tasks on top of that userspace-stack substrate.
IPv6 does not block the first public GCE Web UI proof while that proof remains scoped to IPv4 DHCP, ARP, Phase C L4, private GCE reachability, and reviewed public HTTPS ingress. It becomes relevant for a later dual-stack or IPv6-only cloud proof and for public IPv6 ingress policy.
Network Usability, Resolver, And Post-smoltcp Lane
The network usability backlog is
Network Usability and Post-smoltcp.
It records the user-facing work that starts after raw frames and the first
userspace L4 proof: operator status tooling, DHCPv4 lease lifecycle, a typed
system DnsResolver cap, POSIX getaddrinfo bridging, ping/ping6 diagnostics,
socket readiness/cancel/backpressure semantics, packet trace authority, and
transport policy/status.
Current boundaries are explicit there: the first local DHCP/IPv4 configuration
proof is now done by
cloud-prod-network-stack-dhcp-ipv4-config-local-proof
and is on the first GCE Web UI critical path, while DHCP renewal/rebind/expiry,
DNS option publication, and operator-visible lease status remain follow-up
work. The local bounded ICMPv4 Echo Reply proof is also done by
cloud-prod-icmp-echo-reply-local-proof,
proved by make run-cloud-prod-icmp-echo-reply; it answers a bounded local
same-subnet ping and rejects malformed or oversized requests, but it exercises
ICMP protocol logic over an in-process QueuePhyDevice, not the real bound
NIC. The real-NIC inbound path is now also done by
cloud-prod-icmp-echo-reply-real-nic-datapath-local-proof,
proved by make run-cloud-prod-icmp-echo-reply-real-nic-datapath: a kernel-owned
responder on the legacy virtio 0.9 datapath acquires a DHCP lease over the real
NIC, then receives an inbound Echo Request over the real RX vring and transmits
an RFC 792 Echo Reply over the same NIC’s TX vring (a host peer over a QEMU
socket netdev drives the inbound stimulus, since SLIRP drops inbound
host->guest ICMP Echo). Both remain diagnostics rather than Web UI readiness;
the real-NIC proof is the local pre-spend prerequisite for the billable private
GCE ICMP proof and the same responder serves that live run. The POSIX DNS smoke is a hand-rolled
A-query over UdpSocket, not a system resolver service or typed resolver
capability. DNS, operator ping tools, IPv6, packet tracing, and advanced
transport policy are usability/completeness lanes, not first public Web UI
blockers unless a later deployment policy explicitly promotes one.
The backlog keeps smoltcp relocation (Phase C slices 7a-7c: run the selected
smoltcp build in userspace, preserve the socket contract) distinct from
transport policy/status (the capOS control plane around it). The selected
userspace stack is smoltcp 0.13.0 and now has bounded local UDP socket-cap,
TCP listener/socket-cap, sustained receive, and serve-from-userspace production
socket-cap proofs. DHCPv4, DHCPv6, IPv6 L4, and ICMPv6 are explicit protocol
proof lanes rather than ambient production readiness claims; retained qemu-only
fixtures remain separate from the production cloudboot path. The done IPv6
protocol proofs (cloud-prod-ipv6-dhcpv6-gce-config, cloud-prod-ipv6-tcp-l4)
build their smoltcp interface on an in-process HarnessPhyDevice and self-declare
metadata_only=true; the IPv6 datapath over the real bound NIC is now done by
cloud-prod-ipv6-real-nic-datapath-local-proof,
proved by make run-cloud-prod-ipv6-real-nic-datapath: a userspace smoltcp service
on a real-Nic-backed phy (the IPv4 DHCP datapath NicPhyDevice pattern) learns
the default route from a Router Advertisement, configures the GCE-shaped /128
via DHCPv6 Solicit/Advertise/Request/Reply, and completes one ICMPv6 Echo probe –
every frame over Nic.transmit/Nic.receivePoll against a host peer on a QEMU
socket netdev (SLIRP has no stateful DHCPv6 server). That proof records the
real-NIC provenance with no metadata_only/in-process disclaimer and is the local
pre-spend prerequisite for the billable private GCE IPv6 reachability proof. No current capOS
build enables socket-tcp-reno/socket-tcp-cubic, so capOS runs with
CongestionControl::None by build configuration, not as a reviewed policy
choice. The
network-transport-policy-status-decomposition
task records that audit and decomposes read-only transport status, keepalive/
timeout policy inputs, and a deferred congestion-control evaluation gated on
workload evidence.
Architecture
+--------------------------------------------------+
| Application Process |
| holds: TcpSocket cap, UdpSocket cap, ... |
| calls: connect(), send(), recv() via capnp |
+---------------------------+----------------------+
| IPC (capnp messages)
+---------------------------v----------------------+
| Network Stack Process (userspace) |
| smoltcp TCP/IP stack |
| holds: NIC cap (from driver), Timer cap |
| implements: TcpSocket, UdpSocket, Dns caps |
+---------------------------+----------------------+
| IPC (capnp messages)
+---------------------------v----------------------+
| NIC Driver Process (userspace) |
| virtio-net driver |
| holds: DeviceMmio cap, Interrupt cap, DMAPool |
| implements: Nic cap |
+---------------------------+----------------------+
| capability syscalls
+---------------------------v----------------------+
| Kernel |
| DeviceMmio cap: maps BAR into driver process |
| Interrupt cap: routes virtio IRQ to driver |
| DMAPool cap: DMA-eligible frames w/o raw PAs |
| Timer cap: provides monotonic clock |
+--------------------------------------------------+
Three separate processes, each with minimal authority:
- NIC driver: only has access to the specific virtio-net device registers, its interrupt line, and DMA-eligible frames. Implements the Nic interface.
- Network stack: holds the Nic capability from the driver. Runs smoltcp. Implements higher-level socket interfaces.
- Application: holds socket capabilities from the network stack. Cannot touch the NIC or raw packets directly.
Phase C prerequisites (beyond Phase B)
| Prerequisite | Owning gate | Why |
|---|---|---|
| Interrupt capability | DDF Task 5 + S.11.2 driver-transition gate | NIC driver receives IRQs without ambient authority |
| DeviceMmio capability | DDF Task 5 + S.11.2 driver-transition gate | NIC driver accesses device registers under bounded ownership |
| DMAPool capability | DDF Task 5 + S.11.1 invariants + S.11.2 gate | DMA-eligible frames without raw physical grants |
| Provider NIC smoke | DDF Task 6 | First end-to-end provider-driver path through reviewed userspace authority instead of the in-kernel ledger |
See DMA Isolation for the concrete invariants the three capabilities must satisfy and the Security Verification Track S.11.2 gate that unblocks moving the NIC driver out of the kernel. DDF Task 5 expands those invariants into a reviewable cap-table and ProcessSpawner manifest surface; DDF Task 6 is the first provider NIC smoke that consumes them end-to-end.
Current Phase C evidence includes the userspace virtio-net driver slices through
the clean independent Nic.transmit/Nic.receive split, the 7a local userspace
smoltcp substrate over that Nic cap, the 7b userspace UDP socket-cap layer,
the 7c-i inter-process UdpSocket proof, the 7c-ii(a) inter-process
TcpListener/TcpSocket proof, the sustained-receive TCP substrate, the
7c-ii(b) local serve-from-userspace production socket-cap proof, and retirement
of the non-qemu legacy kernel socket grant path. The 7c-ii(b) proof starts
the userspace network-stack process as the non-qemu cloudboot init process,
spawns an application client with only Console plus a userspace-served
TcpListenAuthority, and completes one local hostfwd TCP request/response
through served TcpListener/TcpSocket caps. It is still narrower than the
exit criteria below: the proof process keeps the existing
DeviceMmio/DMAPool/Interrupt bring-up caps in-process until the future
driver-service split, the long-lived service shape is still future work, and the
selected GCE Web UI milestone now consumes the done DHCP/IPv4 configuration
proof while still needing the local remote-session Web UI L4 proof, private GCE
reachability, and the tracked Web UI hardening gates. The legacy kernel
cap/network.rs / virtio_stub.rs socket
route is fixture/negative-path cleanup territory, not the architecture to
extend.
Phase C exit criteria
- NIC driver runs in its own userspace process, holding only DeviceMmio, Interrupt, and DMAPool caps.
- Network stack runs in a second userspace process, holding only the Nic cap from the driver and a Timer cap.
- A successor socket-backed terminal or Web UI proof is rebuilt on the userspace network stack; the Phase B Telnet fixture is retired after kernel socket-owner removal.
- The kernel contains no smoltcp dependency and no virtio-net code on the hot path.
Lower-layer capability schema (drafts — used by Phase C)
Phase B does not expose these to userspace; Phase C does. Timer is already
implemented (see schema/capos.capnp).
Phase C track opened (2026-06-02). The Phase C Userspace NIC Driver Relocation design adopts this inline-Data frame ABI as-is (a DmaBuffer-handle zero-copy variant was considered and rejected to keep the change small; the frame stays in a kernel-owned bounce buffer the polled provider already proved). The methods carry the capOS result/reason/sideEffect evidence triple, and receive also reports the observed EtherType. See that doc for the cap-surface gap (no pending security ruling: the writable common-config window extends the accepted notify-doorbell selected-write discipline) and the bounded slice chain.
Slice 1 landed (2026-06-02). The unimplemented Nic interface below is now in schema/capos.capnp so the later coupled-TX/RX slices (3-4) extend it rather than introduce it; no CapObject implements it yet. Slice 1 (cloud-prod-nic-driver-userspace-features-ok-local-proof) also relocated the virtio device handshake to FEATURES_OK into a userspace driver shim over a writable selected-write common-config DeviceMmio window (the four handshake registers admitted on DeviceMmio.write32, queue-address writes fail closed); proof make run-cloud-prod-nic-driver-userspace-features-ok.
The landed Nic schema (inline Data + the capOS evidence triple):
interface Nic {
transmit @0 (frame :Data)
-> (result :Text, reason :Text, sideEffect :Text);
receive @1 ()
-> (frame :Data, observedEthertype :UInt16,
result :Text, reason :Text, sideEffect :Text);
macAddress @2 () -> (addr :Data, result :Text, reason :Text, sideEffect :Text);
linkStatus @3 () -> (up :Bool, result :Text, reason :Text, sideEffect :Text);
}
The driver relocation reuses the production DeviceMmio cap (a read-only BAR
window with selected writes) and Interrupt cap (schema/capos.capnp) rather
than the simplified map/wait sketches earlier drafts of this section used.
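The observedEthertype result on Nic.receive corresponds to the two bytes that follow the destination and source MAC addresses in an Ethernet II frame. A minimal sketch of that extraction follows; the function name is hypothetical, and the sketch assumes untagged frames (an 802.1Q VLAN tag would appear as EtherType 0x8100 here).

```rust
pub const ETHERTYPE_IPV4: u16 = 0x0800;
pub const ETHERTYPE_ARP: u16 = 0x0806;
pub const ETHERTYPE_IPV6: u16 = 0x86dd;

/// Reads the EtherType from an Ethernet II frame: 6 bytes of destination
/// MAC, 6 bytes of source MAC, then a 16-bit big-endian type field.
/// Returns `None` for frames too short to contain the header.
pub fn observed_ethertype(frame: &[u8]) -> Option<u16> {
    let bytes = frame.get(12..14)?;
    Some(u16::from_be_bytes([bytes[0], bytes[1]]))
}
```

A network-stack consumer can use the value to sanity-check that the frame it parsed matches what the driver reported, for example rejecting a frame whose reported type is ETHERTYPE_ARP but whose bytes decode as IPv6.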
Part 4: Cross-cutting
Userspace language runtimes that need sockets
Userspace language runtimes that map their stdlib socket APIs onto capOS
capabilities consume the same TcpSocket/UdpSocket surface this proposal
defines, so the Phase A-B kernel-resident state above is what their socket
imports currently fail closed against:
- The POSIX adapter (libcapos-posix/) already maps socket(AF_INET, SOCK_DGRAM, 0) / sendto / recvfrom / close onto the Phase B UdpSocket cap for the Phase P1.2 Phase B DNS resolver smoke; see Userspace Binaries and POSIX Adapter.
- WASI Preview 1 sock_send / sock_recv route through the WASI host adapter on top of the same caps. Phase W.6 (sockets) remains blocked on socket authority surfacing through the wasm-host CapSet; the W.2 ERRNO_NOSYS refusal harness in Language Support Status and Plans (WASI / WebAssembly row) is the current evidence that no socket authority leaks before that gate.
Neither track changes the trust-boundary debt: socket-using userspace runtimes still depend on the kernel-resident smoltcp stack until Phase C relocates it.
TLS Layering
TLS does not live in this proposal: the TcpSocket here is the
bottom of the transport stack; a TlsSocket wraps it and is
configured from the certificate, trust-store, OCSP, and verifier caps
defined in
Certificates and TLS.
Keys consumed by TLS come from
Cryptography and Key Management.
Draft shape (tracked in the certificates proposal):
interface TlsSocket {
# Client handshake: wrap an outbound TCP socket with a client config.
connect @0 (tcp :TcpSocket, config :TlsClientConfig) -> ();
# Server handshake: accept on a TCP socket with a server config.
accept @1 (tcp :TcpSocket, config :TlsServerConfig) -> ();
send @2 (data :Data) -> (bytesSent :UInt32);
recv @3 (maxLen :UInt32) -> (data :Data);
close @4 () -> ();
peerCertificate @5 () -> (chain :CertificateChain);
alpnSelected @6 () -> (protocol :Text);
}
Open Questions
- DMA memory management. Dedicated DmaAllocator capability vs extending FrameAllocator with allocDma?
- Socket readiness model. Phase B uses blocking accept/recv calls for the demo. The long-term interface still needs a readiness/poll or cancellation shape for multiplexed services.
- Buffer ownership. Copy into IPC message vs shared memory vs capability lending?
References
Crates
- smoltcp: no_std TCP/IP stack
- virtio-drivers: no_std virtio drivers (rCore project)
Specs
- virtio 1.2 spec — Section 5.1 covers network device
- OSDev Wiki: PCI, Virtio
Prior Art
- rCore — virtio-drivers + smoltcp
- Redox smolnetd — microkernel userspace net stack
- Fuchsia Netstack3 — capability-oriented, userspace, Rust
- Hermit — unikernel with smoltcp + virtio-net