Proposal: Userspace TCP/IP Networking

How capOS gets from “kernel boots” to “userspace process opens a TCP connection.”

The host-local Telnet flow on 127.0.0.1:2323 described in Part 2 was a plaintext, loopback-only research demo, not a shippable Telnet service. It exercised the TerminalSession/SessionManager/AuthorityBroker/RestrictedShellLauncher boundary over a real TCP socket on the path toward the SSH Shell Gateway (see SSH Shell Gateway). That target is now retired because it depended on the removed qemu-only kernel TCP listener. Non-loopback exposure, production credential handling, and any treatment of Telnet as a long-lived service remain out of scope.

Historical trust-boundary debt: Phase A/B kept the smoltcp stack, per-port TCP listener and accepted-socket capability state, UDP socket cap state, line discipline byte handler, and Telnet IAC filter inside the kernel. Phase C has now retired that kernel owner: the kernel no longer depends on smoltcp, the qemu-only TCP/UDP socket entry points fail closed, and the run-network-client, run-tcp-listen-authority, run-telnet, and run-posix-dns-smoke fixtures exit with retirement diagnostics. The forward path is the userspace network stack over DeviceMmio/DMAPool/Interrupt authority and typed NIC/socket capabilities. New protocol logic belongs in that Phase C userspace stack.

The Device Driver Foundation now has a bounded provider-consumer proof for one selected virtio-net TX route: a manifest-granted service can compose DMAPool, DeviceMmio, and Interrupt authority, validate the selected bounce-buffer descriptor path, publish a bounded provider-owned queue entry, ring the selected notify doorbell after policy gates, and consume the matching used-ring completion through a route-scoped tx_interrupt.wait event. That is proof coverage for a selected manager-owned route, not Phase C completion. It does not grant full NIC ownership, arbitrary MMIO doorbells, hardware ack/mask/unmask ownership, direct DMA, IOMMU programming, broader completion queue ownership, provider storage/NIC drivers, cloud NIC support, or production networking readiness.

This document has four parts:

  • a historical kernel-internal smoke test that proved virtio-net and smoltcp,
  • historical in-kernel capability interfaces for TCP sockets and the Telnet Shell Demo,
  • userspace decomposition after driver authority capabilities exist, and
  • cross-cutting TLS and open design questions.

Part 1: Kernel-Internal Networking (Phase A)

Prove that capOS can send and receive TCP/IP traffic. Everything runs in-kernel — no IPC, no capability syscalls, no multiple processes needed.

What’s Needed

  1. PCI enumeration — scan config space, find virtio-net device. Uses the standalone PCI/PCIe subsystem described in Cloud Deployment Phase 4 (~200 lines of glue code on top of the shared PCI infrastructure)
  2. virtio-net driver — init virtqueues, send/receive raw Ethernet frames. Use virtio-drivers crate or implement manually (~600-800 lines)
  3. Timer — PIT or LAPIC timer for smoltcp’s poll loop (retransmit timeouts, Instant::now() support). Not a full scheduler — just a monotonic clock (~50-100 lines)
  4. smoltcp integration — implement phy::Device trait over the in-kernel driver, create an Interface with static IP, ICMP ping, then TCP
  5. QEMU flags — add -netdev user,id=n0 -device virtio-net-pci,netdev=n0 to the Makefile

Current implementation status: PCI enumeration, make run-net, modern virtio PCI transport capability discovery, feature negotiation, RX/TX split-virtqueue initialization, descriptor-accounting guard evidence, ARP resolution, and ICMP echo validation are implemented as lower-layer QEMU fixture evidence. The QEMU default device currently appears as transitional 1af4:1000 but exposes standard modern vendor capabilities; capOS accepts it only after finding bounded MMIO common, notify, ISR, and device-specific config regions. The kernel negotiates VIRTIO_F_VERSION_1, VIRTIO_NET_F_MRG_RXBUF, and MAC when safe, allocates kernel-owned DMA pages for the RX/TX queue metadata plus packet buffers, sets DRIVER_OK, submits device-valid TX descriptors, posts RX descriptors, resolves the QEMU user-mode gateway 10.0.2.2 with ARP from static guest address 10.0.2.15, then validates an IPv4 ICMP echo reply from the gateway, including the reply checksums. The former kernel smoltcp adapter, TCP HTTP smoke, and scheduler-polled socket runtime are retired; the make qemu-net-harness path now asserts the lower-layer QEMU fixture evidence instead of a host-backed kernel TCP proof. Current TCP/UDP socket proof lives in the Phase C userspace network-stack gates, including make run-cloud-prod-userspace-network-stack-smoltcp.
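The checksum validation mentioned above can be illustrated with the standard Internet checksum (RFC 1071), which both the IPv4 header and the ICMP message use. This is a minimal sketch, not the capOS fixture code: the function names internet_checksum and icmp_checksum_ok are hypothetical, and the 8-byte minimum is simply the size of an ICMP echo header.

```rust
/// RFC 1071 Internet checksum over `data`: the one's-complement of the
/// one's-complement sum of big-endian 16-bit words. An odd trailing byte
/// is padded with a zero low byte.
/// (Hypothetical helper name; not taken from the capOS tree.)
pub fn internet_checksum(data: &[u8]) -> u16 {
    let mut sum: u64 = 0;
    let mut chunks = data.chunks_exact(2);
    for pair in &mut chunks {
        sum += u64::from(u16::from_be_bytes([pair[0], pair[1]]));
    }
    if let [last] = chunks.remainder() {
        sum += u64::from(u16::from_be_bytes([*last, 0]));
    }
    // Fold any carries back into the low 16 bits.
    while (sum >> 16) != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}

/// A received ICMP message is intact when the checksum computed over the
/// whole message, checksum field included, comes out as zero.
/// (Hypothetical helper name; the length floor is the echo header size.)
pub fn icmp_checksum_ok(icmp_message: &[u8]) -> bool {
    icmp_message.len() >= 8 && internet_checksum(icmp_message) == 0
}
```

A sender fills the checksum field with zero, computes internet_checksum over the message, and stores the result big-endian in bytes 2..4; a receiver then only needs the zero test.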

Milestones

  • Ping: ICMP echo to QEMU gateway (10.0.2.2 with default user-mode net). Achieved by commit b56a5c1 at 2026-04-24 15:37 UTC.
  • HTTP: TCP connection to a host-side server, send GET, receive response. Achieved by commit a4f1722 at 2026-04-24 16:47 UTC.

Estimated Scope

~1000-1500 lines of new kernel code. ~200 more for TCP on top of ping.

Crate Dependencies

| Crate | Purpose | no_std |
|---|---|---|
| smoltcp | TCP/IP stack | yes (features: medium-ethernet, proto-ipv4, socket-tcp) |
| virtio-drivers | virtio device abstraction | yes (optional; can implement manually) |

Timer Source Decision

Historical Phase B resolution: the scheduler timer advanced the monotonic TICK_COUNT (AtomicU64 in kernel/src/arch/x86_64/context.rs), and the retained kernel smoltcp runtime used that clock instead of a bounded synthetic 10 ms-per-poll clock. Phase C cleanup removed that retained runtime; scheduler ticks no longer poll kernel smoltcp.

Intermediate Tickless Bridge

The retained smoltcp runtime described below is retired. The bridge rules are archival context for why scheduler-polled kernel networking was not acceptable as a long-term tickless/nohz design. Future socket progress belongs in the userspace stack or an IRQ/deadline-driven device path, not in scheduler polling.

trait NetworkPollClock {
    // Earliest time at which the stack needs to be polled again, if any.
    fn next_poll_deadline_ns(now_ns: u64) -> Option<u64>;
    // Makes network progress, bounded by budget_ns of work.
    fn poll_until_budget(now_ns: u64, budget_ns: u64) -> PollResult;
}

Historical bridge rules:

  • a retained smoltcp runtime would have needed to expose NetworkPollClock before active networking could coexist with tickless idle;
  • the scheduler would have included next_poll_deadline_ns in earliest_global_deadline();
  • poll_until_budget would have been the only scheduler/idle-exit network progress path;
  • the budget would have bounded work done outside ordinary process execution;
  • absent this bridge, active networking would have forced periodic tick;
  • SQPOLL/nohz isolated CPUs would not have run retained network scheduler polling.
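The deadline rule in the list above can be sketched as follows. This is an illustration of the idea only: earliest_global_deadline is named in the source, but its signature here is an assumption, and idle_sleep_ns is a hypothetical helper.

```rust
/// Earliest of the scheduler's own timer deadline and the network stack's
/// next poll deadline. `None` means neither source needs a timed wakeup.
/// (Assumed signature; the source only names the function.)
pub fn earliest_global_deadline(
    timer_deadline_ns: Option<u64>,
    net_poll_deadline_ns: Option<u64>,
) -> Option<u64> {
    match (timer_deadline_ns, net_poll_deadline_ns) {
        (Some(a), Some(b)) => Some(a.min(b)),
        (a, b) => a.or(b),
    }
}

/// How long a tickless idle CPU may sleep before the next deadline.
/// `None` means "sleep until an interrupt"; an overdue deadline yields 0.
/// (Hypothetical helper, not a capOS function.)
pub fn idle_sleep_ns(now_ns: u64, deadline_ns: Option<u64>) -> Option<u64> {
    deadline_ns.map(|deadline| deadline.saturating_sub(now_ns))
}
```

Under this shape, a stack with no pending retransmit or timeout reports None and does not force a periodic tick, which is the property the bridge rules were after.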

QEMU Network Config

| Config | Use case |
|---|---|
| -netdev user,id=n0 -device virtio-net-pci,netdev=n0 | Default: NAT, guest reaches host |
| -netdev user,id=n0,hostfwd=tcp:127.0.0.1:2323-:23 -device virtio-net-pci,netdev=n0 | Historical host-local TCP forwarding for the retired Telnet Shell Demo |

Part 2: Capability Interfaces — In-Kernel (Phase B)

Phase B turns the Phase A smoke path into first-class TCP capabilities without moving any code out of the kernel. The NetworkManager, TcpListener, and TcpSocket objects become kernel-side CapObjects that user processes invoke through the existing capability ring. The in-kernel smoltcp stack stays where it is; what changes is that it is reached over capability dispatch instead of a hard-coded boot-time call. UDP and raw Nic exposure are not part of this milestone.

Phase B is the first point where a userspace process — the native shell, a boot-package demo, a language runtime — can open a TCP socket. It is also the first point where a visible networking milestone exists at the capability level.

Visible Phase B milestone — Telnet Shell Demo (historical; delivered and later retired with the kernel socket owner). Boot capOS in QEMU with -netdev user,id=n0,hostfwd=tcp:127.0.0.1:2323-:23 -device virtio-net-pci,netdev=n0. Init starts a dedicated telnet-gateway service with scoped port-23 listen authority and restricted shell-launch authority, then gives the child shell only the exact grants described below. On accept, the gateway refuses a bounded initial Telnet option negotiation burst and acts as the terminal host for that connection. It exposes a socket-backed TerminalSession to capos-shell, not a raw TcpSocket, ByteStream, or StdIO replacement for the shell’s existing terminal boundary. From the host:

$ telnet 127.0.0.1 2323
capos login: <anon>
capos$ help
capos$ exit
Connection closed by foreign host.

The same boot proves the shell does not know or care whether its interactive terminal is UART, framebuffer, or TCP-backed Telnet — the TerminalSession provider is interchangeable while the shell-facing authority stays the same. It also exercises the full TCP listener/accept path, not just the outbound connect path used by the Phase A HTTP smoke.

telnet (RFC 854) is deliberate demo wiring: plaintext, no crypto, no authentication of its own. The QEMU target binds the host forward to 127.0.0.1:2323 only and forwards to guest port 23, so the proof is a host-local development demo rather than a remote-access feature. It is not a production access path and will be replaced by the SSH gateway described in SSH Shell Gateway once host-key, user-key, account, audit, and persistence prerequisites are implementable. The value is that Telnet is the cheapest forcing function for a server-side TCP capability and for a socket-backed terminal host. The shell still requires credential verification through the existing login flow (Boot to Shell); the Telnet transport only replaces the physical UART, not the login policy.
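For readers unfamiliar with Telnet option negotiation, a bounded refusal filter might look like the sketch below. It is not the retired gateway's code. The command byte values come from RFC 854; the function name refuse_options and the MAX_REFUSALS bound are hypothetical, and subnegotiation (IAC SB ... IAC SE) and sequences split across reads are deliberately left out.

```rust
// Telnet command bytes from RFC 854.
pub const IAC: u8 = 255;
pub const DONT: u8 = 254;
pub const DO: u8 = 253;
pub const WONT: u8 = 252;
pub const WILL: u8 = 251;

/// Hypothetical cap on how many option requests get an answer.
pub const MAX_REFUSALS: usize = 16;

/// Splits one input buffer into (payload bytes, refusal replies).
/// DO x is answered with WONT x and WILL x with DONT x, up to the cap.
pub fn refuse_options(input: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let mut payload = Vec::new();
    let mut replies = Vec::new();
    let mut refusals = 0;
    let mut i = 0;
    while i < input.len() {
        if input[i] != IAC {
            payload.push(input[i]);
            i += 1;
            continue;
        }
        let cmd = input.get(i + 1).copied();
        let opt = input.get(i + 2).copied();
        match (cmd, opt) {
            // IAC IAC is an escaped 0xFF data byte.
            (Some(IAC), _) => {
                payload.push(IAC);
                i += 2;
            }
            (Some(c @ (DO | WILL)), Some(option)) => {
                if refusals < MAX_REFUSALS {
                    let refusal = if c == DO { WONT } else { DONT };
                    replies.extend_from_slice(&[IAC, refusal, option]);
                    refusals += 1;
                }
                i += 3;
            }
            // The peer confirming a disabled option needs no reply.
            (Some(DONT | WONT), Some(_)) => i += 3,
            // Truncated sequence: stop; a real host would buffer it.
            (Some(DO | WILL | DONT | WONT), None) | (None, _) => break,
            // Other two-byte commands are dropped in this sketch.
            (Some(_), _) => i += 2,
        }
    }
    (payload, replies)
}
```

Past the cap this sketch silently stops answering; a real terminal host could equally choose to close the connection.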

Phase B prerequisites

| Prerequisite | State | Why |
|---|---|---|
| Capability syscalls | Stage 4 done (sync) | All Nic/socket access goes through the ring |
| Scheduling + preemption | Stage 5 core done | Socket ops block/wake via the scheduler |
| IPC + capability transfer | Stage 6 3.6 done | Listener hands socket caps to the accepting process |
| Timer capability | 7.0.0 done | Historical smoltcp poll clock and socket timeouts; the kernel smoltcp runtime is now retired |
| Scheduler-driven smoltcp poll | retired | The retained smoltcp runtime was polled from scheduler ticks on real TICK_COUNT; Phase C cleanup removed it |
| TCP kernel CapObjects | retired | NetworkManager, TcpListener, and TcpSocket previously wrapped the retained smoltcp runtime; qemu-only kernel socket entry points now fail closed |
| Socket-backed TerminalSession handoff | retired | TcpSocket.intoTerminalSession previously consumed a connected socket and returned a move-only TerminalSession cap; rebuild this proof on the userspace network stack before using it as validation |
| Shell launch bundle handoff | retired | telnet-gateway previously consumed an accepted TcpSocket into a move-only TerminalSession; the gateway demos are removed and remote-shell coverage lives in the in-guest login smokes (run-login, run-default-web-ui) |

Phase B does not depend on DeviceMmio, Interrupt, or DMAPool — the NIC driver stays in the kernel. Security Verification Track S.11.2 is a Phase C prerequisite, not a Phase B one.

Phase B schema (kernel CapObjects)

These interfaces are now defined in the canonical shared schema (schema/capos.capnp). The current build pipeline watches and generates bindings for schema/capos.capnp; additional networking schema files remain unnecessary for Phase B.

interface NetworkManager {
    getConfig         @0 () -> (addr :Data, netmask :Data, gateway :Data);
    createTcpListener @1 (port :UInt16) -> (listenerIndex :UInt16);
    connectTcp        @2 (addr :Data, port :UInt16) -> (socketIndex :UInt16);
    # POSIX adapter Phase P1.2 Phase A: bind a UDP socket; the created
    # cap is delivered as a transferred result cap.
    createUdpSocket   @3 (localAddr :Data, localPort :UInt16) -> (socketIndex :UInt16);
}

interface TcpListener {
    accept @0 () -> (socketIndex :UInt16, peerAddr :Data, peerPort :UInt16);
    close  @1 () -> ();
}

interface TcpSocket {
    send                @0 (data :Data) -> (bytesSent :UInt32);
    recv                @1 (maxLen :UInt32) -> (data :Data);
    close               @2 () -> ();
    intoTerminalSession @3 () -> (terminalIndex :UInt16);  # retired; fails closed
}

interface UdpSocket {
    sendTo   @0 (addr :Data, port :UInt16, data :Data) -> (bytesSent :UInt32);
    recvFrom @1 (maxLen :UInt32) -> (addr :Data, port :UInt16, data :Data);
    close    @2 () -> ();
}

Nic stays a separate lower-layer cap (schema shown below) and remains kernel-internal in Phase B. UdpSocket landed for the POSIX adapter Phase P1.2 Phase A DNS path: the kernel implements it on top of the same retained smoltcp runtime, and userspace acquires it through NetworkManager.createUdpSocket. It is not part of the Telnet Shell Demo contract.

The ring transport cannot return direct Cap’n Proto capability fields, so capability-producing methods return result-cap indices in the serialized result and append CapTransferResult records after the message bytes. Runtime clients adopt those result caps by index.

accept and recv are blocking capability calls for the Phase B demo: they complete when a connection or received bytes are available, when the socket is closed, or when the caller’s cap_enter timeout/cancellation path fires. recv(maxLen) clamps to the kernel/ring result-buffer limits, and send may return a partial byte count. A readiness/poll interface can be added later without being required for the first remote shell proof.
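Because send may accept only part of a buffer, callers need a loop to deliver a whole message. The sketch below shows one way to write it. The SocketSend trait, SendError, and the ChunkedSink test double are hypothetical stand-ins for a client-side binding of TcpSocket.send, not the capos-rt API.

```rust
/// Hypothetical client-side view of `TcpSocket.send`.
pub trait SocketSend {
    /// Returns how many leading bytes of `data` were accepted, which may
    /// be fewer than `data.len()`.
    fn send(&mut self, data: &[u8]) -> Result<usize, SendError>;
}

#[derive(Debug, PartialEq, Eq)]
pub enum SendError {
    Closed,
}

/// Keeps calling `send` until every byte is accepted. A zero-byte result
/// is treated as a closed socket so the loop cannot spin forever.
pub fn send_all<S: SocketSend>(sock: &mut S, mut data: &[u8]) -> Result<(), SendError> {
    while !data.is_empty() {
        let accepted = sock.send(data)?;
        if accepted == 0 {
            return Err(SendError::Closed);
        }
        data = &data[accepted.min(data.len())..];
    }
    Ok(())
}

/// Test double that accepts at most `chunk` bytes per call.
pub struct ChunkedSink {
    pub chunk: usize,
    pub received: Vec<u8>,
    pub calls: usize,
}

impl SocketSend for ChunkedSink {
    fn send(&mut self, data: &[u8]) -> Result<usize, SendError> {
        let n = data.len().min(self.chunk);
        self.received.extend_from_slice(&data[..n]);
        self.calls += 1;
        Ok(n)
    }
}
```

The same pattern applies in reverse to recv: since recv(maxLen) is clamped, a reader that needs an exact byte count has to loop until it has accumulated that many bytes or the socket closes.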

Telnet gateway launch contract

This contract is historical: the telnet-gateway demo is removed with the kernel socket owner and the kernel SocketTerminalSession. It is retained as the authority-model reference for any future userspace terminal host. telnet-gateway was the terminal host for the remote connection. Its minimum authority was:

  • Manifest-forwarded TcpListenAuthority badge 23, held by init and forwarded to the gateway as the only listener-creation authority for the demo path.
  • Manifest-forwarded RestrictedShellLauncher, held by init and forwarded to the gateway as the only shell process launch authority.
  • Pass-through grants for the caps the current shell requires at startup: creds, sessions, audit, broker, and system_info.
  • An anonymous UserSession minted through SessionManager and checked through AuthorityBroker.shellBundle("anonymous") before launch. The shell still performs password login inside capos-shell and upgrades the session after credential verification.
  • A way to provide the child shell a cap named terminal whose interface id is TerminalSession, backed by the accepted TCP socket.

The gateway must not grant the child raw NetworkManager, TcpListener, TcpListenAuthority, TcpSocket, broad ProcessSpawner, or RestrictedShellLauncher authority. The retired implementation used the kernel socket wrapper (TcpSocket.intoTerminalSession, now failing closed) to produce an actual TerminalSession CapObject; the shell-facing contract stays TerminalSession for any future userspace terminal host.
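A future userspace terminal host could enforce this rule with a check over the child's grant list before launch. The sketch below is purely illustrative: it models a grant as a (cap name, interface name) pair, and check_child_grants and GrantError are hypothetical names. It also rejects every ProcessSpawner grant, which is stricter than the "broad ProcessSpawner" wording above.

```rust
/// Interface names that must never reach the child shell, per the
/// contract above. Matching is by interface name only in this sketch.
pub const FORBIDDEN_INTERFACES: [&str; 6] = [
    "NetworkManager",
    "TcpListener",
    "TcpListenAuthority",
    "TcpSocket",
    "ProcessSpawner",
    "RestrictedShellLauncher",
];

#[derive(Debug, PartialEq, Eq)]
pub enum GrantError {
    /// The grant list contains a forbidden interface.
    Forbidden(String),
    /// No cap named `terminal` with interface `TerminalSession` is present.
    MissingTerminal,
}

/// Validates a child grant list given as (cap name, interface name) pairs.
/// (Hypothetical helper; the pair representation is an assumption.)
pub fn check_child_grants(grants: &[(&str, &str)]) -> Result<(), GrantError> {
    for &(_, interface) in grants {
        if FORBIDDEN_INTERFACES.iter().any(|f| *f == interface) {
            return Err(GrantError::Forbidden(interface.to_string()));
        }
    }
    let has_terminal = grants
        .iter()
        .any(|&(name, interface)| name == "terminal" && interface == "TerminalSession");
    if has_terminal {
        Ok(())
    } else {
        Err(GrantError::MissingTerminal)
    }
}
```

A deny-list like this is only a backstop; the stronger property is that the gateway constructs the grant list from an explicit allow-list in the first place.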

Phase B exit criteria

  • schema/capos.capnp defined the TCP types above; kernel implemented them as CapObjects on top of the existing smoltcp interface. Initial implementation landed in commit 7446e04 at 2026-04-25 14:48 UTC; review follow-up added timer-safe deferred completion cleanup and make qemu-network-client-harness userspace coverage for outbound sockets and listener accept. This is historical Phase B evidence; qemu-only kernel socket entry points now fail closed.
  • smoltcp polling was driven from the scheduler, not a synthetic clock, so sockets could survive longer than a single early-boot burst. That runtime is retired.
  • A trusted telnet-gateway boot service used TcpListener/TcpSocket, refused the bounded initial Telnet negotiation needed by normal host clients, and launched capos-shell for the accepted connection with a socket-backed TerminalSession plus the shell’s existing login/session caps. The child shell did not receive raw network, TCP listener/socket, broad spawn, scoped-listener, or restricted-shell-launcher authority. This target is retired.
  • A dedicated CUE manifest (system-telnet.cue) and a make run-telnet target historically booted the above and ran a scripted host-side smoke that completed a login + one command + clean exit over telnet 127.0.0.1 2323. make run-telnet now exits with a retirement diagnostic.

Part 3: Userspace Decomposition (Phase C)

Phase C moves the NIC driver and the TCP/IP stack out of the kernel into separate userspace processes, so the kernel is left with only DeviceMmio / Interrupt / DMAPool dispatch and the cap-ring transport. Phase B must be complete first — Phase C is about relocating the code that Phase B already wrapped in capabilities, not about adding new interfaces at the socket layer.

Sequencing relative to the cloud usable-instance milestone. The Network-Reachable Datapath Scope Decision (2026-06-02) records that the real-GCE-boot milestone’s “reachable network stack” requirement means raw-frame TX/RX over the live NIC (the polled production provider), which the billable cloudboot gate already checks. The L4 socket reachability that Phase C delivers is therefore a separate future track sequenced after that milestone, not a milestone blocker.

IPv6 Support Status And Task Lane

Current capOS L4 socket behavior has one production forward path: the Phase C userspace service-object stack. The old qemu-only retained smoltcp runtime that configured 10.0.2.15/24, installed a default IPv4 route through 10.0.2.2, resolved the gateway with ARP, and proved outbound ICMPv4 plus TCP HTTP is retired. Non-qemu production manifests no longer grant the legacy kernel-owned socket caps; requests for kernel network_manager or tcp_listen_authority fail at bootstrap instead of falling through to virtio_stub.rs, and qemu-only kernel TCP/UDP socket entry points fail closed. The userspace IPv6 lane now has local link-local / Neighbor Discovery, Router Advertisement / SLAAC, GCE-style DHCPv6 address configuration, ICMPv6 Echo Reply, and IPv6 TCP listener/connect proofs.

The socket-address ABI is now explicit about address family rather than overloading a raw four-byte assumption. schema/capos.capnp defines IpAddressFamily (unspecified / ipv4 / ipv6) and documents a length contract on every address Data field: empty is unspecified (only where the method allows it), 4 bytes is ipv4, and 16 bytes is ipv6. getConfig reports the configured addressFamily and an ipv6Supported flag, so an all-zero IPv4 config is never misread as an IPv6 state. kernel/src/cap/network.rs decodes addresses through a family-typed read_ip_address, accepts IPv4 on the legacy stack, and fails closed on IPv6 there with a distinct ipv6Unsupported-class error and on any other length with a malformedAddress class – so legacy IPv4-only callers reject IPv6 explicitly instead of treating every non-four-byte value as a generic error. capos-rt surfaces the family and IPv6-support flag on NetworkConfig. The wire format stays source-compatible for existing 4-byte IPv4 callers. The behavior behind the userspace-service ABI now has bounded local IPv6 routing, diagnostics, and TCP L4 proofs; private GCE reachability and public IPv6 ingress remain unproved.
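The length contract can be made concrete with a small classifier. This is a sketch under stated assumptions, not the kernel's read_ip_address: classify_address and legacy_ipv4_only are hypothetical names, the enum only mirrors the schema's IpAddressFamily, and mapping a disallowed empty address to the malformed class is a guess the source does not confirm.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IpAddressFamily {
    Unspecified,
    Ipv4,
    Ipv6,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AddressError {
    MalformedAddress,
    Ipv6Unsupported,
}

/// Classifies an address `Data` field by length: empty is unspecified
/// (only where allowed), 4 bytes is IPv4, 16 bytes is IPv6.
pub fn classify_address(
    bytes: &[u8],
    allow_unspecified: bool,
) -> Result<IpAddressFamily, AddressError> {
    match bytes.len() {
        0 if allow_unspecified => Ok(IpAddressFamily::Unspecified),
        4 => Ok(IpAddressFamily::Ipv4),
        16 => Ok(IpAddressFamily::Ipv6),
        // Assumption: a disallowed empty address is reported as malformed.
        _ => Err(AddressError::MalformedAddress),
    }
}

/// IPv4-only view: well-formed IPv6 is rejected with its own error class
/// instead of being lumped in with malformed input.
pub fn legacy_ipv4_only(bytes: &[u8]) -> Result<[u8; 4], AddressError> {
    match classify_address(bytes, false)? {
        IpAddressFamily::Ipv4 => Ok([bytes[0], bytes[1], bytes[2], bytes[3]]),
        IpAddressFamily::Ipv6 => Err(AddressError::Ipv6Unsupported),
        IpAddressFamily::Unspecified => Err(AddressError::MalformedAddress),
    }
}
```

Keeping the two error classes distinct is what lets an IPv4-only caller tell "valid address, unsupported family" apart from "garbage length".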

The pinned userspace smoltcp dependency is version 0.13.0 in the networking demo crates, not in kernel/Cargo.toml. capOS enables only the features each userspace proof needs. The crate has IPv6, SLAAC, and ICMP socket features available, and it does not provide a socket-dhcpv6 feature matching its DHCPv4 socket. With the address-family ABI landed, remaining IPv6 work is explicit userspace stack behavior and GCE reachability rather than kernel feature enablement.

The protocol gap is larger than “turn on IPv6”: with the local link-local/Neighbor Discovery, Router Advertisement / SLAAC, GCE-style DHCPv6, ICMPv6 Echo Reply, and IPv6 TCP listener/connect proofs done, capOS still has no private GCE IPv6 reachability proof or GCE IPv6 firewall proof. The standards and cloud grounding are:

  • RFC 4861: Neighbor Discovery, Router Solicitation/Advertisement, address resolution, and router defaults.
  • RFC 4862: stateless address autoconfiguration, link-local address generation, and Duplicate Address Detection.
  • RFC 4443: ICMPv6 including Echo Request / Echo Reply behavior.
  • RFC 8415: DHCPv6 client and server exchanges on UDP 546/547.
  • Compute Engine IPv6 configuration: dual-stack or IPv6-only subnet requirement, one /96 per interface, first /128 configured by DHCPv6 from the metadata server, default route via route advertisement, and link-local addresses used for Neighbor Discovery.
  • Google Cloud VPC firewall rules: IPv6 rules are supported, each firewall rule uses either IPv4 or IPv6 ranges, and IPv6 ingress needs an explicit allow rule before public access is reachable.

The resulting task lane is linked from Hardware, Boot, and Storage. The cloud-prod-ipv6-architecture-status-grounding scope decision is done (2026-06-03), and the address-family ABI entry point cloud-prod-network-address-abi-ipv6 is done (2026-06-03) as historical qemu-only kernel socket evidence. That target is now retired after kernel socket-owner removal; current address-family/socket behavior is covered by the Phase C userspace IPv4 and IPv6 gates below. The local link-local/Neighbor Discovery proof cloud-prod-ipv6-link-local-nd-local-proof is done (2026-06-08), proved by make run-cloud-prod-ipv6-link-local-nd. The local Router Advertisement / SLAAC proof cloud-prod-ipv6-ra-slaac-local-proof is done (2026-06-08), proved by make run-cloud-prod-ipv6-ra-slaac. The local GCE-style DHCPv6 address configuration proof cloud-prod-ipv6-dhcpv6-gce-config-local-proof is done (2026-06-08), proved by make run-cloud-prod-ipv6-dhcpv6-gce-config. The local ICMPv6 Echo Reply proof cloud-prod-icmpv6-echo-reply-local-proof is done (2026-06-08), proved by make run-cloud-prod-icmpv6-echo-reply. The local IPv6 TCP L4 proof cloud-prod-ipv6-tcp-l4-local-proof is done (2026-06-08), proved by make run-cloud-prod-ipv6-tcp-l4. The lane then sequences private GCE IPv6 and public IPv6 ingress/TLS policy tasks on top of that userspace-stack substrate.

IPv6 does not block the first public GCE Web UI proof while that proof remains scoped to IPv4 DHCP, ARP, Phase C L4, private GCE reachability, and reviewed public HTTPS ingress. It becomes relevant for a later dual-stack or IPv6-only cloud proof and for public IPv6 ingress policy.

Network Usability, Resolver, And Post-smoltcp Lane

The network usability backlog is Network Usability and Post-smoltcp. It records the user-facing work that starts after raw frames and the first userspace L4 proof: operator status tooling, DHCPv4 lease lifecycle, a typed system DnsResolver cap, POSIX getaddrinfo bridging, ping/ping6 diagnostics, socket readiness/cancel/backpressure semantics, packet trace authority, and transport policy/status.

Current boundaries are explicit there: the first local DHCP/IPv4 configuration proof is now done by cloud-prod-network-stack-dhcp-ipv4-config-local-proof and is on the first GCE Web UI critical path, while DHCP renewal/rebind/expiry, DNS option publication, and operator-visible lease status remain follow-up work. The local bounded ICMPv4 Echo Reply proof is also done by cloud-prod-icmp-echo-reply-local-proof, proved by make run-cloud-prod-icmp-echo-reply; it answers a bounded local same-subnet ping and rejects malformed or oversized requests, but it exercises ICMP protocol logic over an in-process QueuePhyDevice, not the real bound NIC. The real-NIC inbound path is now also done by cloud-prod-icmp-echo-reply-real-nic-datapath-local-proof, proved by make run-cloud-prod-icmp-echo-reply-real-nic-datapath: a kernel-owned responder on the legacy virtio 0.9 datapath acquires a DHCP lease over the real NIC, then receives an inbound Echo Request over the real RX vring and transmits an RFC 792 Echo Reply over the same NIC’s TX vring (a host peer over a QEMU socket netdev drives the inbound stimulus, since SLIRP drops inbound host->guest ICMP Echo). Both remain diagnostics rather than Web UI readiness; the real-NIC proof is the local pre-spend prerequisite for the billable private GCE ICMP proof and the same responder serves that live run. The POSIX DNS smoke is a hand-rolled A-query over UdpSocket, not a system resolver service or typed resolver capability. DNS, operator ping tools, IPv6, packet tracing, and advanced transport policy are usability/completeness lanes, not first public Web UI blockers unless a later deployment policy explicitly promotes one.

The backlog keeps smoltcp relocation (Phase C slices 7a-7c: run the selected smoltcp build in userspace, preserve the socket contract) distinct from transport policy/status (the capOS control plane around it). The selected userspace stack is smoltcp 0.13.0 and now has bounded local UDP socket-cap, TCP listener/socket-cap, sustained receive, and serve-from-userspace production socket-cap proofs. DHCPv4, DHCPv6, IPv6 L4, and ICMPv6 are explicit protocol proof lanes rather than ambient production readiness claims; retained qemu-only fixtures remain separate from the production cloudboot path.

The done IPv6 protocol proofs (cloud-prod-ipv6-dhcpv6-gce-config, cloud-prod-ipv6-tcp-l4) build their smoltcp interface on an in-process HarnessPhyDevice and self-declare metadata_only=true; the IPv6 datapath over the real bound NIC is now done by cloud-prod-ipv6-real-nic-datapath-local-proof, proved by make run-cloud-prod-ipv6-real-nic-datapath: a userspace smoltcp service on a real-Nic-backed phy (the IPv4 DHCP datapath NicPhyDevice pattern) learns the default route from a Router Advertisement, configures the GCE-shaped /128 via DHCPv6 Solicit/Advertise/Request/Reply, and completes one ICMPv6 Echo probe – every frame over Nic.transmit/Nic.receivePoll against a host peer on a QEMU socket netdev (SLIRP has no stateful DHCPv6 server). That proof records the real-NIC provenance with no metadata_only/in-process disclaimer and is the local pre-spend prerequisite for the billable private GCE IPv6 reachability proof.

No current capOS build enables socket-tcp-reno/socket-tcp-cubic, so capOS runs with CongestionControl::None by build configuration, not as a reviewed policy choice. The network-transport-policy-status-decomposition task records that audit and decomposes read-only transport status, keepalive/timeout policy inputs, and a deferred congestion-control evaluation gated on workload evidence.

Architecture

+--------------------------------------------------+
|  Application Process                             |
|    holds: TcpSocket cap, UdpSocket cap, ...      |
|    calls: connect(), send(), recv() via capnp    |
+---------------------------+----------------------+
                            | IPC (capnp messages)
+---------------------------v----------------------+
|  Network Stack Process (userspace)               |
|    smoltcp TCP/IP stack                          |
|    holds: NIC cap (from driver), Timer cap       |
|    implements: TcpSocket, UdpSocket, Dns caps    |
+---------------------------+----------------------+
                            | IPC (capnp messages)
+---------------------------v----------------------+
|  NIC Driver Process (userspace)                  |
|    virtio-net driver                             |
|    holds: DeviceMmio cap, Interrupt cap, DMAPool |
|    implements: Nic cap                           |
+---------------------------+----------------------+
                            | capability syscalls
+---------------------------v----------------------+
|  Kernel                                          |
|    DeviceMmio cap: maps BAR into driver process  |
|    Interrupt cap: routes virtio IRQ to driver    |
|    DMAPool cap: DMA-eligible frames w/o raw PAs  |
|    Timer cap: provides monotonic clock           |
+--------------------------------------------------+

Three separate processes, each with minimal authority:

  1. NIC driver — only has access to the specific virtio-net device registers, its interrupt line, and DMA-eligible frames. Implements the Nic interface.
  2. Network stack — holds the Nic capability from the driver. Runs smoltcp. Implements higher-level socket interfaces.
  3. Application — holds socket capabilities from the network stack. Cannot touch the NIC or raw packets directly.

Phase C prerequisites (beyond Phase B)

| Prerequisite | Owning gate | Why |
|---|---|---|
| Interrupt capability | DDF Task 5 + S.11.2 driver-transition gate | NIC driver receives IRQs without ambient authority |
| DeviceMmio capability | DDF Task 5 + S.11.2 driver-transition gate | NIC driver accesses device registers under bounded ownership |
| DMAPool capability | DDF Task 5 + S.11.1 invariants + S.11.2 gate | DMA-eligible frames without raw physical grants |
| Provider NIC smoke | DDF Task 6 | First end-to-end provider-driver path through reviewed userspace authority instead of the in-kernel ledger |

See DMA Isolation for the concrete invariants the three capabilities must satisfy and the Security Verification Track S.11.2 gate that unblocks moving the NIC driver out of the kernel. DDF Task 5 expands those invariants into a reviewable cap-table and ProcessSpawner manifest surface; DDF Task 6 is the first provider NIC smoke that consumes them end-to-end.

Current Phase C evidence includes the userspace virtio-net driver slices through the clean independent Nic.transmit/Nic.receive split, the 7a local userspace smoltcp substrate over that Nic cap, the 7b userspace UDP socket-cap layer, the 7c-i inter-process UdpSocket proof, the 7c-ii(a) inter-process TcpListener/TcpSocket proof, the sustained-receive TCP substrate, the 7c-ii(b) local serve-from-userspace production socket-cap proof, and retirement of the non-qemu legacy kernel socket grant path. The 7c-ii(b) proof starts the userspace network-stack process as the non-qemu cloudboot init process, spawns an application client with only Console plus a userspace-served TcpListenAuthority, and completes one local hostfwd TCP request/response through served TcpListener/TcpSocket caps. It is still narrower than the exit criteria below: the proof process keeps the existing DeviceMmio/DMAPool/Interrupt bring-up caps in-process until the future driver-service split, the long-lived service shape is still future work, and the selected GCE Web UI milestone now consumes the done DHCP/IPv4 configuration proof while still needing the local remote-session Web UI L4 proof, private GCE reachability, and the tracked Web UI hardening gates. The legacy kernel cap/network.rs / virtio_stub.rs socket route is fixture/negative-path cleanup territory, not the architecture to extend.

Phase C exit criteria

  • NIC driver runs in its own userspace process, holding only DeviceMmio, Interrupt, and DMAPool caps.
  • Network stack runs in a second userspace process, holding only the Nic cap from the driver and a Timer cap.
  • A successor socket-backed terminal or Web UI proof is rebuilt on the userspace network stack; the Phase B Telnet fixture is retired after kernel socket-owner removal.
  • The kernel contains no smoltcp dependency and no virtio-net code on the hot path.

Lower-layer capability schema (drafts — used by Phase C)

Phase B does not expose these to userspace; Phase C does. Timer is already implemented (see schema/capos.capnp).

Phase C track opened (2026-06-02). The Phase C Userspace NIC Driver Relocation design adopts this inline-Data frame ABI as-is (a DmaBuffer-handle zero-copy variant was considered and rejected to keep the change small; the frame stays in a kernel-owned bounce buffer the polled provider already proved). The methods carry the capOS result/reason/sideEffect evidence triple, and receive also reports the observed EtherType. See that doc for the cap-surface gap (no pending security ruling – the writable common-config window extends the accepted notify-doorbell selected-write discipline) and the bounded slice chain.

Slice 1 landed (2026-06-02). The unimplemented Nic interface below is now in schema/capos.capnp so the later coupled-TX/RX slices (3-4) extend it rather than introduce it; no CapObject implements it yet. Slice 1 (cloud-prod-nic-driver-userspace-features-ok-local-proof) also relocated the virtio device handshake to FEATURES_OK into a userspace driver shim over a writable selected-write common-config DeviceMmio window (the four handshake registers admitted on DeviceMmio.write32, queue-address writes fail closed); proof make run-cloud-prod-nic-driver-userspace-features-ok.

The landed Nic schema (inline Data + the capOS evidence triple):

interface Nic {
    transmit @0 (frame :Data)
        -> (result :Text, reason :Text, sideEffect :Text);
    receive  @1 ()
        -> (frame :Data, observedEthertype :UInt16,
            result :Text, reason :Text, sideEffect :Text);
    macAddress @2 () -> (addr :Data, result :Text, reason :Text, sideEffect :Text);
    linkStatus @3 () -> (up :Bool, result :Text, reason :Text, sideEffect :Text);
}

The driver relocation reuses the production DeviceMmio cap (a read-only BAR window with selected writes) and Interrupt cap (schema/capos.capnp) rather than the simplified map/wait sketches earlier drafts of this section used.
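Since receive returns both the frame and an observedEthertype, a consumer can cross-check the two. The sketch below shows how the EtherType is read from an untagged Ethernet II frame (bytes 12..14, big-endian). The helper names are hypothetical, the check itself is an illustration rather than something the source requires, and 802.1Q VLAN-tagged frames are not handled.

```rust
// Well-known EtherType values.
pub const ETHERTYPE_IPV4: u16 = 0x0800;
pub const ETHERTYPE_ARP: u16 = 0x0806;
pub const ETHERTYPE_IPV6: u16 = 0x86DD;

/// Reads the EtherType of an untagged Ethernet II frame: 6 bytes of
/// destination MAC, 6 bytes of source MAC, then the type, big-endian.
/// Returns `None` when the frame is too short to carry the field.
/// (Hypothetical helper name.)
pub fn ethertype(frame: &[u8]) -> Option<u16> {
    let field = frame.get(12..14)?;
    Some(u16::from_be_bytes([field[0], field[1]]))
}

/// Hypothetical consumer-side check that the EtherType reported next to
/// the frame agrees with the bytes actually delivered.
pub fn ethertype_matches(frame: &[u8], observed: u16) -> bool {
    ethertype(frame) == Some(observed)
}
```

A network-stack process could use such a check to drop frames whose reported type and contents disagree before handing them to smoltcp.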

Part 4: Cross-cutting

Userspace language runtimes that need sockets

Userspace language runtimes that map their stdlib socket APIs onto capOS capabilities consume the same TcpSocket/UdpSocket surface this proposal defines, so the Phase A-B kernel-resident state above is what their socket imports currently fail closed against:

  • The POSIX adapter (libcapos-posix/) already maps socket(AF_INET, SOCK_DGRAM, 0)/sendto/recvfrom/close onto the Phase B UdpSocket cap for the Phase P1.2 Phase B DNS resolver smoke; see Userspace Binaries and POSIX Adapter.
  • WASI Preview 1 sock_send / sock_recv route through the WASI host adapter on top of the same caps. Phase W.6 (sockets) remains blocked on socket authority surfacing through the wasm-host CapSet; the W.2 ERRNO_NOSYS refusal harness in Language Support Status and Plans (WASI / WebAssembly row) is the current evidence that no socket authority leaks before that gate.

Neither track changes the trust-boundary debt: socket-using userspace runtimes still depend on the kernel-resident smoltcp stack until Phase C relocates it.

TLS Layering

TLS does not live in this proposal: the TcpSocket here is the bottom of the transport stack; a TlsSocket wraps it and is configured from the certificate, trust-store, OCSP, and verifier caps defined in Certificates and TLS. Keys consumed by TLS come from Cryptography and Key Management.

Draft shape (tracked in the certificates proposal):

interface TlsSocket {
    # Client handshake: wrap an outbound TCP socket with a client config.
    connect @0 (tcp :TcpSocket, config :TlsClientConfig) -> ();
    # Server handshake: accept on a TCP socket with a server config.
    accept  @1 (tcp :TcpSocket, config :TlsServerConfig) -> ();
    send    @2 (data :Data) -> (bytesSent :UInt32);
    recv    @3 (maxLen :UInt32) -> (data :Data);
    close   @4 () -> ();
    peerCertificate @5 () -> (chain :CertificateChain);
    alpnSelected    @6 () -> (protocol :Text);
}

Open Questions

  1. DMA memory management. Dedicated DmaAllocator capability vs extending FrameAllocator with allocDma?
  2. Socket readiness model. Phase B uses blocking accept/recv calls for the demo. The long-term interface still needs a readiness/poll or cancellation shape for multiplexed services.
  3. Buffer ownership. Copy into IPC message vs shared memory vs capability lending?
